I have been trying to train a multi-way model after seeing this post. The model trained without any errors, but during inference, when I give it a language pair that was unseen during training, the model is not able to translate. I used the EN-DE and EN-IT pairs during training, and during inference I am trying to translate between DE-IT, IT-DE, EN-IT, etc.
The data, along with the generated test sets, is attached below: zero-shot – Google Drive
my yml config:
model_dir: /content/drive/MyDrive/Data/en-af
To build a multilingual NMT model, one way you can prepare your data is as follows:
| Source | Target |
| --- | --- |
| <ar> Thank you very much | شكرا جزيلا |
| <es> Thank you very much | Muchas gracias |
| <fr> Thank you very much | Merci beaucoup |
| <hi> Thank you very much | आपका बहुत बहुत धन्यवाद |
| <ar> आपका बहुत बहुत धन्यवाद | شكرا جزيلا |
| <en> आपका बहुत बहुत धन्यवाद | Thank you very much |
| <es> आपका बहुत बहुत धन्यवाद | Muchas gracias |
| <fr> आपका बहुत बहुत धन्यवाद | Merci beaucoup |
| <ar> Muchas gracias | شكرا جزيلا |
| <en> Muchas gracias | Thank you very much |
| <fr> Muchas gracias | Merci beaucoup |
| <hi> Muchas gracias | आपका बहुत बहुत धन्यवाद |
| <en> شكرا جزيلا | Thank you very much |
| <es> شكرا جزيلا | Muchas gracias |
| <fr> شكرا جزيلا | Merci beaucoup |
| <hi> شكرا جزيلا | आपका बहुत बहुत धन्यवाद |
| <ar> Merci beaucoup | شكرا جزيلا |
| <en> Merci beaucoup | Thank you very much |
| <es> Merci beaucoup | Muchas gracias |
| <hi> Merci beaucoup | आपका बहुत बहुत धन्यवाद |
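If it helps, here is a minimal Python sketch of that preparation step for the EN-DE / EN-IT setup from the question. The file names are placeholders, not taken from the attached data:

```python
# Minimal sketch: build a multi-directional training set by prepending
# the *target* language tag to every source sentence.
# File names below are placeholders; plug in your own corpora.

pairs = [
    ("en", "de", "train.en-de.en", "train.en-de.de"),
    ("de", "en", "train.en-de.de", "train.en-de.en"),
    ("en", "it", "train.en-it.en", "train.en-it.it"),
    ("it", "en", "train.en-it.it", "train.en-it.en"),
]

with open("train.multi.src", "w", encoding="utf-8") as src_out, \
     open("train.multi.tgt", "w", encoding="utf-8") as tgt_out:
    for src_lang, tgt_lang, src_path, tgt_path in pairs:
        with open(src_path, encoding="utf-8") as src_in, \
             open(tgt_path, encoding="utf-8") as tgt_in:
            for src_line, tgt_line in zip(src_in, tgt_in):
                # The target-language token tells the model which language to produce.
                src_out.write(f"<{tgt_lang}> {src_line.strip()}\n")
                tgt_out.write(tgt_line.strip() + "\n")

# Remember to shuffle train.multi.src / train.multi.tgt in parallel afterwards.
```

Including both directions of each pair is what gives the model a chance at zero-shot directions such as DE-IT later, and the two output files must stay line-aligned when you shuffle.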
• Before training, make sure you shuffle the data.
• During vocabulary preparation, you have to add these prefix language tokens to your SentencePiece model through the option --user_defined_symbols (see the SentencePiece sketch after this list).
• If the data is clearly unbalanced, e.g. you have 75 million sentences for Spanish and only 15 million for French, you have to balance it; otherwise, you will end up with a system that translates Spanish better than French. This technique is called over-sampling (or up-sampling). The obvious way to achieve it in NMT toolkits is by giving weights to your datasets. In this example, the Spanish dataset can take a weight of 1 while the French dataset takes a weight of 5, because the Spanish dataset is 5 times larger than the French one (a small sketch for deriving such weights also follows this list).
• At inference time, if you want to translate English to French, you augment your input with the target code, for example: “<fr> Thank you very much”. Similarly, if you would like to translate Spanish to Hindi, your input will be “<hi> Muchas gracias”.
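As a concrete illustration of the vocabulary bullet, here is a hedged sketch using the SentencePiece Python API; the file names, vocabulary size, and tag set are assumptions for this example:

```python
import sentencepiece as spm

# Train one shared SentencePiece model over the combined source+target text.
# The language tags must be declared as user-defined symbols so they are kept
# as single, atomic tokens instead of being split into pieces.
spm.SentencePieceTrainer.train(
    input="train.multi.src,train.multi.tgt",        # placeholder file names
    model_prefix="spm_multi",
    vocab_size=32000,                               # assumed value; tune for your data
    character_coverage=1.0,
    user_defined_symbols=["<en>", "<de>", "<it>"],  # one tag per target language
)

sp = spm.SentencePieceProcessor(model_file="spm_multi.model")
print(sp.encode("<de> Thank you very much", out_type=str))
# '<de>' should appear in the output as one piece, not split into '<', 'de', '>'.
```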
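And for the balancing bullet, a small sketch of deriving integer over-sampling weights from corpus sizes; the paths are placeholders, and how you pass the weights (e.g. a per-corpus weight field in an OpenNMT-py data config) depends on your toolkit:

```python
# Sketch: derive integer over-sampling weights so that smaller corpora are
# seen about as often as the largest one during training.
corpora = {
    "es": "train.es-en.es",   # placeholder paths
    "fr": "train.fr-en.fr",
}

def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

sizes = {lang: count_lines(path) for lang, path in corpora.items()}
largest = max(sizes.values())

# e.g. 75M Spanish vs 15M French -> weights es: 1, fr: 5
weights = {lang: max(1, round(largest / n)) for lang, n in sizes.items()}
print(weights)  # pass these as per-dataset weights in your training config
```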
I’m curious to know whether training such a model gives better results at translation time than training an individual model for each language pair.
I believe I saw a paper somewhere that stated that for many-to-one setups the results were better if you had fewer than billions of segments, or whatever that threshold was.
The main reason people build multilingual models is scalability. If they want an NMT system that translates between 100 languages, they would need on the order of 100×99 (nearly 10,000) bilingual models, one per translation direction.
Regarding quality, according to the mBART paper, multilingual models are more useful for low-resource and medium-resource languages than they are for high-resource languages (cf. Table 1 & Table 2).
In this WMT 21 shared task submission, the team reported that “Multilingual fine-tuning is better than bilingual fine-tuning” (cf. Section 4.1.3).
Hi @ymoslem, thanks for the reply. I did exactly what you said, but I am not getting proper translations during the inference phase. Also, during evaluation on the validation set, I found that there are some languages the model is not able to translate. You can check that from lines 1013-2024 the translation is completely wrong.
I would double-check all your tokenized files. If the basic language pairs are not working… chances are that something in your prep work is not going to plan.
Last time something like that happened to me, I was using the wrong tokenizer. It has also happened that I had forgotten to tokenize some of my training files but was still detokenizing the output… the results were awful.
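If it is useful, here is a small sanity-check sketch along those lines, assuming a SentencePiece model and <en>/<de>/<it> tags as in the earlier examples; the file names are placeholders:

```python
import sentencepiece as spm

# Sanity check: each language tag must survive tokenization as one piece,
# and every tokenized source line should carry such a tag at the start.
# Paths and tag names are placeholders for your own setup.
sp = spm.SentencePieceProcessor(model_file="spm_multi.model")
tags = {"<en>", "<de>", "<it>"}

for tag, sample in [("<de>", "<de> Thank you very much"),
                    ("<it>", "<it> Thank you very much")]:
    pieces = sp.encode(sample, out_type=str)
    assert tag in pieces, f"{tag} was split by the tokenizer: {pieces[:4]}"
    print(pieces)

with open("train.multi.src.sp", encoding="utf-8") as f:  # tokenized source file
    for i, line in enumerate(f, 1):
        if not line.strip():
            continue
        head = line.split()[:2]
        if not any(tok in tags for tok in head):
            print(f"line {i}: no language tag at the start -> {line[:60]!r}")
```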