Multi-way Machine Translation

I have been trying to train a multi-way model after seeing this post. The model trained without any errors, but during inference, when I give it a language pair that was unseen during training, the model is not able to translate. I used the EN-DE and EN-IT pairs during training, and at inference I am trying to translate between DE-IT, IT-DE, EN-IT, etc.
The data along with the generated test sets are attached below : zero-shot – Google Drive

my yml config:
model_dir: /content/drive/MyDrive/Data/en-af

data:
  train_features_file: /content/drive/MyDrive/Data/en-af/train.tag.src
  train_labels_file: /content/drive/MyDrive/Data/en-af/train.tag.trg
  eval_features_file: /content/drive/MyDrive/Data/en-af/valid.tag.src
  eval_labels_file: /content/drive/MyDrive/Data/en-af/valid.tag.trg
  source_vocabulary: /content/drive/MyDrive/Data/en-af/src.vocab.vocab
  target_vocabulary: /content/drive/MyDrive/Data/en-af/trg.vocab.vocab
  source_tokenization:
    type: SentencePieceTokenizer
    model: /content/drive/MyDrive/Data/en-af/src.vocab.model
  target_tokenization:
    type: SentencePieceTokenizer
    model: /content/drive/MyDrive/Data/en-af/trg.vocab.model

train:
  batch_size: 16
  save_checkpoints_steps: 1000
  maximum_features_length: 50
  maximum_labels_length: 50
  batch_size: 4096
  max_step: 50000
  save_summary_steps: 100

eval:
  batch_size: 16
  steps: 1000
  external_evaluators: BLEU
  export_format: saved_model

params:
  average_loss_in_time: true
  minimum_decoding_length: 1

infer:
  batch_size: 32

Any kind of help would be greatly appreciated.
Thank You

We implemented this in an OpenNMT-py PR that was not merged.


Dear Amartya,

To build a multilingual NMT model, one way you can prepare your data is as follows:

Source Target
<ar> Thank you very much شكرا جزيلا
<es> Thank you very much Muchas gracias
<fr> Thank you very much Merci beaucoup
<hi> Thank you very much आपका बहुत बहुत धन्यवाद
<ar> आपका बहुत बहुत धन्यवाद شكرا جزيلا
<en> आपका बहुत बहुत धन्यवाद Thank you very much
<es> आपका बहुत बहुत धन्यवाद Muchas gracias
<fr> आपका बहुत बहुत धन्यवाद Merci beaucoup
<ar> Muchas gracias شكرا جزيلا
<en> Muchas gracias Thank you very much
<fr> Muchas gracias Merci beaucoup
<hi> Muchas gracias आपका बहुत बहुत धन्यवाद
<en> شكرا جزيلا Thank you very much
<es> شكرا جزيلا Muchas gracias
<fr> شكرا جزيلا Merci beaucoup
<hi> شكرا جزيلا आपका बहुत बहुत धन्यवाद
<ar> Merci beaucoup شكرا جزيلا
<en> Merci beaucoup Thank you very much
<es> Merci beaucoup Muchas gracias
<hi> Merci beaucoup आपका बहुत बहुत धन्यवाद
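The tagging step behind the table above can be sketched in a few lines of Python; this is a minimal illustration (the `tag_pairs` helper and the sample pairs are made up for the example), not a production preprocessing script:

```python
def tag_pairs(pairs, tgt_lang):
    """Prepend the target-language token to each source sentence."""
    return [(f"<{tgt_lang}> {src}", tgt) for src, tgt in pairs]

# Tiny in-memory stand-ins for parallel corpora.
en_fr = [("Thank you very much", "Merci beaucoup")]
en_es = [("Thank you very much", "Muchas gracias")]

# Tag each direction with its target language and concatenate.
corpus = tag_pairs(en_fr, "fr") + tag_pairs(en_es, "es")
print(corpus[0][0])  # <fr> Thank you very much
```

The same function is reused for every translation direction, so adding a new direction only means tagging another pair of files.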

• Before training, make sure you shuffle the data.

• During vocabulary preparation, you have to add these prefix language tokens to your SentencePiece model through the option --user_defined_symbols.
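For example, the spm_train invocation could be assembled like this (the file paths and vocabulary size are placeholders; only the --user_defined_symbols flag is the point):

```python
# Build an spm_train command that registers the language tags as
# user-defined symbols, so each tag becomes a single vocabulary entry.
lang_tags = ["<ar>", "<en>", "<es>", "<fr>", "<hi>"]

cmd = (
    "spm_train"
    " --input=train.src,train.trg"      # placeholder training files
    " --model_prefix=spm"               # placeholder output prefix
    " --vocab_size=32000"               # placeholder vocabulary size
    f" --user_defined_symbols={','.join(lang_tags)}"
)
print(cmd)
```

Without this, SentencePiece would split a tag like `<fr>` into several pieces instead of keeping it as one token.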

• If the data is clearly unbalanced, like you have 75 million sentences for Spanish and 15 million sentences for French, you have to balance it; otherwise, you would end up with a system that translates Spanish better than French. This technique is called over-sampling (or up-sampling). The obvious way to achieve it in NMT toolkits is through giving weights to your datasets. In this example, the Spanish dataset can take the weight of 1 while the French can take the weight of 5 because your Spanish dataset is 5 times larger than your French dataset.
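The weights in the example above can be derived mechanically by scaling each corpus against the largest one; a hedged sketch (sizes taken from the example, integer division assumed for simplicity):

```python
# Over-sampling weights: the largest corpus gets weight 1, and smaller
# corpora get proportionally larger weights so they are seen as often.
sizes = {"es": 75_000_000, "fr": 15_000_000}
largest = max(sizes.values())
weights = {lang: largest // size for lang, size in sizes.items()}
print(weights)  # {'es': 1, 'fr': 5}
```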

• At inference time, if you want to translate English to French, you augment your input with the target code, for example: “<fr> Thank you very much”. Similarly, if you would like to translate Spanish to Hindi, your input will be “<hi> Muchas gracias”.

I hope this helps.

All the best,


Now that’s a pretty clear way of explaining it!

I’m curious to know whether training such a model gives better results at translation time than training an individual model for each language pair.

I believe I saw a paper somewhere that stated that when doing many-to-one translation, the results were better if you had fewer than some threshold of segments (billions, or whatever it was).

But I have never seen anything about many-to-many.

Best regards,

Hi Samuel!

The main reason why people build multilingual models is scalability. If they want an NMT system that can translate between 100 languages, they would otherwise need a bilingual model for each direction, i.e. 100 × 99 = 9,900 models.

Regarding quality, according to the mBART paper, multilingual models are more useful for low-resource and medium-resource languages than they are for rich-resource languages (cf. Table 1 & Table 2).

In this WMT 21 shared task submission, the team reported that “Multilingual fine-tuning is better than bilingual fine-tuning” (cf. Section 4.1.3).

Kind regards,


Hi @ymoslem, thanks for the reply. I did exactly what you said, but I am not getting proper translations during the inference phase. Also, during evaluation on the validation set, I found that there are some languages the model is not able to translate. You can check that from 1013-2024 the translation is completely wrong.

I would double-check all your tokenized files. If the basic languages are not working, chances are that something in your prep work is not going to plan.

Last time something like that happened to me, I was using the wrong tokenizer. It has also happened that I forgot to tokenize some of my training files but was detokenizing the output… the results were awful.

Best regards,

That’s a good piece of advice.

It would also help to shed some light on your data statistics and the vocabulary-building process.

Also, in the config file, do you have two batch_size values under train? Make sure of the batch_type option and select the value accordingly.
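For reference, a token-based setting might look like the following fragment (values are placeholders to illustrate the option, not a recommendation):

```yaml
train:
  batch_type: tokens   # interpret batch_size as a token count
  batch_size: 4096     # ~4096 tokens per batch, not 4096 sentences
```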

Kind regards,