Multilingual training experiments

We ran some multilingual training experiments with OpenNMT-py.

This first experiment covers 5 languages (English, French, Italian, German, Spanish), i.e. 20 directed pairs.

We used a “medium transformer” (6x768) and here are some example results:

Newstest13 (DE to FR):
Single pair model: 30.55
Google T: 28.25
Multilingual model: 30.40

Newstest19 (DE to FR):
Single pair model: 35.21
Google T: 32.18
Multilingual model: 34.60
Pivot with state-of-the-art (SOA) engines DE-EN/EN-FR: 34.12

Newstest14 (FR to EN):
Single pair model: 41.3
Google T: 38.79
Multilingual model: 39.0

It performs quite well: the multilingual model always beats pivoting through EN when EN is not in the pair.

The next step is to try with more languages.


Hi Vince,
Is this done by adding source and target language tokens, similar to how it's done here? Could you share additional details about the pre-processing and training procedure?

Regards

Hi,

Language token
We prepend a special token to each source, flagging the language of the target only (⦅_tgt_is_XX_⦆).
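As a minimal sketch of that tagging step (the token format follows the post; the helper name and example sentence are made up for illustration):

```python
# Sketch: prepend a target-language token to each source sentence.
# The token format ⦅_tgt_is_XX_⦆ is from the post; the rest is illustrative.

def tag_source(src_line: str, tgt_lang: str) -> str:
    """Flag the target language on the source side only;
    the target side of the training pair is left untouched."""
    return f"⦅_tgt_is_{tgt_lang}_⦆ {src_line}"

# Example: a DE->FR training pair keeps its German source,
# tagged with the French target language.
print(tag_source("Guten Morgen .", "FR"))
```

At inference time the same token steers the shared model toward the requested output language.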

Preprocessing
We use BPE (48k merge operations) learned on an aggregation of samples of the different languages and corpora.
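One way to build the aggregated input for a shared BPE model is to draw an equal-sized sample from each corpus before learning the merges. This is a sketch of that sampling step only (file paths and sample size are illustrative; the actual 48k-merge BPE learning would be done with an external tool such as subword-nmt on the resulting file):

```python
# Sketch: pool equal-sized random samples from several corpora so that
# no single language or corpus dominates the shared-BPE statistics.
# Paths and sizes are illustrative, not the authors' actual setup.
import random

def aggregate_samples(corpus_paths, lines_per_corpus, seed=0):
    """Draw up to `lines_per_corpus` random lines from each corpus,
    then shuffle the pooled result."""
    rng = random.Random(seed)
    pooled = []
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        pooled.extend(rng.sample(lines, min(lines_per_corpus, len(lines))))
    rng.shuffle(pooled)
    return pooled
```

The pooled lines would then be written to a single file and passed to the BPE learner.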

Training
Nothing special here: Transformer "medium" configuration (6 layers, 768 hidden dim, 3072 feed-forward), with shared encoder/decoder/generator parameters. Trained on 6 GPUs in FP16 mode with batches of approx. 50k tokens; results kept improving beyond 500k steps.


More info on EN-DE:
At the end of the multilingual training: NT14 EN-DE: 28.6
After 30k fine-tuning steps on EN-DE data only: 31.65
SOA single-pair EN-DE model: 32.64

Same comparison on NT18:
43.5 (multilingual) => 45.9 after fine-tuning (reference SOA single-pair: 47.8)

EN-DE was the pair with the biggest gap between the multilingual and single-pair models.

Quick question: for these 20 pairs, is the dataset balanced? I mean, does each pair contain the same number of parallel sentences?

Yes and no.
The data is based on Multiparacrawl and Europarl.
They do not contain the same number of segments, but we used corpus weights: 2 for Multiparacrawl vs 1 for Europarl.
We also included back-translated data for some pairs and not for others, to measure its impact. The weight for BT was 1 (same as Europarl).
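In spirit, that 2:1:1 weighting amounts to sampling each training segment from a corpus with probability proportional to its weight. A sketch (the corpus names and weights are from the thread; the sampler itself is illustrative, not OpenNMT-py's actual implementation):

```python
# Sketch: pick which corpus the next training segment comes from,
# with probability proportional to its weight
# (2 for Multiparacrawl vs 1 for Europarl and back-translation).
import random

def weighted_corpus_choice(corpora_weights, rng):
    """Pick a corpus name with probability proportional to its weight."""
    names = list(corpora_weights)
    weights = [corpora_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
weights = {"multiparacrawl": 2, "europarl": 1, "backtranslation": 1}
draws = [weighted_corpus_choice(weights, rng) for _ in range(10000)]
# Multiparacrawl should account for roughly half of the sampled segments.
```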

Did you try a deeper model to increase the capacity? It should be very beneficial for multilingual NMT. I am testing product keys with a multilingual and multi-domain model, but as it stands a deeper model seems the way to go.

Further research on deep and multilingual NMT could be helpful.