Multilingual training experiments

We ran some multilingual training experiments with OpenNMT-py.

This first experiment covers 5 languages (English, French, Italian, German, Spanish), i.e. 20 directed pairs.

We used a “medium transformer” (6x768) and here are some example results:

Newstest13 (DE to FR):
Single pair model: 30.55
Google T: 28.25
Multilingual model: 30.40

Newstest19 (DE to FR):
Single pair model: 35.21
Google T: 32.18
Multilingual model: 34.60
Pivot with state-of-the-art (SOA) engines DE-EN/EN-FR: 34.12

Newstest14 (FR to EN):
Single pair model: 41.3
Google T: 38.79
Multilingual model: 39.0

It performs quite well: the multilingual model always beats pivoting through EN when EN is not in the pair.

The next step is to try with more languages.


Hi Vince,
Is this done by adding source and target language tokens, similar to how it's done here? Could you share additional details about the pre-processing and training procedure?

Regards

Hi,

Language token
We prepend a special token to each source, flagging the language of the target only (⦅_tgt_is_XX_⦆).
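As a minimal sketch of that tagging step (the token format follows the post; the helper name and example sentence are made up for illustration):

```python
# Sketch: prepend a target-language token to each source sentence.
# The token format ⦅_tgt_is_XX_⦆ is from the post; the rest is illustrative.

def tag_source(src_line: str, tgt_lang: str) -> str:
    """Flag the target language on the source side only;
    the target side of the training pair is left untouched."""
    return f"⦅_tgt_is_{tgt_lang}_⦆ {src_line}"

# Example: a DE->FR training pair keeps its German source,
# tagged with the French target language.
print(tag_source("Guten Morgen .", "FR"))
```

At inference time the same token steers the shared model toward the requested output language.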

Preprocessing
We use BPE (48k merge operations) learned on an aggregation of samples of the different languages and corpora.
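One way to build the aggregated input for a shared BPE model is to draw an equal-sized sample from each corpus before learning the merges. This is a sketch of that sampling step only (file paths and sample size are illustrative; the actual 48k-merge BPE learning would be done with an external tool such as subword-nmt on the resulting file):

```python
# Sketch: pool equal-sized random samples from several corpora so that
# no single language or corpus dominates the shared-BPE statistics.
# Paths and sizes are illustrative, not the authors' actual setup.
import random

def aggregate_samples(corpus_paths, lines_per_corpus, seed=0):
    """Draw up to `lines_per_corpus` random lines from each corpus,
    then shuffle the pooled result."""
    rng = random.Random(seed)
    pooled = []
    for path in corpus_paths:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        pooled.extend(rng.sample(lines, min(lines_per_corpus, len(lines))))
    rng.shuffle(pooled)
    return pooled
```

The pooled lines would then be written to a single file and passed to the BPE learner.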

Training
Nothing special here: Transformer "medium" configuration (6 layers, 768 hidden dim, 3072 feed-forward), with shared encoder/decoder/generator parameters. Trained on 6 GPUs in FP16 mode with batches of approx. 50k tokens; results kept improving beyond 500k steps.


More info on EN-DE:
At the end of the multilingual training: NT14 EN-DE: 28.6
After 30k fine-tuning steps on EN-DE data only: 31.65
SOA single-pair EN-DE model: 32.64

Same comparison on NT18:
43.5 (multilingual) => 45.9 after fine-tuning (reference SOA single-pair: 47.8)

EN-DE was the pair with the biggest gap between the multilingual and single-pair models.

Quick question: for these 20 pairs, is the dataset balanced? I mean, does each pair contain the same number of parallel sentences?

Yes and no.
The data is based on Multiparacrawl and Europarl.
They do not contain the same number of segments, but we used corpus weights: 2 for Multiparacrawl vs 1 for Europarl.
We also included back-translated data for some pairs and not for others, to measure its impact. The weight for BT was 1 (same as Europarl).
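In spirit, that 2:1:1 weighting amounts to sampling each training segment from a corpus with probability proportional to its weight. A sketch (the corpus names and weights are from the thread; the sampler itself is illustrative, not OpenNMT-py's actual implementation):

```python
# Sketch: pick which corpus the next training segment comes from,
# with probability proportional to its weight
# (2 for Multiparacrawl vs 1 for Europarl and back-translation).
import random

def weighted_corpus_choice(corpora_weights, rng):
    """Pick a corpus name with probability proportional to its weight."""
    names = list(corpora_weights)
    weights = [corpora_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
weights = {"multiparacrawl": 2, "europarl": 1, "backtranslation": 1}
draws = [weighted_corpus_choice(weights, rng) for _ in range(10000)]
# Multiparacrawl should account for roughly half of the sampled segments.
```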

Did you try a deeper model to increase the capacity? It should be very beneficial for multilingual NMT. I am testing product keys with a multilingual and multi-domain model, but as it stands a deeper model seems the way to go.

Further research on deep and multilingual NMT could be helpful.