Training Romance Multi-Way model

(jean.senellart) #1

Multi-Way Neural Machine Translation Model

Following (Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder, Thanh-Le Ha 2016) and (Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation, Johnson 2016), this tutorial shows how to train a multi-source and multi-target NMT model.

We have chosen Spanish, Italian, French, Portuguese and Romanian: all 5 belong to the same “Romance” language family (we could have added Catalan). These languages are very close and share many properties and even vocabulary. The goal is therefore to train a single model that translates from any of these 5 languages to any of the other 4 (20 language pairs).

For the first experiment, we extracted 200,000 fully aligned sentences from the Europarl, GlobalVoices and TedTalk corpora (each sentence has its translation in the 4 other languages) - the source corpora are available from here.

Given the proximity between the languages, we use a common BPE model for all of them (see Neural Machine Translation of Rare Words with Subword Units, Sennrich 2016) - this reduces the size of the vocabulary and helps handle the translation of rare words.

We provide the corpus selection here containing the 2x20 training files (train-xxyy.{xx,yy}), 20 validation files (valid-xxyy.{xx,yy}), and 20 test files (test-xxyy.{xx,yy}). You can skip corpus preparation and directly get tokenized corpus from here.
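As a quick sanity check on the file naming, the 20 directed pairs and their 40 training files can be enumerated in shell. This is only an illustrative sketch: `expected_train_files` is a hypothetical helper name, and `DATA` is assumed to point at the directory holding the downloaded corpus (it falls back to the current directory here).

```shell
# Hypothetical helper: print the 40 expected training file names
# (20 directed language pairs x 2 sides).
expected_train_files() {
  local langs="es fr it pt ro" src tgt
  for src in $langs; do
    for tgt in $langs; do
      [ "$src" = "$tgt" ] && continue   # no es->es, fr->fr, ...
      echo "train-${src}${tgt}.${src}"
      echo "train-${src}${tgt}.${tgt}"
    done
  done
}

expected_train_files | wc -l    # 40 names

# Report any expected file that is absent from the corpus directory.
expected_train_files | while read -r f; do
  [ -e "${DATA:-.}/$f" ] || echo "missing: $f"
done
```

The same loop, with the `src = tgt` case skipped, is what the later preparation steps iterate over.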

Training BPE model

  • First, tokenize the training corpus (only the training files are used) so that a BPE model with a 32K vocabulary can be learned from it:
for f in ${DATA}/train*.?? ; do echo "--- tokenize $f for BPE" ; th tools/tokenize.lua < $f > $f.rawtok ; done
  • Train the BPE model using the learn_bpe.py script provided by Sennrich here:
cat ${DATA}/train*.rawtok | python learn_bpe.py -s 32000 > ${DATA}/esfritptro.bpe32000

Tokenization with the BPE model

  • We then retokenize the complete corpus (train, valid, test) with the BPE model, inserting the tokenization (joiner) annotations:
for f in ${DATA}/*-????.?? ; do echo "--- tokenize $f" ; th tools/tokenize.lua -joiner_annotate -bpe_model ${DATA}/esfritptro.bpe32000 < $f > $f.tok ; done

The corpus then looks like this:

  • (Spanish) Pens■ amos que tal vez es una mez■ c■ la de factores ■.
  • (French) Nous pensons que c ■’■ est peut-être une combin■ aison de facteurs ■.
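To make the annotation concrete: the ■ marker records that a token was attached to its neighbour on that side in the original text, which is what makes the tokenization reversible. The toolkit has its own detokenizer for this; the sed sketch below is only an illustrative approximation, assuming the joiner is exactly the ■ character as displayed in this post (the actual character in your files may differ). `strip_joiners` is a hypothetical helper name.

```shell
# Hypothetical helper: undo the joiner annotation by deleting each joiner
# together with the space that separates it from its neighbour.
JOINER="■"   # assumption: the marker as rendered in this post
strip_joiners() {
  sed -e "s/ ${JOINER}//g" -e "s/${JOINER} //g"
}

echo "Pens■ amos que tal vez es una mez■ c■ la de factores ■." | strip_joiners
# prints: Pensamos que tal vez es una mezcla de factores.
```

Applying the same helper to the French line gives back "c’est" and "combinaison" as single words.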

Adding language token

The trick in training a multi-way system is to tell the network, for each sentence pair, what the source and target languages are, so that we can control the target language at translation time. There are several ways to do that; here we add tokens to the source sentence marking the source and target languages. The source language token could be omitted, since the model learns to identify the source language automatically (see below).

In practice, we add the following tokens at the beginning of each source sentence: __opt_src_xx __opt_tgt_yy, as follows:

for src in es fr it pt ro ; do
  for tgt in es fr it pt ro ; do
    [ "$src" = "$tgt" ] && continue   # no same-language pairs
    perl -i.bak -pe "s/^/__opt_src_${src} __opt_tgt_${tgt} /" ${DATA}/*-${src}${tgt}.${src}.tok
  done
done

Each sentence pair now looks like this:

  • (Spanish) __opt_src_es __opt_tgt_fr Pens■ amos que tal vez es una mez■ c■ la de factores ■.
  • (French) Nous pensons que c ■’■ est peut-être une combin■ aison de facteurs ■.


  • The final preparation step is to gather the training corpus together:
for src in es fr it pt ro ; do
  for tgt in es fr it pt ro ; do
    [ "$src" = "$tgt" ] && continue   # no same-language pairs
    cat ${DATA}/train-${src}${tgt}.${src}.tok >> ${DATA}/train-multi.src.tok
    cat ${DATA}/train-${src}${tgt}.${tgt}.tok >> ${DATA}/train-multi.tgt.tok
  done
done
  • And prepare a random sample of 2000 sentence pairs for validation:
for src in es fr it pt ro ; do
  for tgt in es fr it pt ro ; do
    [ "$src" = "$tgt" ] && continue   # no same-language pairs
    cat ${DATA}/valid-${src}${tgt}.${src}.tok >> ${DATA}/valid-multi.src.tok
    cat ${DATA}/valid-${src}${tgt}.${tgt}.tok >> ${DATA}/valid-multi.tgt.tok
  done
done
# shuffle source and target together so the pairs stay aligned
paste ${DATA}/valid-multi.src.tok ${DATA}/valid-multi.tgt.tok | shuf > ${DATA}/valid-multi.srctgt.tok
head -2000 ${DATA}/valid-multi.srctgt.tok | cut -f1 > ${DATA}/valid-multi2000.src.tok
head -2000 ${DATA}/valid-multi.srctgt.tok | cut -f2 > ${DATA}/valid-multi2000.tgt.tok
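The validation sampling works because the two sides are glued together with a tab before shuffling and split again afterwards. A toy sketch of that idea, on made-up data, could look like the following; `demo_shuffle_pairs` is a hypothetical name, and the approach assumes that no sentence contains a tab character.

```shell
# Toy check that paste | shuf | cut keeps each source line with its target line.
# Assumption: no sentence contains a tab, since paste uses it as the separator.
demo_shuffle_pairs() {
  local tmp
  tmp=$(mktemp -d)
  printf 's1\ns2\ns3\ns4\n' > "$tmp/src"
  printf 't1\nt2\nt3\nt4\n' > "$tmp/tgt"
  paste "$tmp/src" "$tmp/tgt" | shuf > "$tmp/both"
  cut -f1 "$tmp/both" > "$tmp/src.shuf"
  cut -f2 "$tmp/both" > "$tmp/tgt.shuf"
  # Count lines where sN is no longer next to tN
  paste "$tmp/src.shuf" "$tmp/tgt.shuf" |
    awk -F'\t' 'substr($1, 2) != substr($2, 2) { bad++ } END { print bad + 0 }'
  rm -rf "$tmp"
}

demo_shuffle_pairs   # prints 0
```

Shuffling the two files separately would break the alignment, which is why the pairs are shuffled as single lines.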

Preprocess and Train

Here we simply train a 4-layer bidirectional RNN model (-brnn) with an RNN size of 1000 and a word embedding size of 600:

th preprocess.lua -train_src ${DATA}/train-multi.src.tok -train_tgt ${DATA}/train-multi.tgt.tok \
    -valid_src ${DATA}/valid-multi2000.src.tok -valid_tgt ${DATA}/valid-multi2000.tgt.tok \
    -save_data ${DATA}/esfritptro-multi

th train.lua -layers 4 -rnn_size 1000 -brnn -word_vec_size 600 -data ${DATA}/esfritptro-multi-train.t7 \
    -save_model ${DATA}/onmt_esfritptro-4-1000-600 -gpuid 1

Trained model is available here.

Evaluating the model and comparing with single trainings

The following table gives the BLEU score for each language pair; the second number compares it with the score obtained on the same language pair by a model trained only on that pair's data (200,000 sentences each).

It is interesting to see that the multi-way model systematically performs better than each of the individual systems trained with the same data. Since the corpus is fully aligned, the multi-way training does not add any new source or target sentence for a given language pair, so the systematic gain comes from the “rules” learnt or consolidated from the other languages.

(Sergey Zhitansky) #2


Very interesting experiment. Can you describe how you merged the sources so that each sentence has its translation in the 4 other languages? I found only separate language pairs on Opus for Europarl.

And can you explain which options have to be used with translate.lua to define the output language for a multilingual model?

Thanks !

(jean.senellart) #3

For the merging - we extracted all of the bitexts, aligned them on French, and kept the sentences that had translations in all (or almost all) of the 4 other languages to get a fully parallel corpus.

In translation, you just need to add the 2 tokens __opt_src_xx __opt_tgt_yy at the start of the sentence to specify the source and target languages.
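As a minimal sketch of that step, assuming the input has already been tokenized with the same BPE model as the training data: `add_lang_tokens` is a hypothetical helper, and the file and model names in the commented commands are placeholders rather than files from this tutorial.

```shell
# Hypothetical helper: prepend the language tokens to every tokenized line on stdin.
add_lang_tokens() {
  local src="$1" tgt="$2"
  sed "s/^/__opt_src_${src} __opt_tgt_${tgt} /"
}

echo "Pens■ amos que tal vez" | add_lang_tokens es fr
# prints: __opt_src_es __opt_tgt_fr Pens■ amos que tal vez

# Typical use (not run here; input.es and MODEL.t7 are placeholders):
#   th tools/tokenize.lua -joiner_annotate -bpe_model ${DATA}/esfritptro.bpe32000 < input.es \
#     | add_lang_tokens es fr > input.es.tok
#   th translate.lua -model MODEL.t7 -src input.es.tok -output output.fr.tok
```

Changing only the second argument (for example `add_lang_tokens es it`) is what selects a different output language for the same input.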

(Vincent Nguyen) #8

But then how do we control the target language at translation time ?

(jean.senellart) #9

I meant the source language token only - of course we do need the target language token as a selector - it was a typo.

Regarding source language token - in Johnson, 2016 - they say:

Note that we don’t specify the source language – the model will learn this automatically. Not specifying the source language has the potential disadvantage that words with the same spelling but different meaning from different source languages can be ambiguous to translate, but the advantage is that it is simpler and we can handle input with code-switching. We find that in almost all cases context provides enough language evidence to produce the correct translation.

(Vincent Nguyen) #10

also when you compare multi vs single:
for multi training did you use 50k vocab ie if same distribution over 5 languages about 10k vocab per language ?
single: did you take 10k or 50k vocab per language ?

(jean.senellart) #11

I am using the same BPE tokenizations in both cases - it is a 32K BPE so vocabulary size of 50K is covering everything.

(aurelien) #12


sorry for the newbie question but your confusion matrix score represents what?

  • The percentage of fully correctly translated sentences?
  • The percentage of words among sentences (at the correct place)?

(Vincent Nguyen) #13

these are BLEU scores.
google it:

(Daixie Wang) #14

Thank you @jean.senellart for a useful experiment. Just a side question: why are the two papers you posted so similar? What are the differences between them? Even their submission dates are close!

(jean.senellart) #15

yes - I agree the similarity and simultaneity of the papers is puzzling: a mystery of the research world - the ideas are exactly the same, and I would say that the paper from KIT goes a bit deeper in the analysis. The other paper has been trumpeted all over the internet by news media - so I just wanted to put a bit of balance here :)

(Thanh-Le Ha) #16

Hi Daixiewang, I am one of the KIT paper’s authors. We were also surprised with the coincidence. We submitted the paper to COLING (deadline in July) and unfortunately it was not accepted. We blamed ourselves since we submitted the wrong version with quite a lot of spelling & grammar mistakes and without the reference entries. And when we saw the Google paper on arxiv, we immediately submitted ours, that’s why the dates. At that time, part of the paper was submitted and accepted to another conference. Of course, we would like to continue following this direction.

Kudos to Jean for this post. As good news, we have started using OpenNMT because of its speed, stability and robustness. I just bumped into this and I am happy to see such big improvements on Romance languages. Thank you.

(jean.senellart) #17

Hi @thanhleha ! Thank you for your comment and congratulations on this good paper - I am sorry it did not get accepted in July! We are very glad that you are joining the OpenNMT community; we will do our best to keep improving features and speed, and we are looking forward to your next papers!

(conghuyuan) #18

Hello, I ran the script with the prepared data you provided, but the BLEU is very low and I don't know why. I just ran the script. Could you write out clear steps to follow? Also, in the script at line 70, the parameter ‘-nparallel’ should be ‘-nparallel’.

(Vincent Nguyen) #19

I did that script.
Yes there is a typo on that -nparallel parameter.
I will fix it.
The BLEU score is not as high as in the forum post because I used a smaller network to make the training fast.

If you want to replicate the BLEU score from the post, you need to change the network size, but then you will have to wait longer …

(conghuyuan) #20

So, can I get the BLEU results from the script?

(Vincent Nguyen) #21

Well, there are 20 different scores …
for instance:
FR-ES: 30.40
ES-FR: 29.14
PT-RO: 25.14
PT-ES: 33.32


One naive question: how do you test with a BPE-based NMT model? Is it like this: first apply BPE to the test src and tgt sentences as for the training data, then translate the test src subword sentences, and use the BLEU script to compare the tgt subword sentences with the translated subword sentences? But it does not seem right to compare subword units, since BLEU should be applied to words rather than subwords.

(Vincent Nguyen) #23

scoring is done at the word level.

check the recipe to understand how it’s done.

(Vincent Nguyen) #24

I will post a commit on this recipe because I used a detokenized BLEU score, whereas in the tutorial above the scores are calculated on tokenized output.

Doing the calculation on tokenized output, I get about the same scores as the tutorial with a smaller network:

test-esfr_multibleu.txt:BLEU = 32.78, 61.8/41.2/29.8/22.1 (BP=0.911, ratio=0.915, hyp_len=15036, ref_len=16436)
test-esit_multibleu.txt:BLEU = 28.64, 58.2/36.5/25.1/17.8 (BP=0.918, ratio=0.921, hyp_len=13906, ref_len=15091)
test-espt_multibleu.txt:BLEU = 34.52, 64.5/43.2/31.5/23.5 (BP=0.911, ratio=0.914, hyp_len=13004, ref_len=14221)
test-esro_multibleu.txt:BLEU = 27.21, 57.2/34.9/23.9/16.8 (BP=0.909, ratio=0.913, hyp_len=13279, ref_len=14550)
test-fres_multibleu.txt:BLEU = 32.09, 62.8/40.8/28.8/20.7 (BP=0.912, ratio=0.916, hyp_len=13762, ref_len=15027)
test-frit_multibleu.txt:BLEU = 27.34, 57.3/35.3/24.0/16.6 (BP=0.912, ratio=0.916, hyp_len=13509, ref_len=14751)
test-frpt_multibleu.txt:BLEU = 30.27, 61.1/38.6/26.9/19.0 (BP=0.915, ratio=0.918, hyp_len=13148, ref_len=14323)
test-frro_multibleu.txt:BLEU = 25.40, 55.1/32.5/21.6/14.9 (BP=0.922, ratio=0.924, hyp_len=13116, ref_len=14188)
test-ites_multibleu.txt:BLEU = 30.11, 61.6/39.2/27.3/19.4 (BP=0.896, ratio=0.901, hyp_len=13241, ref_len=14698)
test-itfr_multibleu.txt:BLEU = 31.57, 60.5/39.7/28.9/21.5 (BP=0.902, ratio=0.907, hyp_len=14686, ref_len=16193)
test-itpt_multibleu.txt:BLEU = 28.02, 60.2/36.9/24.8/17.0 (BP=0.900, ratio=0.905, hyp_len=13700, ref_len=15137)
test-itro_multibleu.txt:BLEU = 23.82, 53.8/31.2/20.7/14.0 (BP=0.903, ratio=0.907, hyp_len=13265, ref_len=14623)
test-ptes_multibleu.txt:BLEU = 35.34, 65.3/43.7/31.7/23.2 (BP=0.928, ratio=0.931, hyp_len=13751, ref_len=14772)
test-ptfr_multibleu.txt:BLEU = 34.03, 63.3/42.8/31.6/23.8 (BP=0.900, ratio=0.905, hyp_len=14882, ref_len=16451)
test-ptit_multibleu.txt:BLEU = 28.18, 57.2/35.8/24.6/17.2 (BP=0.924, ratio=0.927, hyp_len=13547, ref_len=14618)
test-ptro_multibleu.txt:BLEU = 28.54, 57.8/36.1/24.8/17.3 (BP=0.927, ratio=0.930, hyp_len=13536, ref_len=14561)
test-roes_multibleu.txt:BLEU = 32.67, 63.8/41.6/29.6/21.4 (BP=0.907, ratio=0.911, hyp_len=13592, ref_len=14914)
test-rofr_multibleu.txt:BLEU = 33.01, 61.6/41.1/30.0/22.3 (BP=0.916, ratio=0.919, hyp_len=15274, ref_len=16622)
test-roit_multibleu.txt:BLEU = 26.97, 56.5/35.0/23.9/16.9 (BP=0.902, ratio=0.907, hyp_len=13590, ref_len=14986)
test-ropt_multibleu.txt:BLEU = 30.33, 61.7/39.1/27.0/18.9 (BP=0.911, ratio=0.914, hyp_len=13790, ref_len=15082)
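For readers unfamiliar with this output format: each line gives the BLEU score, the four n-gram precisions (1-gram to 4-gram), and the brevity penalty BP derived from the hypothesis and reference lengths. The sketch below recomputes BP and BLEU from those parts, assuming the standard BLEU definition (geometric mean of the precisions times the brevity penalty); `bleu_from_parts` is a hypothetical helper, not part of the scoring script.

```shell
# Recompute brevity penalty and BLEU from the figures printed on one line.
# Arguments: four n-gram precisions (in %), hyp_len, ref_len.
bleu_from_parts() {
  awk -v p1="$1" -v p2="$2" -v p3="$3" -v p4="$4" -v hyp="$5" -v ref="$6" 'BEGIN {
    bp = (hyp < ref) ? exp(1 - ref / hyp) : 1               # brevity penalty
    gm = exp((log(p1) + log(p2) + log(p3) + log(p4)) / 4)   # geometric mean
    printf "BP=%.3f BLEU=%.2f\n", bp, bp * gm
  }'
}

# es->fr line from the listing above
bleu_from_parts 61.8 41.2 29.8 22.1 15036 16436
```

For the es-fr line this reproduces BP=0.911 and a BLEU within rounding of the reported 32.78 (the printed precisions are themselves rounded). Every BP in the listing is around 0.90 to 0.93, meaning the translations are consistently shorter than the references.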