Training English-German WMT15 NMT engine


(jean.senellart) #1
  • Create project directory:
$ mkdir wmt15-ende
$ cd wmt15-ende
  • Download wmt15 corpus:
$ wget "https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz"
Resolving s3.amazonaws.com (s3.amazonaws.com)… 54.231.121.18
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.121.18|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 473813197 (452M) [application/x-compressed]
Saving to: 'wmt15-de-en.tgz'

100%[==============================================================>] 473,813,197 3.59MB/s   in 2m 2s

2016-12-23 17:04:31 (3.71 MB/s) - 'wmt15-de-en.tgz' saved [473813197/473813197]
$ tar xzf wmt15-de-en.tgz
  • Get OpenNMT:
$ git clone https://github.com/OpenNMT/OpenNMT.git
Cloning into 'OpenNMT'...
remote: Counting objects: 6117, done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 6117 (delta 19), reused 0 (delta 0), pack-reused 6070
Receiving objects: 100% (6117/6117), 14.21 MiB | 664.00 KiB/s, done.
Resolving deltas: 100% (4101/4101), done.
$ cd OpenNMT
$ luarocks install tds
  • Tokenize the corpus:
$ for f in ../wmt15-de-en/*.?? ; do th tools/tokenize.lua < $f > $f.tok ; done
Tokenization completed in 382.013 seconds - 2399123 sentences
Tokenization completed in 348.492 seconds - 2399123 sentences
Tokenization completed in 385.031 seconds - 1920209 sentences
Tokenization completed in 304.141 seconds - 1920209 sentences
Tokenization completed in 40.293 seconds - 216190 sentences
Tokenization completed in 32.668 seconds - 216190 sentences
Tokenization completed in 0.434 seconds - 3000 sentences
Tokenization completed in 0.417 seconds - 3000 sentences
  • Concatenate commoncrawl, europarl and news-commentary:
$ for l in en de ; do cat ../wmt15-de-en/commoncrawl.de-en.$l.tok ../wmt15-de-en/europarl-v7.de-en.$l.tok ../wmt15-de-en/news-commentary-v10.de-en.$l.tok > ../wmt15-de-en/wmt15-all-de-en.$l.tok ; done
  • Preprocess the corpus:
$ th preprocess.lua -train_src ../wmt15-de-en/wmt15-all-de-en.en.tok -train_tgt ../wmt15-de-en/wmt15-all-de-en.de.tok -valid_src ../wmt15-de-en/newstest2013.en.tok -valid_tgt ../wmt15-de-en/newstest2013.de.tok -save_data ../wmt15-de-en/wmt15-all-en-de
Building source vocabulary...
Created dictionary of size 50004 (pruned from 1110036)   
Building target vocabulary...   
Created dictionary of size 50004 (pruned from 2158804)  
... 100000 sentences prepared   
... 200000 sentences prepared   
... 300000 sentences prepared   
... 400000 sentences prepared   
... 500000 sentences prepared   
[...]
... 4100000 sentences prepared  
... 4200000 sentences prepared  
... 4300000 sentences prepared  
... 4400000 sentences prepared  
... 4500000 sentences prepared  
... shuffling sentences 
... sorting sentences by size   
Prepared 4143915 sentences (391607 ignored due to length == 0 or > 50)  

Preparing validation data...
... shuffling sentences 
... sorting sentences by size   
Prepared 2891 sentences (109 ignored due to length == 0 or > 50)

Saving source vocabulary to '../wmt15-de-en/wmt15-all-en-de.src.dict'...   
Saving target vocabulary to '../wmt15-de-en/wmt15-all-en-de.tgt.dict'...   
Saving data to '../wmt15-de-en/wmt15-all-en-de-train.t7'...
  • Launch the training on the first GPU (check which GPU is available using nvidia-smi):
$ th train.lua -data ../wmt15-de-en/wmt15-all-en-de-train.t7 -save_model ../wmt15-de-en/wmt15-all-en-de -gpuid 1
Loading data from '../data/wmt15-de-en/wmt15-all-en-de-train.t7'...
 * vocabulary size: source = 50004; target = 50004
 * additional features: source = 0; target = 0  
 * maximum sequence length: source = 50; target = 51
 * number of training sentences: 4143915
 * maximum batch size: 64
Building model...
 * using input feeding  
Initializing parameters...
 * number of parameters: 84814004
Preparing memory optimization...
 * sharing 69% of output/gradInput tensors memory between clones
Start training...

Epoch 1 ; Iteration 50/64773 ; Learning rate 1.0000 ; Source tokens/s 166 ; Perplexity 186592.55
[...]
  • Wait… (about 2 days on a server with a recent GPU card)
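
Before the preprocessing step, it can be worth confirming that the concatenated source and target files have the same number of lines, since the two sides are paired line by line. A minimal sketch; check_parallel is a hypothetical helper, not an OpenNMT tool:

```shell
# check_parallel SRC TGT
# Succeed only if both files contain the same number of lines.
check_parallel() {
  src_lines=$(wc -l < "$1" | tr -d ' ')
  tgt_lines=$(wc -l < "$2" | tr -d ' ')
  if [ "$src_lines" -eq "$tgt_lines" ]; then
    echo "OK: $src_lines lines in both files"
  else
    echo "MISMATCH: $1 has $src_lines lines, $2 has $tgt_lines" >&2
    return 1
  fi
}

# Example call, using the file names from the concatenation step:
# check_parallel ../wmt15-de-en/wmt15-all-de-en.en.tok ../wmt15-de-en/wmt15-all-de-en.de.tok
```

The same check can be run on each corpus pair before concatenating, which makes it easier to find which file is off if the totals disagree.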

#2

Continue to decode with:

th translate.lua -model /path/to/trained/model/wmt15-all-en-de_epoch13_7.28.t7 -src /path/to/testset/newstest2014.en.tok -output pred.txt -gpuid 1

Case-sensitive BLEU evaluation on newstest2014:

OpenNMT version 9078243cf8 with all default parameters:
    Corpus Prep: src&tgt vocab size: 50K; src&tgt seq length: 50; shuffle: yes; seed: 3435
    Training: 2 layers; RNN 500; word vec size 500; input feed: yes; 13 epochs
    Training speed: about 430 min/epoch (GeForce GTX 1080)


(Vincent Nguyen) #3

I am having the same “curve” for EN-FR and this is not good for these corpora.

You need to start decay at epoch 6 instead of the default 9.

You will get slightly better results, I guess.
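
To see what moving the decay start from epoch 9 to epoch 6 does over a 13-epoch run, here is a rough sketch. It assumes an initial learning rate of 1.0 (as in the training log above), a decay factor of 0.5, and that the rate is multiplied by that factor at the end of every epoch from the start epoch onward. The factor and the exact timing are assumptions, so check them against your OpenNMT version. lr_schedule is a hypothetical helper, not part of OpenNMT.

```shell
# lr_schedule START_DECAY_AT [EPOCHS] [INITIAL_LR] [DECAY]
# Print "epoch learning_rate" for each epoch.
# Assumption: the rate is multiplied by DECAY after every epoch >= START_DECAY_AT.
lr_schedule() {
  awk -v start="$1" -v epochs="${2:-13}" -v lr="${3:-1.0}" -v decay="${4:-0.5}" '
    BEGIN {
      for (e = 1; e <= epochs; e++) {
        printf "%d %g\n", e, lr
        if (e >= start) lr *= decay
      }
    }'
}

lr_schedule 9   # default start epoch mentioned above
lr_schedule 6   # earlier start epoch suggested above
```

Under these assumptions the earlier start leaves the last epochs training at a much smaller rate, which is the intended effect when the validation curve flattens early.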


(Dwmcqueen) #4

Hi - the file “https://s3.amazonaws.com/opennmt-models/wmt15-de-en.tgz” says permissions denied when I try to retrieve it.


(jean.senellart) #5

@dwmcqueen - thanks for your feedback, the file was moved to https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz, I fixed the tutorial, can you try again?


(wang) #6

Hi, when I tokenized the corpus, I got error messages below:
"[thread 1 callback] ./tools/utils/unicode.lua:4: module ‘bit32’ not found:"
Can anyone kindly tell me what the problem is? Thanks.


(Etienne Monneret) #7

Try:
luarocks install bit32
:wink:


(wang) #8

Thanks a lot, it works now. :smiley:


(Ganesh) #9

When I test the accuracy of the model on the newstest2013 dataset, the perplexity scores I see differ from those reported above.

th translate.lua -model models/onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src …/newstest/dev/newstest2013.en.tok -tgt …/newstest/dev/newstest2013.de.tok -output pred.txt -gpuid 1 | grep PPL
[02/08/17 01:27:08 INFO] PRED AVG SCORE: -0.4954, PRED PPL: 1.6412
[02/08/17 01:27:08 INFO] GOLD AVG SCORE: -2.2298, GOLD PPL: 9.2984

What is the difference between PRED vs GOLD PPL? Neither one seems to be the 7.19 score shown above.
Any ideas on what might be different/wrong with my setup which is leading to these different scores?
Thanks for the help.
-Ganesh


(Guillaume Klein) #11
  • PRED PPL is the perplexity of the model’s own predictions
  • GOLD PPL is the perplexity of the gold data according to your model

As you ran the translation on the validation set, GOLD PPL should be 7.19. However, they are not computed the same way during the training and translation. I think the difference is mostly due to the padding that is taken into account during training but not during translation.
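
The two log lines quoted in the question are consistent with perplexity being the exponential of the negated average score: exp(2.2298) ≈ 9.30 and exp(0.4954) ≈ 1.64. A quick check follows; ppl_from_score is a hypothetical helper, and the formula is inferred from those numbers rather than quoted from the OpenNMT source.

```shell
# ppl_from_score AVG_SCORE
# Perplexity computed as exp(-average score), rounded to two decimals.
ppl_from_score() {
  awk -v s="$1" 'BEGIN { printf "%.2f\n", exp(-s) }'
}

ppl_from_score -2.2298   # GOLD AVG SCORE from the log above
ppl_from_score -0.4954   # PRED AVG SCORE from the log above
```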


(Ganesh) #12

Thank you for the response. So just to be sure, I should be looking at GOLD PPL, not so much the PRED PPL for my model’s accuracy?
Thanks again for the explanation.
Ganesh


(Guillaume Klein) #13

Yes, but when you have gold data you usually care more about the BLEU score. PRED PPL is not very useful because the model is expected to have high confidence in its own predictions.


(Ganesh) #14

Does the translate.lua script (or any other script in ONMT) print the BLEU score given the gold data?


(Guillaume Klein) #15

No, it does not. You just need to get the multi-bleu.perl script:


(Ganesh) #16

Thanks - that is super helpful. I will take a look.

Thanks
Ganesh


(Minjae Lee) #17

Hello

I followed the helpful instructions in this tutorial, but I have a problem.
(I also added the -mode aggressive option during tokenization to match the parameters of the released model.)

The problem I have is similar to what livenletdie had above.
I tried to evaluate the released model with the command below:

th translate.lua -model onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src data/wmt15-de-en/newstest2013.en.tok -tgt data/wmt15-de-en/newstest2013.de.tok -output pred_new.txt -gpuid 3

What I’ve got in the end is:
[02/13/17 19:12:38 INFO] PRED AVG SCORE: -0.46, PRED PPL: 1.59
[02/13/17 19:12:38 INFO] GOLD AVG SCORE: -inf, GOLD PPL: inf

I presumed that something was wrong with the gold data “newstest2013.de” and repeated the preprocessing steps several times, but could not solve the problem.
(I get bad scores on the gold data, sometimes beyond -100.)

Are there any possible solutions?

Thanks in advance for your kind help.


(Guillaume Klein) #18

Oh there was actually a small error when reporting the final score on the gold data. You may want to update the project and retry.


(Minjae Lee) #19

It works now!
Thank you very much.


(Guillaume Klein) #20

4 posts were split to a new topic: Issues when running the English-German WMT15 training


(Lifeng Dong) #22

Hi, how can I create the -valid_src/-valid_tgt files used with the preprocess.lua command? Are they required by the command? Can I just use a subset of the -train_src/-train_tgt files for them? Thanks!
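
One common way to do this, sketched here without confirmation from the thread, is to hold out a small slice of the parallel data so that the validation sentences are not also trained on. The hypothetical helper below sends every K-th line of a file to a validation file and the rest to a training file; running it with the same K on the source and target files keeps the two sides aligned. The file names in the example are placeholders.

```shell
# holdout_every K INPUT TRAIN_OUT VALID_OUT
# Every K-th line of INPUT goes to VALID_OUT, all other lines to TRAIN_OUT.
holdout_every() {
  awk -v k="$1" -v train="$3" -v valid="$4" \
    'NR % k == 0 { print > valid; next } { print > train }' "$2"
}

# Example (placeholder file names); use the same K for both languages:
# holdout_every 1000 corpus.en.tok train.en.tok valid.en.tok
# holdout_every 1000 corpus.de.tok train.de.tok valid.de.tok
```

Picking lines by position rather than taking the first N spreads the held-out pairs across all concatenated corpora instead of drawing them from only the first one.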