Training English-German WMT15 NMT engine

jean.senellart · December 23, 2016, 5:19pm

Create project directory:

$ mkdir wmt15-ende
$ cd wmt15-ende

Download the WMT15 corpus:

$ wget "https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz"

Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.121.18
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.121.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 473813197 (452M) [application/x-compressed]
Saving to: 'wmt15-de-en.tgz'

100%[==============================================================>] 473,813,197 3.59MB/s   in 2m 2s

2016-12-23 17:04:31 (3.71 MB/s) - 'wmt15-de-en.tgz' saved [473813197/473813197]

$ tar xzf wmt15-de-en.tgz

Get OpenNMT:

$ git clone https://github.com/OpenNMT/OpenNMT.git

Cloning into 'OpenNMT'...
remote: Counting objects: 6117, done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 6117 (delta 19), reused 0 (delta 0), pack-reused 6070
Receiving objects: 100% (6117/6117), 14.21 MiB | 664.00 KiB/s, done.
Resolving deltas: 100% (4101/4101), done.

$ cd OpenNMT
$ luarocks install tds

Tokenize the corpus:

$ for f in ../wmt15-de-en/*.?? ; do th tools/tokenize.lua < $f > $f.tok ; done

Tokenization completed in 382.013 seconds - 2399123 sentences
Tokenization completed in 348.492 seconds - 2399123 sentences
Tokenization completed in 385.031 seconds - 1920209 sentences
Tokenization completed in 304.141 seconds - 1920209 sentences
Tokenization completed in 40.293 seconds - 216190 sentences
Tokenization completed in 32.668 seconds - 216190 sentences
Tokenization completed in 0.434 seconds - 3000 sentences
Tokenization completed in 0.417 seconds - 3000 sentences

Concatenate commoncrawl, europarl and news-commentary:

$ for l in en de ; do cat ../wmt15-de-en/commoncrawl.de-en.$l.tok ../wmt15-de-en/europarl-v7.de-en.$l.tok ../wmt15-de-en/news-commentary-v10.de-en.$l.tok > ../wmt15-de-en/wmt15-all-de-en.$l.tok ; done

Preprocess the corpus:

$ th preprocess.lua -train_src ../wmt15-de-en/wmt15-all-de-en.en.tok -train_tgt ../wmt15-de-en/wmt15-all-de-en.de.tok -valid_src ../wmt15-de-en/newstest2013.en.tok -valid_tgt ../wmt15-de-en/newstest2013.de.tok -save_data ../wmt15-de-en/wmt15-all-en-de

Building source vocabulary...
Created dictionary of size 50004 (pruned from 1110036)   
Building target vocabulary...   
Created dictionary of size 50004 (pruned from 2158804)  
... 100000 sentences prepared   
... 200000 sentences prepared   
... 300000 sentences prepared   
... 400000 sentences prepared   
... 500000 sentences prepared   
[...]
... 4100000 sentences prepared  
... 4200000 sentences prepared  
... 4300000 sentences prepared  
... 4400000 sentences prepared  
... 4500000 sentences prepared  
... shuffling sentences 
... sorting sentences by size   
Prepared 4143915 sentences (391607 ignored due to length == 0 or > 50)  

Preparing validation data...
... shuffling sentences 
... sorting sentences by size   
Prepared 2891 sentences (109 ignored due to length == 0 or > 50)

Saving source vocabulary to '../wmt15-de-en/wmt15-all-en-de.src.dict'...   
Saving target vocabulary to '../wmt15-de-en/wmt15-all-en-de.tgt.dict'...   
Saving data to '../wmt15-de-en/wmt15-all-en-de-train.t7'...
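The counts in this log can be cross-checked against the tokenization output above: the three training corpora add up to the number of pairs that preprocess.lua either prepared or ignored. Below is a minimal Python sketch of that arithmetic, followed by a length filter of the kind the log message describes. The corpus names are inferred from the alphabetical order of the tokenized files, the function name keep_pair is hypothetical, and the filtering rule (drop a pair if either side is empty or longer than 50 tokens) is an assumption based on the log text, not OpenNMT's actual code.

```python
# Sentence counts reported by tools/tokenize.lua in the log above
# (corpus names inferred from the alphabetical order of the files).
train_counts = {
    "commoncrawl": 2399123,
    "europarl-v7": 1920209,
    "news-commentary-v10": 216190,
}
total_train = sum(train_counts.values())

# Counts reported by preprocess.lua for training and validation data.
prepared_train, ignored_train = 4143915, 391607
prepared_valid, ignored_valid = 2891, 109

MAX_LEN = 50  # from "ignored due to length == 0 or > 50"


def keep_pair(src_tokens, tgt_tokens, max_len=MAX_LEN):
    """Hypothetical filter: keep a pair only if both sides have 1..max_len tokens."""
    return 0 < len(src_tokens) <= max_len and 0 < len(tgt_tokens) <= max_len


print("training pairs before filtering:", total_train)
print("prepared + ignored:", prepared_train + ignored_train)
print("share of training pairs ignored (%):",
      round(100.0 * ignored_train / total_train, 1))
```

The same check works for the validation set: 2891 prepared plus 109 ignored gives the 3000 sentences of newstest2013.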

Launch the training on the first GPU (check which GPU is available using nvidia-smi):

$ th train.lua -data ../wmt15-de-en/wmt15-all-en-de-train.t7 -save_model ../wmt15-de-en/wmt15-all-en-de -gpuid 1

Loading data from '../data/wmt15-de-en/wmt15-all-en-de-train.t7'...
 * vocabulary size: source = 50004; target = 50004
 * additional features: source = 0; target = 0  
 * maximum sequence length: source = 50; target = 51
 * number of training sentences: 4143915
 * maximum batch size: 64
Building model...
 * using input feeding  
Initializing parameters...
 * number of parameters: 84814004
Preparing memory optimization...
 * sharing 69% of output/gradInput tensors memory between clones
Start training...

Epoch 1 ; Iteration 50/64773 ; Learning rate 1.0000 ; Source tokens/s 166 ; Perplexity 186592.55
[...]

Wait… (about 2 days on a server with a recent GPU card)

Dakun · January 3, 2017, 3:45pm

Continue to decode with:

th translate.lua -model /path/to/trained/model/wmt15-all-en-de_epoch13_7.28.t7 -src /path/to/testset/newstest2014.en.tok -output pred.txt -gpuid 1

Case-sensitive BLEU evaluation on newstest2014:

OpenNMT version 9078243cf8 with all default parameters:
    Corpus Prep: src&tgt vocab size: 50K; src&tgt seq length: 50; shuffle: yes; seed: 3435
    Training: 2 layers; RNN 500; word vec size 500; input feed: yes; 13 epochs
    Training speed: about 430 min/epoch (GeForce GTX 1080)
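
For a rough idea of the total training time on this card, the two figures above can simply be multiplied. This is a back-of-the-envelope Python sketch that assumes every epoch takes about the same time; it is only arithmetic on the numbers in this post.

```python
MINUTES_PER_EPOCH = 430  # "about 430 min/epoch (GeForce GTX 1080)"
EPOCHS = 13              # "13 epochs"

total_minutes = MINUTES_PER_EPOCH * EPOCHS
total_hours = total_minutes / 60
total_days = total_hours / 24

print(f"{total_minutes} min = {total_hours:.1f} h = {total_days:.1f} days")
```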

vince62s · January 8, 2017, 9:08am

I am seeing the same “curve” for EN-FR, and it is not good for these corpora.

You need to start decay at epoch 6 instead of the default 9.

You will get slightly better results, I guess.
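
To see what moving the decay start changes, here is a small Python sketch of the learning-rate schedule over 13 epochs. It assumes that the rate starts at 1.0 (as in the training log above) and is multiplied by a fixed factor after every epoch from the decay start onward. The factor 0.5, the function name schedule, and the exact epoch convention are assumptions for illustration only, so check the train.lua options of your OpenNMT version before relying on them.

```python
def schedule(start_decay_at, epochs=13, lr=1.0, decay=0.5):
    """Learning rate used in each epoch under a hypothetical simplified rule.

    The rate is kept constant up to and including epoch `start_decay_at`,
    then multiplied by `decay` before each later epoch.
    """
    rates = []
    for epoch in range(1, epochs + 1):
        rates.append(lr)
        if epoch >= start_decay_at:
            lr *= decay
    return rates


default = schedule(9)  # decay starting at the default epoch
earlier = schedule(6)  # decay starting at epoch 6, as suggested here

for epoch, (a, b) in enumerate(zip(default, earlier), start=1):
    print(f"epoch {epoch:2d}: start at 9 -> {a:.4f}   start at 6 -> {b:.4f}")
```

Under these assumptions, starting at epoch 6 leaves seven epochs of decreasing learning rate instead of four.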

dwmcqueen · January 14, 2017, 11:13pm

Hi - the file “https://s3.amazonaws.com/opennmt-models/wmt15-de-en.tgz” returns “permission denied” when I try to retrieve it.

jean.senellart · January 15, 2017, 9:02am

@dwmcqueen - thanks for your feedback. The file was moved to https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz and I have fixed the tutorial. Can you try again?

baohua · January 24, 2017, 7:54am

Hi, when I tokenized the corpus, I got the error message below:
"[thread 1 callback] ./tools/utils/unicode.lua:4: module ‘bit32’ not found:"
Can anyone kindly tell me what the problem is? Thanks.

Etienne38 · January 24, 2017, 8:12am

Try:
luarocks install bit32

baohua · January 24, 2017, 3:42pm

Thanks a lot, it works now.

livenletdie · February 8, 2017, 1:44am

When I test the accuracy of the model on the newstest2013 dataset, the perplexity scores I see differ from what is reported above.

th translate.lua -model models/onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src …/newstest/dev/newstest2013.en.tok -tgt …/newstest/dev/newstest2013.de.tok -output pred.txt -gpuid 1 | grep PPL
[02/08/17 01:27:08 INFO] PRED AVG SCORE: -0.4954, PRED PPL: 1.6412
[02/08/17 01:27:08 INFO] GOLD AVG SCORE: -2.2298, GOLD PPL: 9.2984

What is the difference between PRED PPL and GOLD PPL? Neither one seems to be the 7.19 score shown above.
Any ideas on what might be different or wrong in my setup to produce these different scores?
Thanks for the help.
-Ganesh

guillaumekln · February 8, 2017, 5:05pm

PRED PPL is the perplexity of the model’s own predictions.
GOLD PPL is the perplexity of the gold data according to your model.

As you ran the translation on the validation set, GOLD PPL should be 7.19. However, perplexity is not computed the same way during training and translation. I think the difference is mostly due to the padding that is taken into account during training but not during translation.
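
The numbers in the log quoted above are consistent with perplexity being the exponential of the negated average score. A minimal Python sketch, assuming that AVG SCORE is an average natural-log probability per token (an assumption inferred from the reported numbers, not taken from the OpenNMT source); the function name perplexity is only illustrative:

```python
import math


def perplexity(avg_score):
    """Perplexity from an average per-token log-probability (natural log)."""
    return math.exp(-avg_score)


# AVG SCORE values copied from the translate.lua log in the question.
pred_ppl = perplexity(-0.4954)
gold_ppl = perplexity(-2.2298)

print(f"PRED PPL ~ {pred_ppl:.4f} (log reports 1.6412)")
print(f"GOLD PPL ~ {gold_ppl:.4f} (log reports 9.2984)")
```

Small differences in the last digits are expected, because the logged average scores are rounded to four decimals.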

livenletdie · February 9, 2017, 4:12pm

Thank you for the response. So just to be sure, I should be looking at GOLD PPL, not so much the PRED PPL for my model’s accuracy?
Thanks again for the explanation.
Ganesh

guillaumekln · February 9, 2017, 4:20pm

Yes, but when you have gold data you usually care more about the BLEU score. PRED PPL is not very useful because it is expected that the model has high confidence in its own predictions.

livenletdie · February 9, 2017, 4:22pm

Does the translate.lua script (or any other script in ONMT) print the BLEU score given the gold data?

guillaumekln · February 9, 2017, 4:23pm

No, it does not. You just need to get the multi-bleu.perl script:

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

#!/usr/bin/env perl
#
# This file is part of moses.  Its use is licensed under the GNU Lesser General
# Public License version 2.1 or, at your option, any later version.

# $Id$
use warnings;
use strict;

my $lowercase = 0;
if ($ARGV[0] eq "-lc") {
  $lowercase = 1;
  shift;
}

my $stem = $ARGV[0];
if (!defined $stem) {
  print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n";
  print STDERR "Reads the references from reference or reference0, reference1, ...\n";
  exit(1);

This file has been truncated.
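
As the usage message in the script says, it takes the reference file as an argument and reads the hypothesis (for example pred.txt) on standard input. To illustrate what kind of number it reports, here is a simplified Python sketch of corpus-level BLEU with a single reference per sentence: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. This is an illustration written for this thread, not a port of multi-bleu.perl; the function names are hypothetical, and reported scores should come from the script itself.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per sentence.

    Both arguments are lists of tokenized sentences (lists of tokens).
    """
    correct = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            total[n - 1] += sum(hyp_counts.values())
            # Clip each hypothesis n-gram count by its count in the reference.
            correct[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
    if hyp_len == 0 or min(correct) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(correct, total)) / max_n
    # Penalize hypotheses that are shorter than the reference.
    if hyp_len >= ref_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


ref = ["the cat sat on the mat".split()]
same = corpus_bleu(ref, ref)
shorter = corpus_bleu(["the cat sat on the".split()], ref)
print(same, round(shorter, 2))
```

In the toy example, an exact match scores 100, while the truncated hypothesis has perfect n-gram precision but is lowered by the brevity penalty.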

livenletdie · February 9, 2017, 4:27pm

Thanks - that is super helpful. I will take a look.

Thanks
Ganesh

imnetizen · February 13, 2017, 10:27am

Hello

I’ve followed the helpful instructions in this tutorial, but I have a problem.
(I also added the -mode aggressive option during tokenization, to match the parameters of the released model.)

The problem I have is similar to what livenletdie had above.
I tried to evaluate the released model with the command below:

th translate.lua -model onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src data/wmt15-de-en/newstest2013.en.tok -tgt data/wmt15-de-en/newstest2013.de.tok -output pred_new.txt -gpuid 3

What I’ve got in the end is:
[02/13/17 19:12:38 INFO] PRED AVG SCORE: -0.46, PRED PPL: 1.59
[02/13/17 19:12:38 INFO] GOLD AVG SCORE: -inf, GOLD PPL: inf

I presumed that something was wrong with the gold data “newstest2013.de” and retried the preprocessing steps several times, but could not solve the problem.
(I get bad scores on the gold data, sometimes worse than -100.)

Are there any possible solutions?

Thanks in advance for your kind help.

guillaumekln · February 21, 2017, 12:49pm

Oh, there was actually a small error when reporting the final score on the gold data. You may want to update the project and retry.

https://github.com/OpenNMT/OpenNMT/commit/af47fa34710eb35d98e5b162ff0917aafc1b3411

imnetizen · February 22, 2017, 11:02pm

It works now!
Thank you very much.

guillaumekln · February 23, 2017, 4:54pm

4 posts were split to a new topic: Issues when running the English-German WMT15 training

lifeng · March 7, 2017, 10:01am

Hi, how can I create the -valid_src/-valid_tgt files used with the preprocess.lua command? Are they required by the command? Can I just use a subset of the -train_src/-train_tgt data for them? Thanks!