Training English-German WMT15 NMT engine


(Ganesh) #16

Thanks - that is super helpful. I will take a look.

Thanks
Ganesh


(Minjae Lee) #17

Hello

I followed the helpful instructions in this tutorial, but I have run into a problem.
(I also added the -mode aggressive option during tokenization, to match the parameters of the released model.)

My problem is similar to the one livenletdie reported above.
I tried to evaluate the released model with the command below:

th translate.lua -model onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src data/wmt15-de-en/newstest2013.en.tok -tgt data/wmt15-de-en/newstest2013.de.tok -output pred_new.txt -gpuid 3

What I get in the end is:
[02/13/17 19:12:38 INFO] PRED AVG SCORE: -0.46, PRED PPL: 1.59
[02/13/17 19:12:38 INFO] GOLD AVG SCORE: -inf, GOLD PPL: inf

I presumed that something was wrong with the gold data “newstest2013.de” and repeated the preprocessing steps several times, but could not solve the problem.
(I get bad scores on the gold data, sometimes worse than -100.)

Are there any possible solutions?

Thanks in advance for your kind help.


(Guillaume Klein) #18

Oh there was actually a small error when reporting the final score on the gold data. You may want to update the project and retry.
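
For reading the two log lines above, here is a minimal sketch of how an average score relates to a perplexity. It assumes the reported PPL is the exponential of the negated average per-token log-probability, which is consistent with the -0.46 / 1.59 pair in the log (the small difference comes from rounding the score). The function names are hypothetical, and this only illustrates the arithmetic, not the project's code or the cause of the reporting error.

```python
import math


def average_score(token_log_probs):
    """Average log-probability per token."""
    return sum(token_log_probs) / len(token_log_probs)


def perplexity(avg_score):
    """Perplexity, assumed here to be exp(-average log-probability)."""
    return math.exp(-avg_score)


# A finite average score gives a finite perplexity.
print(perplexity(-0.46))

# A single -inf term makes the whole average -inf, so the perplexity is inf.
gold = average_score([-0.3, float("-inf"), -0.7])
print(gold, perplexity(gold))
```

Under that assumption, a GOLD AVG SCORE of -inf always comes with a GOLD PPL of inf, so the two values in the log are one symptom rather than two.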


(Minjae Lee) #19

It works now!
Thank you very much.


(Guillaume Klein) #20

4 posts were split to a new topic: Issues when running the English-German WMT15 training


(Lifeng Dong) #22

Hi, how can I create the -valid_src/-valid_tgt files used by the preprocess.lua command? Are they required by the command? Can I just use a subset of the -train_src/-train_tgt files for them? Thanks!


(Guillaume Klein) #23

Yes, they are required.

Yes, a subset of the training data is fine.

2000 sentences are enough for the validation set.
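
One way to build such a validation set is to hold out random sentence pairs from the training data. The sketch below is only an illustration: the function name, the seed, and the choice to remove the held-out pairs from the training set are assumptions, not something stated in this thread.

```python
import random


def split_validation(src_lines, tgt_lines, n_valid=2000, seed=1234):
    """Hold out n_valid aligned sentence pairs; return (train, valid) pair lists."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_valid >= len(src_lines):
        raise ValueError("n_valid must be smaller than the corpus size")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    held_out = set(rng.sample(range(len(src_lines)), n_valid))
    train, valid = [], []
    for i, pair in enumerate(zip(src_lines, tgt_lines)):
        (valid if i in held_out else train).append(pair)
    return train, valid


# Tiny demonstration with 10 hypothetical sentence pairs.
src = ["source sentence %d" % i for i in range(10)]
tgt = ["target sentence %d" % i for i in range(10)]
train, valid = split_validation(src, tgt, n_valid=3)
print(len(train), len(valid))
```

The two sides of each list can then be written to the files passed as -train_src/-train_tgt and -valid_src/-valid_tgt.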


(Lifeng Dong) #24

I got it. Thanks a lot, Guillaume!


#26

Hi,

I want to translate with the bidirectional model I trained, so I added -config, but it reported an unknown option. Does that mean I don’t need to pass any config?

 th translate.lua -config ../en-de_gru_brnn_4.txt \
   -model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
   -src ../wmt15-de-en/test/newstest2014-deen-src.en.tok -output ../epoch1.txt -gpuid 1
**/home/lijun/torch/install/bin/luajit: ./onmt/utils/ExtendedCmdLine.lua:198: unkown option brnn**

I used brnn = true to train the model. Did I use the wrong option for translation?


(jean.senellart) #27

Hi - you do not need to use -config for translation - the options are part of the model.
The error message says that -brnn is not a translate.lua option.


#28

Hi,

I tried to run the BPE version with 2 layers, with the other settings the same as in the non-BPE version, but only got a BLEU score of 18.69, compared with the 19.34 you reported, which seems a big gap. Is there anything I should pay attention to?
Besides, I ran a 4-layer bi-LSTM, but only got a BLEU score of 16.95. Is that reasonable? I thought 4 layers should be better than 2 layers (according to Google’s massive exploration of NMT architectures).

Thanks very much.


(Vincent Nguyen) #29

Maybe just provide the 2 command lines you used for your training.
That might help to see what could lead to different results.


#30

For the 4-layer bi-LSTM, I made a config file with “brnn = True” and “layers = 4”, then ran
$ th train.lua -config xxx -data xxx -save_model xxx -gpuid 1
For the BPE version, I followed the tutorial “Training Romance Multi-Way model”, then ran the same command line.

Could someone provide a recipe for training a deep model? Thanks a lot.


(Wiktor Stribiżew) #31

@jean.senellart I am about to test ENDE training, and noticed you did not use -case_feature true at the tokenization step. Shouldn’t it be the default for English-German, since German noun capitalization issues may appear if the whole input corpus is lowercased, or do I misunderstand the -case_feature option?


(jean.senellart) #32

Hi Wiktor, yes, in general using -case_feature is a good idea for almost all languages (with case). It performs well at handling case variation in the source and rendering it in the target (for instance all uppercase/lowercase, or capitalized words, …).
Here, you probably also want a BPE tokenization to better generate German compounds.
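
To make the idea of a case feature concrete, here is a small sketch: each token is lowercased and the case information it loses is attached as a separate label. The label set (C, L, U, M, N) and the ￨ separator are assumptions modeled on the tokenizer's output format, and the classification rules are simplified, so check the tokenization documentation for the exact behavior.

```python
def case_feature(token):
    """Classify the casing of a token (simplified, illustrative rules)."""
    cased = [c for c in token if c.lower() != c.upper()]
    if not cased:
        return "N"  # no cased characters, e.g. numbers or punctuation
    if token == token.lower():
        return "L"  # all lowercase
    if token == token.upper() and len(cased) > 1:
        return "U"  # all uppercase
    if token[0].isupper() and token[1:] == token[1:].lower():
        return "C"  # capitalized
    return "M"  # mixed case


def annotate(sentence):
    """Lowercase every token and attach its case label."""
    return " ".join(
        "%s￨%s" % (tok.lower(), case_feature(tok)) for tok in sentence.split()
    )


print(annotate("Das Haus der NATO"))
```

With this representation the vocabulary only holds lowercased forms, while German noun capitalization survives as a feature the model can learn to predict on the target side.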


(Wiktor Stribiżew) #33

@jean.senellart Thank you for bringing BPE tokenization to my attention. However, I am not sure I have found the right procedure. Sorry if these are basic questions, but I spent a whole day trying to figure this out:

  1. I guess BPE tokenization must be run first. I ran

for f in $FLDR/*.txt; do th tools/learn_bpe.lua -size 30000 -save_bpe $f.bpe < $f; done;

This results in several “No pair has frequency > 1. Stopping” messages, but the *.bpe files are created. Is there anything I should worry about here?

[07/26/17 19:08:26 INFO] Building vocabulary from STDIN
[07/26/17 19:08:26 INFO] Geting pair statistics from vocabulary
[07/26/17 19:08:26 INFO] Generating merge operations to output
[07/26/17 19:08:27 INFO] … 1000 merge operations generated
[07/26/17 19:08:28 INFO] … 2000 merge operations generated
[07/26/17 19:08:29 INFO] … 3000 merge operations generated
[07/26/17 19:08:30 INFO] … 4000 merge operations generated
[07/26/17 19:08:31 INFO] … 5000 merge operations generated
[07/26/17 19:08:33 INFO] … 6000 merge operations generated
[07/26/17 19:08:35 INFO] … 7000 merge operations generated
No pair has frequency > 1. Stopping

  2. Looking at the .bpe files, I am not sure I should run BPE tokenization on the English corpora. The EN .bpe file contains entries like wr ite, writ ten</w>, etc. Is that correct?

  3. Then I ran tools/tokenize.lua:

for f in $FLDR/*.txt; do th tools/tokenize.lua -bpe_model $FLDR/${f##*/}.bpe -joiner_annotate true -case_feature true < $f > $f.tok; done;

I guess these .tok files should be fed to the preprocess.lua script without any additional parameters, right?

  4. When I run preprocess.lua with th preprocess.lua -config cfg.txt (with the default 5 parameters), I get 2 more *.dict files: train2.target_feature_1.dict and train2.target_feature_1.dict. Should I specify them anywhere in the train.lua script parameters? Or should I just run th train.lua -data $FLDR/ENDE-train.t7 -save_model ENDE_model?

Thank you.

PS I wrote $FLDR in the post, while I actually had to type the full literal folder path to the files, I have not figured out how to use variables in there.


(jean.senellart) #34

Hi Wiktor,

  1. The No pair has frequency > 1 message is not normal, but I understand that you are also running it on the test/valid corpora, which would explain it. The BPE model should be trained on the training corpus only and, ideally, you train one single model for source and target so that the model easily learns to translate identical word fragments from source to target.
    So I would concatenate the source and target training corpora, tokenize the result once, then learn a BPE model on this single corpus, which you then use to tokenize the test/valid/train corpora in both source and target.

See the following link for a discussion of BPE and case handling: http://opennmt.net/OpenNMT/tools/tokenization/#bpe
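
As an illustration of why a small file triggers that message, here is a toy version of the BPE learning loop (in the style of the original BPE algorithm, not the code of tools/learn_bpe.lua). It repeatedly merges the most frequent adjacent symbol pair and stops once no pair occurs more than once, which a small corpus reaches long before 30000 merges. The corpus contents and function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(lines, num_merges):
    """Learn up to num_merges merge operations from whitespace-tokenized lines."""
    # Each word is a tuple of symbols; the last symbol carries an end-of-word marker.
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word[:-1]) + (word[-1] + "</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # corresponds to "No pair has frequency > 1. Stopping"
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


# Hypothetical toy data standing in for the tokenized training corpora.
source_train = ["write written writer", "the writer writes"]
target_train = ["schreiben geschrieben schreiber", "der schreiber schreibt"]

# One joint model: concatenate source and target, then learn a single set of merges.
joint = source_train + target_train
merges = learn_bpe(joint, 30000)
print(len(merges), "merge operations learned")
```

Because the merges are learned on both languages together, fragments shared by source and target end up segmented the same way on each side.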

There is also a step-by-step walkthrough in this tutorial:

Yes.

You don’t need to specify them - they are not used for the training (everything is prepared in the .t7 file).

Jean


(SeisQ) #36

My apologies if this is posted in the wrong place. It is about training a German-English model.

I tried to replicate the results of the benchmark model presented on the OpenNMT Models and Recipes page (http://opennmt.net/Models/). I followed the same tokenization procedure as the tutorial in this forum thread (but applied to DE->EN). As the benchmark indicates, I only changed the -mode parameter to aggressive and left everything else at the default settings:

$ for f in wmt15-de-en/*.?? ; do echo "— tokenize $f with Aggressive Tokenization "; th tools/tokenize.lua -save_config wmt15-de-en -mode aggressive < $f > $f.tok ; done

I then concatenated the corpora that would become the training dataset (i.e. raw Europarl v7 + Common Crawl + News Commentary v10), which created the German and English versions of wmt15-all-de-en.??.tok.

Afterwards I preprocessed the data to create the .t7 files needed for training and testing:

$ th preprocess.lua -save_config preprocessing_configs -keep_frequency -log_file preprocessing_log -train_src wmt15-all-de-en.de.tok -train_tgt wmt15-all-de-en.en.tok -valid_src newstest2013.de.tok -valid_tgt newstest2013.en.tok -save_data wmt15-all-de-en

Finally I ran the training session with the following command:

$ th train.lua -save_config training_configs -data wmt15-all-de-en-train.t7 -report_every 200 -save_model wmt15-de-en_aggressive_tok -exp wmt15_de_en_aggTOK -gpuid 1 -log_file training_log

The default settings, as I understand them, matched the requirements stated in the Benchmark report. So I didn’t change anything else. After training, this resulted in a 9.14 score on the newstest2013 validation set.

Did I overlook something? Could someone help point me in the right direction?

Thanks for the help.


(SeisQ) #37

Has anyone noticed that at least the commoncrawl data for German is not all purely German?

For example, items 8~49 are English sentences, and they don’t even match the respective lines in the English corpus. At first I thought it was a mistake on my part during processing, but the same appears in the files provided by OpenNMT (http://opennmt.net/Models/) and in those on the WMT website (http://www.statmt.org/wmt17/translation-task.html). Does anyone know if there is a corrected version? As far as I can tell this has not been reported. If there is one, where can I find it?

Thank you.


(Wiktor Stribiżew) #38

@seisqui I did not find the Common Crawl corpus problematic, but I did have issues with the News Commentary corpus (due to a different number of source and target “sentences”), which I had to download as an XLIFF file from http://www.casmacat.eu/corpus/news-commentary.html.
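
A mismatch in the number of source and target lines can be caught before preprocessing with a quick check like the sketch below. The function names are hypothetical and the files are assumed to be UTF-8 text with one sentence per line.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            "line count mismatch: %s has %d lines, %s has %d"
            % (src_path, n_src, tgt_path, n_tgt)
        )
    return n_src
```

Equal line counts do not prove that the sentences are aligned, but unequal counts mean the pair of files cannot be used as a parallel corpus as-is.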

You might want to run a cleanup on the corpora before using them for model building. As part of the regular cleanup steps, like removing untranslated segments, duplicates, etc., you may run a language verification check, against the source, target or both. E.g. I work on TMX files using our corporate tool and then convert them to TXT parallel corpora.
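
The cleanup steps mentioned above could be sketched roughly as follows. This is not the corporate tool referred to in the post: the function names are hypothetical, and the language check is a deliberately crude stopword heuristic with made-up hint lists, so a real language identifier would be a better choice in practice.

```python
# Hypothetical hint lists; far too small for real use.
EN_HINTS = {"the", "and", "of", "to", "is", "that", "this"}
DE_HINTS = {"der", "die", "das", "und", "ist", "nicht", "ein"}


def guess_language(sentence):
    """Return "en", "de", or None when the hint words give no clear answer."""
    tokens = sentence.lower().split()
    en = sum(tok in EN_HINTS for tok in tokens)
    de = sum(tok in DE_HINTS for tok in tokens)
    if en > de:
        return "en"
    if de > en:
        return "de"
    return None


def clean_corpus(pairs, src_lang, tgt_lang):
    """Drop empty, untranslated, duplicate, and wrong-language sentence pairs."""
    seen, kept = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty side
        if src == tgt:
            continue  # untranslated segment
        if (src, tgt) in seen:
            continue  # duplicate pair
        if guess_language(src) not in (src_lang, None):
            continue  # source looks like the wrong language
        if guess_language(tgt) not in (tgt_lang, None):
            continue  # target looks like the wrong language
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


pairs = [
    ("Das ist ein Haus .", "That is a house ."),
    ("Das ist ein Haus .", "That is a house ."),
    ("The house is big .", "The house is big ."),
    ("This is the end of the page .", "Das ist nicht der Anfang ."),
]
print(clean_corpus(pairs, src_lang="de", tgt_lang="en"))
```

Pairs for which the heuristic cannot decide are kept, which errs on the side of not discarding good data.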