Training English-German WMT15 NMT engine


(Guillaume Klein) #23

Yes.

Yes.

2000 sentences are enough for the validation set.


(Lifeng Dong) #24

I got it. Thanks a lot, Guillaume!


#26

Hi,

I want to translate with a bidirectional model I trained, so I added -config, but it reported an unknown option. Does this mean I don’t need to feed in any config?

 th translate.lua -config ../en-de_gru_brnn_4.txt \
   -model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
   -src ../wmt15-de-en/test/newstest2014-deen-src.en.tok -output ../epoch1.txt -gpuid 1
**/home/lijun/torch/install/bin/luajit: ./onmt/utils/ExtendedCmdLine.lua:198: unkown option brnn**

I used brnn = true to train the model. Did I use the wrong option to run the translation?


(jean.senellart) #27

Hi - you do not need to use -config for translation - the options are part of the model.
The error message says that -brnn is not a translate.lua option.
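
For example, keeping the paths from your post, the command would simply be:

 th translate.lua -model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
   -src ../wmt15-de-en/test/newstest2014-deen-src.en.tok -output ../epoch1.txt -gpuid 1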


#28

Hi,

I tried to run the BPE version with 2 layers, with the other settings the same as the non-BPE version, but I only got a BLEU score of 18.69 versus the 19.34 in your report, which seems like a big gap. Is there anything I should pay attention to?
Besides, I ran a bi-LSTM with 4 layers, but only got a 16.95 BLEU score; is that reasonable? I would expect 4 layers to be better than 2 layers (according to Google’s massive exploration on NMT).

Thanks very much.


(Vincent Nguyen) #29

Maybe just provide the two command lines you used for your training.
That might help to see what could lead to different results.


#30

For the 4-layer bi-LSTM, I made a config file with “brnn = true” and “layers = 4”, then ran:
$ th train.lua -config xxx -data xxx -save_model xxx -gpuid 1
For the BPE version, I followed the “Training Romance Multi-Way model” tutorial, then ran the same command line.
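
To be concrete, the config file for the bi-LSTM run contained just these two lines (everything else was left at the defaults):

brnn = true
layers = 4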

Could someone provide a recipe for running a deep model? Thanks a lot.


(Wiktor Stribiżew) #31

@jean.senellart I am about to test ENDE training, and noticed you did not use -case_feature true at the tokenization step. Isn’t that the default approach for English-German, since German noun capitalization issues may appear if the whole input corpus is lowercased, or do I understand the -case_feature option wrong?


(jean.senellart) #32

Hi Wiktor, yes, in general using -case_feature is a good idea for almost all languages (with case) - it performs well at handling case variation in the source (for instance all uppercase/lowercase, or capitalized words, …) and rendering it in the target.
Here, you probably also want BPE tokenization to better generate German compounds.
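
For example, the tokenization command could then look roughly like this (the BPE model file name here is just a placeholder):

 th tools/tokenize.lua -case_feature true -joiner_annotate true -bpe_model ende.bpe < corpus.raw > corpus.tok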


(Wiktor Stribiżew) #33

@jean.senellart Thank you for bringing BPE tokenization to my attention. However, I am not sure I have found the right procedure to follow here. Sorry if these are basic questions, but I spent a whole day figuring this out:

  1. I guess BPE tokenization must be run first. I ran for f in $FLDR/*.txt; do th tools/learn_bpe.lua -size 30000 -save_bpe $f.bpe < $f; done;. This results in several “No pair has frequency > 1. Stopping” messages, but the *.bpe files are created - is there anything I should worry about here?


[07/26/17 19:08:26 INFO] Building vocabulary from STDIN
[07/26/17 19:08:26 INFO] Geting pair statistics from vocabulary
[07/26/17 19:08:26 INFO] Generating merge operations to output
[07/26/17 19:08:27 INFO] … 1000 merge operations generated
[07/26/17 19:08:28 INFO] … 2000 merge operations generated
[07/26/17 19:08:29 INFO] … 3000 merge operations generated
[07/26/17 19:08:30 INFO] … 4000 merge operations generated
[07/26/17 19:08:31 INFO] … 5000 merge operations generated
[07/26/17 19:08:33 INFO] … 6000 merge operations generated
[07/26/17 19:08:35 INFO] … 7000 merge operations generated
No pair has frequency > 1. Stopping

  2. Looking at the .bpe files, I am not sure whether I should run BPE tokenization on the English corpora. The EN .bpe file contains entries like wr ite, writ ten</w>, etc. Is that correct?

  3. Then I ran tools/tokenize.lua:

for f in $FLDR/*.txt; do th tools/tokenize.lua -bpe_model $FLDR/${f##*/}.bpe -joiner_annotate true -case_feature true < $f > $f.tok; done;

I guess these .tok files should be fed to the preprocess.lua script without any additional parameters, right?

  4. When I run preprocess.lua with th preprocess.lua -config cfg.txt (with the default 5 parameters, sketched below), I get 2 more *.dict files: train2.source_feature_1.dict and train2.target_feature_1.dict - should I specify them anywhere in the train.lua script parameters? Or should I just run th train.lua -data $FLDR/ENDE-train.t7 -save_model ENDE_model?
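
In other words, something like this (file names shortened here; my cfg.txt simply holds these same five options):

th preprocess.lua -train_src train.en.tok -train_tgt train.de.tok -valid_src valid.en.tok -valid_tgt valid.de.tok -save_data $FLDR/ENDE
th train.lua -data $FLDR/ENDE-train.t7 -save_model ENDE_model -gpuid 1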

Thank you.

PS: I wrote $FLDR in the post, but I actually had to type the full literal folder path to the files; I have not figured out how to use variables there.


(jean.senellart) #34

Hi Wiktor,

  1. the No pair has frequency > 1 message is not normal, but I understand that you are running it on the test/valid corpora too, which would explain it. The BPE model should be trained on the training corpus only - and ideally you train one single model for source and target, so that the model more easily learns to translate identical word fragments from source to target.
    So I would concatenate the source and target training corpora, tokenize that once, then learn a BPE model on this single corpus, which you then use for the tokenization of the test/valid/train corpora in source and target (see the sketch below).
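
Schematically, something like this (file names are placeholders; the flags are the same ones used earlier in this thread):

cat train.en train.de > train.ende
th tools/tokenize.lua < train.ende > train.ende.tok
th tools/learn_bpe.lua -size 30000 -save_bpe ende.bpe < train.ende.tok
th tools/tokenize.lua -bpe_model ende.bpe -joiner_annotate true -case_feature true < train.en > train.en.tok
th tools/tokenize.lua -bpe_model ende.bpe -joiner_annotate true -case_feature true < train.de > train.de.tok

and the same last two commands for the valid and test files.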

See the following link for a discussion of BPE and case handling: http://opennmt.net/OpenNMT/tools/tokenization/#bpe

You also have a step-by-step guide in this tutorial:

yes

You don’t need to specify them - they are not used for training (everything is prepared in the .t7 file).

Jean


(SeisQ) #36

My apologies if this is posted in the wrong place. It is about training a German-English model.

I tried to replicate the results of the benchmark model presented on the OpenNMT Models and Recipes page (http://opennmt.net/Models/). I followed the same tokenization procedure as the tutorial presented in this forum thread (but applied to DE->EN). As the benchmark indicates, I only changed the -mode parameter to aggressive and left everything else at the default settings:

$ for f in wmt15-de-en/*.?? ; do echo "— tokenize $f with Aggressive Tokenization "; th tools/tokenize.lua -save_config wmt15-de-en -mode aggressive < $f > $f.tok ; done

I then concatenated the corpora that would become the training dataset (i.e. raw Europarl v7 + Common Crawl + News Commentary v10), which created the German and English versions of wmt15-all-de-en.??.tok.
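
For reference, the concatenation was just along these lines (the exact corpus file names may differ on your side):

cat europarl-v7.de-en.de.tok commoncrawl.de-en.de.tok news-commentary-v10.de-en.de.tok > wmt15-all-de-en.de.tok
cat europarl-v7.de-en.en.tok commoncrawl.de-en.en.tok news-commentary-v10.de-en.en.tok > wmt15-all-de-en.en.tok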

Afterwards I preprocessed the data to create the .t7 files needed for training and testing:

$ th preprocess.lua -save_config preprocessing_configs -keep_frequency -log_file preprocessing_log -train_src wmt15-all-de-en.de.tok -train_tgt wmt15-all-de-en.en.tok -valid_src newstest2013.de.tok -valid_tgt newstest2013.en.tok -save_data wmt15-all-de-en

Finally I ran the training session with the following command:

$ th train.lua -save_config training_configs -data wmt15-all-de-en-train.t7 -report_every 200 -save_model wmt15-de-en_aggressive_tok -exp wmt15_de_en_aggTOK -gpuid 1 -log_file training_log

The default settings, as I understand them, matched the requirements stated in the benchmark report, so I didn’t change anything else. After training, this resulted in a score of 9.14 on the newstest2013 validation set.

Did I overlook something? Could someone help point me in the right direction?

Thanks for the help.


(SeisQ) #37

Has anyone noticed that at least the Common Crawl data for German is not entirely German?

For example, items 8–49 contain English sentences, and those don’t even match the respective lines in the English corpus. At first I thought it was a mistake on my part during processing, but it appears like this both in the files provided by OpenNMT (http://opennmt.net/Models/) and in those on the WMT website (http://www.statmt.org/wmt17/translation-task.html). Does anyone know if there is a corrected version? As far as I can tell this has not been reported. If it has, where can I find it?

Thank you.


(Wiktor Stribiżew) #38

@seisqui I did not find the Common Crawl corpus problematic, but I did have issues with the News Commentary corpus (due to a different number of source and target “sentences”), which I had to download as an XLIFF file from http://www.casmacat.eu/corpus/news-commentary.html.

You might want to run a cleanup on the corpora before using them for model building. As part of the regular cleanup steps, such as removing untranslated segments, duplicates, etc., you can run a language verification check against the source, the target, or both. E.g. I work on TMX files using our corporate tool and then convert them to TXT parallel corpora.