My apologies if this is posted in the wrong place. It is about training a German-English model.
I tried to replicate the results of the benchmark model presented on the OpenNMT Models and Recipes page (http://opennmt.net/Models/). I followed the same tokenization procedure as the tutorial presented in this forum thread (adapted to DE->EN). As the benchmark indicates, I only changed the -mode parameter to aggressive and left everything else at its default settings:
$ for f in wmt15-de-en/*.?? ; do echo "-- tokenize $f with aggressive tokenization"; th tools/tokenize.lua -save_config wmt15-de-en -mode aggressive < $f > $f.tok ; done
I then concatenated the corpora that make up the training dataset (i.e., raw Europarl v7 + Common Crawl + News Commentary v10), which produced the German and English sides of wmt15-all-de-en.??.tok.
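In case the file handling matters, the concatenation was just a cat per language side. Here is a self-contained sketch, with tiny placeholder files standing in for the real tokenized corpora (the corpus file names are how I had them locally, so treat them as an assumption):

```shell
# Placeholder corpora; in the real run these were the tokenized
# Europarl v7, Common Crawl and News Commentary v10 files.
for corpus in europarl-v7 commoncrawl news-commentary-v10; do
  for lang in de en; do
    printf '%s line\n' "$corpus" > "$corpus.de-en.$lang.tok"
  done
done

# The actual concatenation step: one combined file per language side,
# keeping the corpus order identical on the German and English sides
# so the files stay parallel line-for-line.
for lang in de en; do
  cat "europarl-v7.de-en.$lang.tok" \
      "commoncrawl.de-en.$lang.tok" \
      "news-commentary-v10.de-en.$lang.tok" \
      > "wmt15-all-de-en.$lang.tok"
done
```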
Afterwards I preprocessed the data to create the .t7 files needed for training and testing:
$ th preprocess.lua -save_config preprocessing_configs -keep_frequency -log_file preprocessing_log -train_src wmt15-all-de-en.de.tok -train_tgt wmt15-all-de-en.en.tok -valid_src newstest2013.de.tok -valid_tgt newstest2013.en.tok -save_data wmt15-all-de-en
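One thing I did double-check before preprocessing is that the source and target sides stayed parallel after tokenization, since the two files must match line-for-line. A minimal sketch of that check (placeholder files here stand in for wmt15-all-de-en.{de,en}.tok):

```shell
# Placeholder parallel corpus standing in for the real tokenized files.
printf 'ein satz\nnoch einer\n' > sample.de.tok
printf 'a sentence\nanother one\n' > sample.en.tok

# Both sides of a parallel corpus must have the same number of lines,
# otherwise sentence pairs end up misaligned.
src_lines=$(wc -l < sample.de.tok)
tgt_lines=$(wc -l < sample.en.tok)
if [ "$src_lines" -eq "$tgt_lines" ]; then
  echo "parallel: $src_lines sentence pairs"
else
  echo "MISMATCH: $src_lines vs $tgt_lines lines" >&2
fi
```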
Finally I ran the training session with the following command:
$ th train.lua -save_config training_configs -data wmt15-all-de-en-train.t7 -report_every 200 -save_model wmt15-de-en_aggressive_tok -exp wmt15_de_en_aggTOK -gpuid 1 -log_file training_log
The default settings, as I understand them, match the requirements stated in the benchmark report, so I didn't change anything else. After training, this resulted in a score of 9.14 on the newstest2013 validation set.
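For reference, these are the train.lua defaults I believe applied in my run. I am reconstructing them from memory of the -h output, so please treat every value below as an assumption and correct me if any are wrong:

```
-layers 2
-rnn_size 500
-word_vec_size 500
-dropout 0.3
-optim sgd
-learning_rate 1
-end_epoch 13
-max_batch_size 64
```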
Did I overlook something? Could someone help point me in the right direction?
Thanks for the help.