Yes.
Yes.
2000 sentences are enough for the validation set.
I got it. Thanks a lot, Guillaume!
Hi,
I want to translate with the bidirectional model I trained, so I added -config, but it reports an unknown option. Does that mean I don’t need to feed in any config?
th translate.lua -config ../en-de_gru_brnn_4.txt \
-model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
-src ../wmt15-de-en/test/newstest2014-deen-src.en.tok -output ../epoch1.txt -gpuid 1
**/home/lijun/torch/install/bin/luajit: ./onmt/utils/ExtendedCmdLine.lua:198: unkown option brnn**
I used brnn = true to train the model. Did I use the wrong option when translating?
Hi - you do not need to use -config for translation - the options are part of the model. The error message says that -brnn is not a translate.lua option.
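To illustrate, the earlier command without -config would then look like this (paths copied from the question above - translate.lua reads options such as -brnn from the checkpoint itself):

```shell
# translate.lua takes the model directly; training options like -brnn
# are stored inside the .t7 checkpoint, so no -config file is needed
th translate.lua \
  -model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
  -src ../wmt15-de-en/test/newstest2014-deen-src.en.tok \
  -output ../epoch1.txt -gpuid 1
```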
Hi,
I tried to run the BPE version with 2 layers, with the other settings the same as the non-BPE version, but only got a BLEU score of 18.69 versus the 19.34 in your report, which seems like a big gap. Is there anything I should pay attention to?
Besides, I ran a bidirectional LSTM with 4 layers, but only got a 16.95 BLEU score - is that reasonable? I would expect 4 layers to be better than 2 layers (according to Google’s massive exploration of NMT architectures).
Thanks very much.
Maybe just provide the 2 command lines you used for your training; that might help to see what could lead to different results.
For the 4-layer bidirectional LSTM, I made a config file with “brnn = true” and “layers = 4”, then ran:
$ th train.lua -config xxx -data xxx -save_model xxx -gpuid 1
For the BPE version, I followed the tutorial “Training Romance Multi-Way model”, then ran the same command line.
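For reference, that config file can be written as plain key = value lines; a minimal sketch of the setup described above (file name and the trailing train.lua call are illustrative, not the poster’s actual paths):

```shell
# Write a minimal config file equivalent to "brnn = True" / "layers = 4"
# (note: OpenNMT's Lua config expects lowercase true)
cat > config_brnn4.txt <<'EOF'
brnn = true
layers = 4
EOF
# then: th train.lua -config config_brnn4.txt -data <data.t7> -save_model <model> -gpuid 1
```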
Could someone provide a recipe for running a deep model? Thanks a lot.
@jean.senellart I am about to test EN-DE training, and noticed you did not use -case_feature true at the tokenization step. Isn’t it the default approach for English-German, since German noun capitalization issues may appear if all the input corpus is lowercased? Or do I understand the -case_feature option wrong?
Hi Wiktor, yes, in general using -case_feature is a good idea for almost all languages with case - it helps the model deal with and render in the target the case variation found in the source (for instance all uppercase/lowercase, or capitalized words, …).
Here, you probably also want a BPE tokenization to better generate German compounds too.
@jean.senellart Thank you for bringing BPE tokenization to my attention. However, I am not sure I have found the right procedure. Sorry if these are basic questions, but I spent a whole day figuring this out:
I guess BPE tokenization must be run first. I ran for f in $FLDR/*.txt; do th tools/learn_bpe.lua -size 30000 -save_bpe $f.bpe < $f; done;. This results in several “No pair has frequency > 1. Stopping” messages, but the *.bpe files are created - is there anything I should worry about here?
[07/26/17 19:08:26 INFO] Building vocabulary from STDIN
[07/26/17 19:08:26 INFO] Geting pair statistics from vocabulary
[07/26/17 19:08:26 INFO] Generating merge operations to output
[07/26/17 19:08:27 INFO] … 1000 merge operations generated
[07/26/17 19:08:28 INFO] … 2000 merge operations generated
[07/26/17 19:08:29 INFO] … 3000 merge operations generated
[07/26/17 19:08:30 INFO] … 4000 merge operations generated
[07/26/17 19:08:31 INFO] … 5000 merge operations generated
[07/26/17 19:08:33 INFO] … 6000 merge operations generated
[07/26/17 19:08:35 INFO] … 7000 merge operations generated
No pair has frequency > 1. Stopping
Looking at the .bpe files, I am not sure I should run BPE tokenization on English corpora. The EN .bpe file contains entries like wr ite, writ ten</w>, etc. Is that correct?
Then I ran tools/tokenize.lua:
for f in $FLDR/*.txt; do th tools/tokenize.lua -bpe_model $FLDR/${f##*/}.bpe -joiner_annotate true -case_feature true < $f > $f.tok; done;
I guess these .tok files should be fed to the preprocess.lua script without any additional parameters, right?
After running th preprocess.lua -config cfg.txt (with the default 5 parameters), I get 2 more *.dict files: train2.source_feature_1.dict and train2.target_feature_1.dict - should I specify them anywhere in the train.lua parameters? Or should I just run th train.lua -data $FLDR/ENDE-train.t7 -save_model ENDE_model? Thank you.
PS: I wrote $FLDR in the post, but I actually had to type the full literal folder path to the files - I have not figured out how to use variables there.
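For what it’s worth, a folder variable in bash is just an assignment with no spaces around the equals sign; a small self-contained sketch (the /tmp path and file are hypothetical, for illustration only):

```shell
# Assign the folder path once (no spaces around '='), then reuse it.
FLDR=/tmp/bpe_demo
mkdir -p "$FLDR"
printf 'hello world\n' > "$FLDR/sample.txt"

# $FLDR expands inside double quotes and in the glob below,
# so the same loops from the tutorial work unchanged.
for f in "$FLDR"/*.txt; do
  echo "processing $f"
done
```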
Hi Wiktor,
See the following link for also a discussion about BPE and case handling http://opennmt.net/OpenNMT/tools/tokenization/#bpe
There is also a step-by-step guide in this tutorial:
Yes, you don’t need to specify them - they are not used for training (everything is prepared in the .t7 file).
Jean
My apologies if this is posted in the wrong place. It is about training a German-English model.
I tried to replicate the results of the benchmark model presented in the OpenNMT Models and Recipes page (http://opennmt.net/Models/). I followed the same tokenization procedure as the one in the tutorial presented in this forum thread (but applied to DE->EN). As the benchmark indicates, I only changed the -mode parameter to aggressive and left everything else at the default settings:
$ for f in wmt15-de-en/*.?? ; do echo "— tokenize $f with Aggressive Tokenization "; th tools/tokenize.lua -save_config wmt15-de-en -mode aggressive < $f > $f.tok ; done
I then concatenated the corpora that would become the training dataset (i.e. raw Europarl v7 + Common Crawl + News Commentary v10), which produced the German and English versions of wmt15-all-de-en.??.tok.
Afterwards I preprocessed the data to create the .t7 files needed for training and testing:
$ th preprocess.lua -save_config preprocessing_configs -keep_frequency -log_file preprocessing_log -train_src wmt15-all-de-en.de.tok -train_tgt wmt15-all-de-en.en.tok -valid_src newstest2013.de.tok -valid_tgt newstest2013.en.tok -save_data wmt15-all-de-en
Finally I ran the training session with the following command:
$ th train.lua -save_config training_configs -data wmt15-all-de-en-train.t7 -report_every 200 -save_model wmt15-de-en_aggressive_tok -exp wmt15_de_en_aggTOK -gpuid 1 -log_file training_log
The default settings, as I understand them, matched the requirements stated in the benchmark report, so I didn’t change anything else. After training, this resulted in a 9.14 BLEU score on the newstest2013 validation set.
Did I overlook something? Could someone help point me in the right direction?
Thanks for the help.
Has anyone noticed that at least the Common Crawl data for German is not purely German?
For example items 8~49 have English sentences, and those don’t even match the respective lines in the English corpus. At first I thought that it was a mistake on my part during processing, but it appears like this in the files provided here in OpenNMT (http://opennmt.net/Models/) and those in the WMT website (http://www.statmt.org/wmt17/translation-task.html). Does anyone know if there is a corrected version? As far as I can tell this has not been reported. If it is, where can I find it?
Thank you.
@seisqui I did not find the Common Crawl corpus problematic, but I did have issues with the News Commentary corpus (due to a different number of source and target “sentences”), which I had to download as an XLIFF file from http://www.casmacat.eu/corpus/news-commentary.html.
You might want to run a cleanup on the corpora before using them for model building. As part of the regular cleanup steps (removing untranslated segments, duplicates, etc.), you may run a language verification check against the source, the target, or both. E.g. I work on TMX files using our corporate tool and then convert them to TXT parallel corpora.
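As a minimal illustration of such a cleanup (deduplication plus dropping empty and untranslated segments), assuming plain-text parallel files src.txt/tgt.txt with one segment per line, standard Unix tools are enough - the tiny sample files below are made up for the example:

```shell
# Build tiny sample parallel files (stand-ins for the real corpora)
printf 'Hallo Welt\nHallo Welt\nGuten Morgen\nGuten Morgen\n' > src.txt
printf 'Hello world\nHello world\nGood morning\nGuten Morgen\n' > tgt.txt

# Pair lines, drop duplicate pairs, empty sides, and segments where
# source == target (likely untranslated), then split back into two files.
paste src.txt tgt.txt \
  | awk -F'\t' '!seen[$0]++ && $1 != $2 && $1 != "" && $2 != ""' \
  > pairs.clean
cut -f1 pairs.clean > src.clean.txt
cut -f2 pairs.clean > tgt.clean.txt
wc -l < pairs.clean   # 2 pairs survive in this toy example
```

A real pipeline would add a language-identification pass (e.g. with a langid tool) on each side, but the pairing trick above keeps the two files aligned through every filter.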