No, it does not. You just need to get the multi-bleu.perl
script:
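For reference, the core of what multi-bleu.perl computes (clipped n-gram precisions combined into a geometric mean, times a brevity penalty) can be sketched in pure Python. This is only an illustrative single-reference sketch of corpus-level BLEU, not a drop-in replacement for the script:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum((h & r).values())        # clipped counts
            total[n - 1] += max(len(hyp) - n + 1, 0)
    if min(match) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

print(round(corpus_bleu(["the cat sat on the mat"],
                        ["the cat sat on the mat"]), 2))  # 100.0 for a perfect match
```

The real script additionally supports multiple reference files and reports the per-n precisions; for scoring OpenNMT output, feed it the detokenized (or consistently tokenized) prediction and reference files.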
Thanks - that is super helpful. I will take a look.
Thanks
Ganesh
Hello
I’ve followed the instructions in this tutorial, but ran into a problem.
(I also added the -mode aggressive option during tokenization to match the released model’s parameters.)
The problem I have is similar to what livenletdie had above.
I tried to evaluate the released model with the command below:
th translate.lua -model onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7 -src data/wmt15-de-en/newstest2013.en.tok -tgt data/wmt15-de-en/newstest2013.de.tok -output pred_new.txt -gpuid 3
What I’ve got in the end is:
[02/13/17 19:12:38 INFO] PRED AVG SCORE: -0.46, PRED PPL: 1.59
[02/13/17 19:12:38 INFO] GOLD AVG SCORE: -inf, GOLD PPL: inf
I presumed that something was wrong with the gold data “newstest2013.de” and retried the preprocessing steps several times, but could not solve the problem.
(I get bad scores on the gold data, sometimes below -100.)
Are there any possible solutions?
Thanks in advance for your kind help.
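For context on the numbers in that log: the reported PPL is simply the exponential of the negated average per-token log-likelihood, so a gold average score of -inf necessarily yields an infinite gold perplexity - a single gold token assigned probability zero is enough. A quick sanity check using the (rounded) -0.46 from the log above:

```python
import math

avg_score = -0.46            # PRED AVG SCORE from the log (rounded)
ppl = math.exp(-avg_score)   # perplexity = exp(-average log-likelihood)
print(round(ppl, 2))         # 1.58, matching the reported PRED PPL of 1.59 up to rounding

# A gold token with probability 0 has log-probability -inf, so the
# average score becomes -inf and the perplexity infinite:
print(math.exp(-float("-inf")))  # inf
```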
Oh there was actually a small error when reporting the final score on the gold data. You may want to update the project and retry.
https://github.com/OpenNMT/OpenNMT/commit/af47fa34710eb35d98e5b162ff0917aafc1b3411
It works now!
Thank you very much.
Hi, how should the -valid_src/-valid_tgt files be provided to the preprocess.lua command? Are they required by the command? Can I just use a subset of the -train_src/-train_tgt data for them? Thanks!
Yes, they are required. And yes, a subset of the training data works.
2000 sentences are enough for the validation set.
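If you take the subset route, you can carve out an aligned validation set with a few lines of Python. This is a hypothetical helper (file names and the seed are arbitrary); the sampled pairs should also be removed from the training files, as this sketch does, so the model is not validated on data it has seen:

```python
import random

def split_validation(src_path, tgt_path, n_valid=2000, seed=13):
    """Sample n_valid aligned sentence pairs as a validation set and
    keep the remaining pairs for training (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as f:
        src = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.read().splitlines()
    assert len(src) == len(tgt), "corpora are not parallel"
    idx = list(range(len(src)))
    random.Random(seed).shuffle(idx)        # deterministic shuffle
    valid_idx = set(idx[:n_valid])
    train = [(s, t) for i, (s, t) in enumerate(zip(src, tgt)) if i not in valid_idx]
    valid = [(src[i], tgt[i]) for i in sorted(valid_idx)]
    return train, valid
```

Write the returned pairs back out as the -train_src/-train_tgt and -valid_src/-valid_tgt files before running preprocess.lua.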
I got it. Thanks a lot, Guillaume!
Hi,
I want to translate with the bidirectional model I trained, so I added -config, but it reports an unknown option. Does that mean I don’t need to feed in any config?
th translate.lua -config ../en-de_gru_brnn_4.txt \
-model ../wmt15-de-en/en-de_gru_brnn_4/wmt15-en-de-gru-brnn-4_epoch1_42.90.t7 \
-src ../wmt15-de-en/test/newstest2014-deen-src.en.tok -output ../epoch1.txt -gpuid 1
**/home/lijun/torch/install/bin/luajit: ./onmt/utils/ExtendedCmdLine.lua:198: unkown option brnn**
I used brnn = true to train the model. Did I use the wrong option when running translation?
Hi - you do not need to use -config for translation - the options are part of the model. The error message says that -brnn is not a translate.lua option.
Hi,
I tried to run the BPE version with 2 layers, with the other settings the same as the non-BPE version, but only got a BLEU score of 18.69 versus the 19.34 in your report, which seems like a big gap. Is there anything I should pay attention to?
Besides, I ran a bi-LSTM with 4 layers, but only got a 16.95 BLEU score - is that reasonable? I would expect 4 layers to be better than 2 (according to Google’s massive exploration of NMT architectures).
Thanks very much.
Maybe just provide the 2 command lines you used for your training.
That might help to see what could lead to different results.
For the 4-layer bi-LSTM, I made a config file with “brnn = True” and “layers = 4”, then ran:
$ th train.lua -config xxx -data xxx -save_model xxx -gpuid 1
For the BPE version, I followed the tutorial “Training Romance Multi-Way model”, then ran the same command line.
Could someone provide a recipe for training a deep model? Thanks a lot.
@jean.senellart I am about to test EN-DE training, and noticed you did not use -case_feature true at the tokenization step. Isn’t it a sensible default for English-German, since German noun capitalization issues may appear if the whole input corpus is lowercased - or do I misunderstand the -case_feature option?
Hi Wiktor, yes, in general using -case_feature is a good idea for almost all languages with case - it helps the model both handle case variation in the source (for instance all-uppercase/lowercase text, or capitalized words, …) and render the proper case in the target.
Here, you probably also want BPE tokenization, to better generate German compounds.
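To make -case_feature concrete: the tokenizer lowercases each token and attaches its original casing as an extra word feature. Below is a rough pure-Python approximation of the idea - a simplification, not OpenNMT's actual implementation, and the real tokenizer uses a special vertical-bar separator rather than the plain '|' used here:

```python
def case_feature(token):
    """Simplified case class, in the spirit of OpenNMT's scheme
    (capitalized / uppercase / lowercase / mixed / none)."""
    if not any(c.isalpha() for c in token):
        return "N"                       # no letters: punctuation, numbers
    if token.isupper() and len(token) > 1:
        return "U"                       # all uppercase
    if token[0].isupper() and (len(token) == 1 or token[1:].islower()):
        return "C"                       # capitalized
    if token.islower():
        return "L"                       # all lowercase
    return "M"                           # mixed case

def annotate(sentence, sep="|"):
    """Lowercase each token and attach its case as a feature."""
    return " ".join(tok.lower() + sep + case_feature(tok)
                    for tok in sentence.split())

print(annotate("Die Katze sitzt auf der MATTE ."))
# die|C katze|C sitzt|L auf|L der|L matte|U .|N
```

The model then sees a normalized vocabulary ("die" instead of "die"/"Die"/"DIE") plus a small feature that lets it restore the right casing on the target side.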
@jean.senellart Thank you for bringing BPE tokenization to my attention. However, I am not sure I have the right procedure here. Sorry if these are basic questions, but I spent a whole day figuring this out:

- I guess BPE tokenization must be run first. I run

for f in $FLDR/*.txt; do th tools/learn_bpe.lua -size 30000 -save_bpe $f.bpe < $f; done;

This results in several “No pair has frequency > 1. Stopping” messages, but the *.bpe files are created - is there anything I should worry about here?
[07/26/17 19:08:26 INFO] Building vocabulary from STDIN
[07/26/17 19:08:26 INFO] Geting pair statistics from vocabulary
[07/26/17 19:08:26 INFO] Generating merge operations to output
[07/26/17 19:08:27 INFO] … 1000 merge operations generated
[07/26/17 19:08:28 INFO] … 2000 merge operations generated
[07/26/17 19:08:29 INFO] … 3000 merge operations generated
[07/26/17 19:08:30 INFO] … 4000 merge operations generated
[07/26/17 19:08:31 INFO] … 5000 merge operations generated
[07/26/17 19:08:33 INFO] … 6000 merge operations generated
[07/26/17 19:08:35 INFO] … 7000 merge operations generated
No pair has frequency > 1. Stopping
- Looking at the .bpe files, I am not sure I should run BPE tokenization on the English corpora. The EN.bpe file contains entries like “wr ite”, “writ ten</w>”, etc. Is that correct?
- Then I ran tools/tokenize.lua:

for f in $FLDR/*.txt; do th tools/tokenize.lua -bpe_model $FLDR/${f##*/}.bpe -joiner_annotate true -case_feature true < $f > $f.tok; done;

I guess these .tok files should be fed to the preprocess.lua script without any additional parameters, right?
- When I run preprocess.lua with th preprocess.lua -config cfg.txt (with the default 5 parameters), I get 2 more *.dict files: train2.source_feature_1.dict and train2.target_feature_1.dict - should I specify them anywhere in the train.lua parameters? Or should I just run th train.lua -data $FLDR/ENDE-train.t7 -save_model ENDE_model?
Thank you.
PS I wrote $FLDR in the post, but I actually had to type the full literal folder path to the files; I have not figured out how to use variables there.
Hi Wiktor,
- The “No pair has frequency > 1” message is not normal, but I understand you are running it on the test/valid corpora too, which would explain it. The BPE model should be trained on the training corpus only - and ideally you train one single model for source and target, so that the model easily learns to translate identical word fragments from source to target.
So I would concatenate the source and target training corpora, tokenize that once, then learn a BPE model on this single corpus, which you then use to tokenize the test/valid/train corpora on both the source and target sides.
See the following link for a discussion of BPE and case handling as well: http://opennmt.net/OpenNMT/tools/tokenization/#bpe
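As an aside, the “No pair has frequency > 1” message just means the learner ran out of symbol pairs occurring more than once before reaching the requested number of merges - expected on small files, and another reason to learn BPE only on the full training corpus. A minimal Python sketch of the learning loop, in the spirit of the original BPE algorithm rather than OpenNMT’s learn_bpe.lua:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word -> frequency dict.
    Stops early once no symbol pair occurs more than once."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for pair in zip(syms, syms[1:]):
                pairs[pair] += freq
        if not pairs or pairs.most_common(1)[0][1] <= 1:
            print("No pair has frequency > 1. Stopping")
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# On a tiny vocabulary the loop stops long before num_merges is reached,
# which is exactly what happens when learn_bpe.lua is fed a small file:
merges = learn_bpe({"write": 5, "written": 3, "writer": 2}, 30000)
```

This also explains the “wr ite” / “writ ten</w>” entries seen earlier: a BPE model is a list of subword merges, not whole words, and that is normal for English too.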
There is also a step-by-step guide in this tutorial:
Yes - you don’t need to specify them: they are not used for the training (everything is prepared in the .t7 file).
Jean
My apologies if this is posted in the wrong place. It is about training a German-English model.
I tried to replicate the results of the benchmark model presented on the OpenNMT Models and Recipes page (http://opennmt.net/Models/). I followed a tokenization procedure like the one in the tutorial in this forum thread (but applied to DE->EN). As the benchmark indicates, I only changed the -mode parameter to aggressive and left everything else at the default settings:
$ for f in wmt15-de-en/*.?? ; do echo "— tokenize $f with Aggressive Tokenization "; th tools/tokenize.lua -save_config wmt15-de-en -mode aggressive < $f > $f.tok ; done
I then concatenated the corpora that would become the training dataset (i.e. raw Europarl v7 + Common Crawl + News Commentary v10), which created the German and English versions of wmt15-all-de-en.??.tok.
Afterwards I preprocessed the data to create the .t7 files needed for training and testing:
$ th preprocess.lua -save_config preprocessing_configs -keep_frequency -log_file preprocessing_log -train_src wmt15-all-de-en.de.tok -train_tgt wmt15-all-de-en.en.tok -valid_src newstest2013.de.tok -valid_tgt newstest2013.en.tok -save_data wmt15-all-de-en
Finally I ran the training session with the following command:
$ th train.lua -save_config training_configs -data wmt15-all-de-en-train.t7 -report_every 200 -save_model wmt15-de-en_aggressive_tok -exp wmt15_de_en_aggTOK -gpuid 1 -log_file training_log
The default settings, as I understand them, matched the requirements stated in the benchmark report, so I didn’t change anything else. After training, this resulted in a BLEU score of 9.14 on the newstest2013 validation set.
Did I overlook something? Could someone help point me in the right direction?
Thanks for the help.
Has anyone noticed that at least the Common Crawl data for German is not purely German?
For example, items 8~49 contain English sentences, and those don’t even match the respective lines in the English corpus. At first I thought it was a mistake on my part during processing, but it appears like this both in the files provided here by OpenNMT (http://opennmt.net/Models/) and in those on the WMT website (http://www.statmt.org/wmt17/translation-task.html). Does anyone know if there is a corrected version? As far as I can tell this has not been reported. If it has been, where can I find it?
Thank you.
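I’m not aware of an official corrected release, but untranslated or misaligned segments like these can be screened for heuristically. A rough sketch (the length-ratio threshold is arbitrary): flag pairs whose source and target are identical, or whose token-length ratio is extreme:

```python
def screen_parallel(src_lines, tgt_lines, max_ratio=3.0):
    """Return indices of suspicious sentence pairs: identical
    source/target (likely untranslated) or extreme length ratios
    (likely misaligned)."""
    assert len(src_lines) == len(tgt_lines), "line counts differ: corpora misaligned"
    suspicious = []
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        s, t = s.strip(), t.strip()
        if s == t:
            suspicious.append(i)          # copied-through / untranslated line
            continue
        ls, lt = max(len(s.split()), 1), max(len(t.split()), 1)
        if max(ls, lt) / min(ls, lt) > max_ratio:
            suspicious.append(i)          # extreme length ratio
    return suspicious

src = ["Guten Morgen", "This line was never translated"]
tgt = ["Good morning", "This line was never translated"]
print(screen_parallel(src, tgt))  # [1]
```

Dropping the flagged pairs before preprocess.lua is a common cleanup step; a proper fix would use a language-identification tool on the German side, which is beyond this sketch.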