Question on some results on the Benchmark platform

Hi @alexandrapriori (not sure this user is registered here)

When I look at the result of this config:

for newstest2014, it reports 29.21 BLEU.

Is it the plain default config? On my end, I never got higher than 25 with the 1M training set and default parameters.

What BLEU tool did you use? mt-eval? multibleu? Tokenized or detokenized?


Hi @vince62s

We trained our baselines with the default parameters.
The BLEU tool we use for WMT tests is mt-eval. We submit detokenized outputs, but the tool performs a basic tokenization before evaluating.
For generic_tests, however, we use multibleu, with no tokenization.
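Whether the metric tokenizes before matching n-grams can shift BLEU noticeably, which is why the mt-eval vs. multibleu distinction matters here. A minimal single-sentence sketch of that effect (illustration only; the real mt-eval and multi-bleu scripts score whole test sets and differ in details):

```python
from collections import Counter
import math

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-sentence BLEU on whitespace tokens, for illustration only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        # Tiny floor instead of proper smoothing, to keep the sketch short.
        precisions.append(overlap / total if overlap else 1e-9)
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Detokenized: punctuation is glued to words, so "said." != "said".
detok_score = bleu("Good results he said.", "Good results, he said.")
# After a basic tokenization that splits punctuation, more n-grams match.
tok_score = bleu("Good results he said .", "Good results , he said .")
print(tok_score > detok_score)  # prints True
```

The same hypothesis/reference pair scores differently depending on whether punctuation was split off first, so BLEU numbers are only comparable when the tokenization step is the same.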


Hmm, that looks really high then.
Would you mind sharing the full training command line as well as the tokenization parameters used?

Many thanks.

We tokenized our baselines with tokenize.lua -joiner_annotate

Preprocess and training parameters are the default ones.
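For context, the -joiner_annotate option marks merge points with the ￭ joiner (U+FFED) so the tokenization is reversible before scoring. A minimal sketch of the corresponding detokenization step (my own illustration, not the actual OpenNMT detokenizer code):

```python
JOINER = "\uffed"  # the default OpenNMT joiner character (U+FFED)

def detokenize(tokenized: str, joiner: str = JOINER) -> str:
    """Undo joiner-annotated tokenization by merging tokens around the joiner."""
    # A joiner attached to a token means "glue this token to its neighbour";
    # removing the joiner together with the adjacent space restores the surface form.
    return tokenized.replace(" " + joiner, "").replace(joiner + " ", "")

print(detokenize("Hello \uffed, world \uffed."))  # prints "Hello, world."
```

This round-trip is what lets the detokenized submission described above still be scored consistently.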


Hello @alexandrapriori

Would you mind double-checking your results, at least for English to French?
It doesn't make sense that you get similar results on the generic test set and the news test set.

I ran the default config and got 27.31 on newstest2014 with mt-eval.
For the generic (very in-domain) test set I got 43.37.

Results may have shifted from one column to the other.

My very first comment on the thread mentioned a high of 25 or so, because I was using case_feature and BPE. So, surprisingly, we got worse results with these two features …

Hello @vince62s,

I checked the file and output IDs. Results haven't shifted from one column to another.
I also checked that the newstest2014 test file wasn't in the training corpus.

Is enfr the only language pair that displays scores that look wrong to you?