We trained our baselines with the default parameters.
The BLEU tool we use for WMT tests is mt-eval. We submit detokenized output, but the tool applies a basic tokenization before evaluating.
For generic_tests, however, we use multibleu, which applies no tokenization at all.
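For reference, both behaviours can be reproduced with sacrebleu. That is not the tool used in this thread, but it implements an mteval-style "13a" tokenization as well as a raw "none" mode, so it shows how much the tokenization step alone can move the score. A minimal sketch:

```python
# Minimal sketch using sacrebleu (pip install sacrebleu). NOT the tool used in
# this thread, but it implements both behaviours discussed above:
#   tokenize="13a"  -> mteval-style basic tokenization before scoring
#   tokenize="none" -> score raw detokenized text, like multibleu with no
#                      tokenization applied
import sacrebleu

hyps = ["The cat sat on the mat."]      # detokenized system output
refs = [["The cat sat on the mat."]]    # one inner list per reference set

bleu_tokenized = sacrebleu.corpus_bleu(hyps, refs, tokenize="13a")
bleu_raw = sacrebleu.corpus_bleu(hyps, refs, tokenize="none")

print(f"13a: {bleu_tokenized.score:.2f}  none: {bleu_raw.score:.2f}")
```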
Would you mind double-checking your results, at least for English to French?
It does not make sense to me that you get similar results on the generic test set and the news test set.
I ran the default config and got 27.31 on newstest2014 with mt-eval.
On the generic test set (very in-domain) I got 43.37.
Your results may have shifted from one column to the other.
@jean.senellart
My very first comment on the thread reported scores in the 25 range or so, because I was using case_feature and BPE. So, surprisingly, we got worse results with these two features …
I checked the file and output IDs; the results haven't shifted from one column to another.
I also checked that the newstest2014 test file wasn't in the training corpus.
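In case it helps anyone reproduce the check, here is one way to count verbatim test/train overlap; the file paths are hypothetical placeholders, not the actual corpus locations:

```python
# Minimal sketch of a test/train overlap check; "train.en" and
# "newstest2014.en" are placeholder paths, not the real corpus files.
train_lines = set()
with open("train.en", encoding="utf-8") as f:
    for line in f:
        train_lines.add(line.strip())

overlap = 0
with open("newstest2014.en", encoding="utf-8") as f:
    for line in f:
        if line.strip() in train_lines:
            overlap += 1

print(f"{overlap} test sentences appear verbatim in the training corpus")
```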
Is enfr the only language pair that displays scores that look wrong to you?