Basic example OpenNMT and Moses PT<>ES CA<>ES BLEU score results


First of all, excuse me if this post is trivial; I am just trying to learn.

Because we are not very familiar with MT, we ran some basic experiments to get a few figures.

We used (with some cleaning) the bulk ES<>PT Europarl corpus and the ES<>CA (Catalan) corpus from the Catalan Government Official Diary (DOGC), each one close to 2 million sentences/lines.

We segmented both corpora using the Moses scripts, ran them through plain Moses and plain OpenNMT, and compared the outputs with the Moses multi-bleu score script. Our findings are somewhat expected, but I think they are interesting:

  • BLEU scores for OpenNMT and Moses are almost the same. What surprises us is the reason for this similarity, as it is not obvious to us. We assume it is because the underlying technology is somewhat similar. Also, as others have seen, Moses BLEU scores are (a little bit) better. We are not sure whether this is because of how BLEU is calculated (MT output file vs. gold translation file), because the “quality” really is the same, or because, as others quickly say, “you should dismiss BLEU as a valid metric”.

  • The Europarl PT<>ES scores are in the 35-40 range, but the DOGC CA<>ES scores are in the 85-90 range. PT, ES and CA are very close languages, so we expected the scores to be similar, but they are not. Obviously, the difference lies in the corpus. So, as we have also heard, you first need good data; otherwise no system will deliver good results.

  • We have also seen that you can get a good sense of the OpenNMT results without the full corpus or all the training epochs, since after some point BLEU scores do not change dramatically.
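
For anyone wondering what "MT translation file vs gold translated file" means in practice, below is a minimal sketch of a corpus-level BLEU computation in Python. It is not the Moses multi-bleu script itself, only an illustration of the same idea: clipped n-gram precision up to 4-grams combined with a brevity penalty. It assumes one reference per line and plain whitespace tokenization, and the function names `ngrams` and `corpus_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one whitespace-tokenized reference per line."""
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams produced by the MT system, per order
    hyp_len = ref_len = 0
    for hyp_line, ref_line in zip(hypotheses, references):
        hyp, ref = hyp_line.split(), ref_line.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it
            # appears in the reference line.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the four precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Outputs shorter than the reference are penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    mt_output = ["the cat sat on the mat"]
    gold = ["the cat sat on a mat"]
    print(corpus_bleu(mt_output, gold))
```

A single substituted word already breaks several 3-grams and 4-grams, which is one reason small BLEU differences between two systems are hard to interpret.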

What we have learnt is that a good corpus is probably the main basket for your eggs. Also, it looks like we should be very careful with any conclusion based on BLEU scores or on minimal BLEU score changes.

If anyone is interested, here is the link with some more detail

Hope this helps
have a nice day
miguel canals



Interesting report, thanks for sharing!

Looks like you have explored the data preparation side well, but did you also compare different model and training configurations (e.g. number of epochs) in OpenNMT? Your writeup seems to indicate that you used the default parameters, which are actually not the best for achieving an optimal BLEU score.

If that’s the case, you should compare with a larger NMT model, e.g. in OpenNMT-lua:

-encoder_type brnn -layers 4 -rnn_size 800

Also, 172K is not a lot of resources by NMT standards. It is known that SMT performs better than NMT in the small-data regime.