First of all, excuse me if this post is trivial; we are just trying to learn.
Because we are not very familiar with MT, we ran some basic experiments.
We used (after some cleaning) the ES<>PT Europarl bulk corpus and the ES<>CA (Catalan) corpus from the Catalan Government Official Diary (DOGC), each close to 2 million sentences/lines.
We segmented both corpora using the Moses scripts, ran them through plain Moses and plain OpenNMT, and compared the outputs with the Moses multi-bleu script. Our findings are somewhat expected, but I think they are interesting:
The BLEU scores for OpenNMT and Moses are almost the same. What surprises us is the reason for this similarity, as it does not look obvious to us. Our understanding is that this is because the underlying technology is somehow the same. Also, as others have seen, the Moses BLEU scores are (a little bit) better. We are not sure whether this is because of how BLEU is calculated (MT output file vs. gold translation file), because the “quality” really is the same, or because, as others quickly say, “you should dismiss BLEU as a valid metric”.
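To make the "MT output file vs. gold translation file" comparison more concrete, here is a minimal Python sketch of a corpus-level BLEU computation (clipped n-gram precision up to 4-grams plus a brevity penalty). It is not the Moses script itself: the function names `ngrams` and `corpus_bleu` are made up for illustration, it assumes one reference per sentence and already tokenized, whitespace-separated text, and it may differ from multi-bleu in details.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified single-reference corpus BLEU on a 0-100 scale.

    hypotheses, references: parallel lists of tokenized sentences (strings).
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams produced by the MT system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # An n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only output shorter than the reference is penalized.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    # Toy sentences, not taken from our corpora.
    mt_output = ["the cat sat on the mat today"]
    gold = ["the cat sat on the mat"]
    print(corpus_bleu(mt_output, gold))
```

Because the score only counts surface n-gram overlap with one gold file, a perfectly valid translation that uses different wording is penalized, which is one reason to be careful when reading small differences.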
The Europarl PT<>ES scores are in the 35-40 range, but the DOGC CA<>ES scores are in the 85-90 range. PT, ES and CA are very close languages, so the scores should be very similar, but they are not. Obviously, the difference lies in the corpus. So, as we have also heard, you first need good data; otherwise, no system will deliver good results.
We have also seen that you can get a good sense of the OpenNMT results without the full corpus or all the epochs, as after some point the BLEU scores do not change dramatically.
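One simple way to spot that point is to look at the BLEU score after each epoch and stop once the gains become negligible. The sketch below is only an illustration: the function name `plateau_epoch`, the `min_gain` and `patience` parameters, and the scores in the example are all made up, and none of them are our measured results or part of OpenNMT.

```python
def plateau_epoch(bleu_by_epoch, min_gain=0.5, patience=2):
    """Return the 1-based epoch of the last meaningful BLEU improvement.

    An improvement counts as meaningful if it beats the best score so far
    by at least `min_gain`. Once `patience` consecutive epochs fail to do
    so, the epoch of the best score is returned. Returns None if the
    scores never flatten out within the given list.
    """
    best_score = float("-inf")
    best_epoch = None
    stale = 0  # consecutive epochs without a meaningful improvement

    for epoch, score in enumerate(bleu_by_epoch, start=1):
        if score - best_score >= min_gain:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch
    return None


if __name__ == "__main__":
    # Hypothetical per-epoch scores, for illustration only.
    scores = [20.0, 30.0, 34.0, 35.5, 35.7, 35.8, 35.9]
    print(plateau_epoch(scores))
```

The thresholds are a judgment call: with a larger `min_gain` you stop earlier, at the cost of possibly missing slow but steady improvements.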
What we have learnt is that a good corpus is probably the main basket for your eggs. Also, it looks like we should be very careful with any conclusion based on BLEU scores or on minimal BLEU score changes.
If anyone is interested, more detail is available here: http://www.mknals.com/04_2_OpenNMTvsMoses.html
Hope this helps.
Have a nice day.