@jean.senellart what you mention here seems to bring only a very small improvement, as you say yourself.
On the other hand, back-translation gives a huge improvement, which matches the gains I saw in the WMT17 results.
This raises two questions for me:
- even with a base corpus of 5M segments, back-translated monolingual data gives a big improvement, which suggests that 5M segments is still far from optimal (see the sketch of the back-translation pipeline after this list)
- as we discussed once, do you think that changing the objective function to include a language-model component trained on the target corpus could help more than the "freeze encoder + monolingual in target" method alone? (see the second sketch below)
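
For the first point, here is a minimal sketch of the back-translation idea, just to make the setup explicit. The helper names (`reverse_model.translate`, etc.) are hypothetical, not OpenNMT's actual API; it assumes a reverse (target-to-source) model has already been trained:

```python
def back_translate(reverse_model, target_monolingual):
    """Create synthetic parallel data by translating target-side
    monolingual sentences back into the source language."""
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        # Translate target -> source with the reverse model.
        src_synthetic = reverse_model.translate(tgt_sentence)
        # Pair the synthetic source with the genuine target sentence,
        # so the target side of the training data stays clean.
        synthetic_pairs.append((src_synthetic, tgt_sentence))
    return synthetic_pairs

# The synthetic pairs are then mixed with the genuine 5M-segment
# corpus and the source->target model is retrained on the union.
```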
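For the second point, what I have in mind is something like the following combined loss: the usual translation cross-entropy plus a weighted LM term where the decoder is run as a language model over target-side monolingual text. This is only an illustrative sketch of the question, not an existing OpenNMT option; `lm_weight` and the logit/target arguments are assumptions:

```python
import torch

def combined_loss(nmt_logits, nmt_targets, lm_logits, lm_targets,
                  lm_weight=0.1):
    """Joint objective: translation cross-entropy plus a target-side
    language-model term (hypothetical formulation)."""
    ce = torch.nn.functional.cross_entropy
    nmt_loss = ce(nmt_logits, nmt_targets)  # standard translation term
    lm_loss = ce(lm_logits, lm_targets)     # decoder-as-LM term on
                                            # target monolingual data
    # The LM term pushes the decoder toward fluent target text directly,
    # rather than indirectly via "freeze encoder + monolingual in target".
    return nmt_loss + lm_weight * lm_loss
```

Intuitively, this trains the decoder on the monolingual target data in every update, instead of relying on synthetic source sentences or a frozen encoder. Do you think that would capture more of the benefit?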