Use of monolingual corpora in OpenNMT

(Pasindu Nivanthaka Tennage) #1

Monolingual corpora can be used in NMT in two different ways. According to the research paper, the parallel corpus should be changed to include the monolingual text, and some parameters must be frozen. Does OpenNMT support these features?

(Guillaume Klein) #2

Are you referring to this paper?

(Pasindu Nivanthaka Tennage) #3

Yes, it's the paper that I'm following. To use it, do I have to change the OpenNMT code? Or can I use the existing code with a modified parallel corpus?

(Guillaume Klein) #4

You can do the backtranslation technique without changing the code.
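To make the data flow concrete, here is a minimal sketch of how the training corpus for backtranslation could be assembled outside of OpenNMT. It is not OpenNMT code: `build_backtranslated_corpus` and the `back_translate` callable are hypothetical names, and `back_translate` stands in for whatever target-to-source model you have already trained. The resulting pairs would then go through the usual preprocessing and training steps unchanged.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_backtranslated_corpus(
    parallel: Iterable[Pair],
    mono_target: Iterable[str],
    back_translate: Callable[[str], str],
) -> List[Pair]:
    """Return the real pairs followed by synthetic pairs.

    Each synthetic pair is (machine-translated source, real target),
    so the target side of the corpus always stays human-written.
    `back_translate` is a hypothetical target-to-source translator.
    """
    corpus = list(parallel)
    for tgt in mono_target:
        tgt = tgt.strip()
        if not tgt:
            continue  # skip blank lines in the monolingual data
        corpus.append((back_translate(tgt), tgt))
    return corpus


if __name__ == "__main__":
    # Toy stand-in for a real target-to-source model (assumption).
    fake_model = lambda sentence: sentence.upper()
    pairs = build_backtranslated_corpus(
        [("ein Haus", "a house")], ["a small garden"], fake_model
    )
    for src, tgt in pairs:
        print(src, "|||", tgt)
```

The source and target sides of the returned pairs can be written to two line-aligned files, which is the format a parallel corpus normally takes.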

(Pasindu Nivanthaka Tennage) #5

Thank you for the response.
How about the empty source side technique? The paper says that some network parameters should be frozen when training with an empty source side, since the output then depends only on previously generated words (i.e. there are no annotation vectors).

(jean.senellart) #6

The paper mentions that using empty source sentences does not give results as good as backtranslation, but it might still be interesting to test since it is easier. What I would personally do to experiment is: a/ modify preprocess to allow empty source sentences, and b/ add an option to freeze the encoder when the source sentence is empty, which can be done easily by exiting Seq2Seq:trainNetwork just after decoder:backward (
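The control flow of step b/ can be illustrated with a toy sketch. This is plain Python, not the OpenNMT Lua code: `groups_to_update` and `sgd_step` are hypothetical names, and parameters are simple lists of floats. The only point is that an example with an empty source updates the decoder and leaves the encoder untouched.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def groups_to_update(source_tokens: List[str]) -> List[str]:
    """Freeze the encoder whenever the source sentence is empty."""
    if not source_tokens:
        return ["decoder"]
    return ["encoder", "decoder"]


def sgd_step(
    params: Params, grads: Params, source_tokens: List[str], lr: float = 0.1
) -> Params:
    """Return updated parameters; frozen groups are copied unchanged."""
    updated = {name: list(values) for name, values in params.items()}
    for group in groups_to_update(source_tokens):
        updated[group] = [
            p - lr * g for p, g in zip(params[group], grads[group])
        ]
    return updated


if __name__ == "__main__":
    params = {"encoder": [1.0], "decoder": [1.0]}
    grads = {"encoder": [1.0], "decoder": [1.0]}
    # Monolingual example (empty source): only the decoder moves.
    print(sgd_step(params, grads, [], lr=0.5))
    # Regular parallel example: both groups move.
    print(sgd_step(params, grads, ["hello"], lr=0.5))
```

In a real implementation the same decision would be taken inside the training loop, for instance by returning early before the encoder's backward pass as suggested above.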

(Vincent Nguyen) #7

@jean.senellart what you mention here seems to give only a very small improvement, as you say.
On the other hand, backtranslation gives a huge improvement, which I saw again in the WMT17 results.

This raises two questions for me:

  • even on a base corpus of 5 M segments, backtranslated monolingual data gives a big improvement, which means that 5 M segments is basically far from optimal

  • as we discussed once, do you think that changing the objective function to include an LM component based on the target corpus could help more than just using the “freeze encoder + monolingual in target” method?

(jean.senellart) #8

Hi Vincent, I will now be working on:

Sennrich, 2015 mentions that a backtranslated synthetic corpus gives better results, but even if that is true, a successful integration of an LM could open up the possibility of using a very large LM and/or tuning existing systems without retraining.
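As an illustration of what tuning an existing system without retraining could look like, here is a minimal sketch of one common way to combine a translation model with a target-side LM at decoding time, often called shallow fusion. This is an assumption about the general idea, not a description of what is planned for OpenNMT. The function names, the `lm_weight` parameter and the toy probabilities are all hypothetical.

```python
import math
from typing import Dict


def fuse_scores(
    tm_logprobs: Dict[str, float],
    lm_logprobs: Dict[str, float],
    lm_weight: float = 0.3,
) -> Dict[str, float]:
    """Log-linear combination: log p_TM(token) + lm_weight * log p_LM(token).

    Both dictionaries are assumed to cover the same candidate tokens.
    """
    return {
        token: tm_lp + lm_weight * lm_logprobs[token]
        for token, tm_lp in tm_logprobs.items()
    }


def best_token(fused: Dict[str, float]) -> str:
    """Pick the candidate with the highest fused score."""
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Toy next-token distributions (made-up numbers for illustration).
    tm = {"house": math.log(0.6), "home": math.log(0.4)}
    lm = {"house": math.log(0.1), "home": math.log(0.9)}
    print(best_token(fuse_scores(tm, lm, lm_weight=0.0)))  # translation model only
    print(best_token(fuse_scores(tm, lm, lm_weight=1.0)))  # LM given full weight
```

Because the LM only enters at scoring time, its weight can be tuned on a development set and the LM itself can be swapped without touching the translation model's parameters.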