@jean.senellart what you mention here seems to bring only a very small improvement, as you say yourself.
On the other hand, back-translation gives a huge improvement, which matches the gains I saw in the WMT17 results.
This raises two questions for me:
- even with a base corpus of 5M segments, back-translated monolingual data gives a big improvement, which suggests that 5M segments is still far from optimal (see the sketch of the back-translation pipeline after this list)
- as we discussed once, do you think that changing the objective function to include a language-model component trained on the target corpus could help more than the "freeze encoder + monolingual in target" method alone? (see the second sketch below)
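
For the first point, here is a minimal sketch of the back-translation idea, just to make the setup explicit. The helper names (`reverse_model.translate`, etc.) are hypothetical, not OpenNMT's actual API; it assumes a reverse (target-to-source) model has already been trained:

```python
def back_translate(reverse_model, target_monolingual):
    """Create synthetic parallel data by translating target-side
    monolingual sentences back into the source language."""
    synthetic_pairs = []
    for tgt_sentence in target_monolingual:
        # Translate target -> source with the reverse model.
        src_synthetic = reverse_model.translate(tgt_sentence)
        # Pair the synthetic source with the genuine target sentence,
        # so the target side of the training data stays clean.
        synthetic_pairs.append((src_synthetic, tgt_sentence))
    return synthetic_pairs

# The synthetic pairs are then mixed with the genuine 5M-segment
# corpus and the source->target model is retrained on the union.
```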
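For the second point, what I have in mind is something like the following combined loss: the usual translation cross-entropy plus a weighted LM term where the decoder is run as a language model over target-side monolingual text. This is only an illustrative sketch of the question, not an existing OpenNMT option; `lm_weight` and the logit/target arguments are assumptions:

```python
import torch

def combined_loss(nmt_logits, nmt_targets, lm_logits, lm_targets,
                  lm_weight=0.1):
    """Joint objective: translation cross-entropy plus a target-side
    language-model term (hypothetical formulation)."""
    ce = torch.nn.functional.cross_entropy
    nmt_loss = ce(nmt_logits, nmt_targets)  # standard translation term
    lm_loss = ce(lm_logits, lm_targets)     # decoder-as-LM term on
                                            # target monolingual data
    # The LM term pushes the decoder toward fluent target text directly,
    # rather than indirectly via "freeze encoder + monolingual in target".
    return nmt_loss + lm_weight * lm_loss
```

Intuitively, this trains the decoder on the monolingual target data in every update, instead of relying on synthetic source sentences or a frozen encoder. Do you think that would capture more of the benefit?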