I recently trained the basic model on the Hindi-English IITB parallel corpus (1.5M+ sentence pairs, http://www.cfilt.iitb.ac.in/iitb_parallel/). I achieved a BLEU score of about 13.72, which is one of the highest reported for this dataset. Is there any functionality in OpenNMT to re-train the model with monolingual data (English and Hindi monolingual corpora) to improve the score?
Hello, the paper presents two methods: the first is to add fake pairs, i.e. monolingual target sentences paired with a dummy source, and the second is to add back-translated sentences. The former requires a bit of development and is not very effective; the latter just requires that you train another engine on the same bilingual data but in the other direction (Hindi>English if you are interested in English>Hindi). You can then translate your Hindi monolingual corpus with it, generating a large back-translated Hindi-English corpus that you reverse to use for your English-Hindi model. This is a very common practice when data is a bit scarce, or for domain specialization, and it works very well.
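For concreteness, here is a minimal sketch of that workflow with OpenNMT-py's classic scripts. All file names are placeholders, and the exact flags may vary slightly between versions:

```bash
# 1. Train the reverse (Hindi>English) engine on the same bilingual data,
#    with source and target swapped.
python preprocess.py -train_src train.hi -train_tgt train.en \
    -valid_src valid.hi -valid_tgt valid.en -save_data data/hi-en
python train.py -data data/hi-en -save_model hi-en-model

# 2. Back-translate the Hindi monolingual corpus into English.
python translate.py -model hi-en-model_xx.xx.xx.pt \
    -src mono.hi -output mono.backtranslated.en

# 3. (mono.backtranslated.en, mono.hi) now forms a synthetic English-Hindi
#    corpus to mix with the true bilingual data.
```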
You don’t need any special code adaptation for that.
Is there any implementation in OpenNMT through which we can add more words to the vocabulary? Or shall we keep the vocabulary as it is and only change the training data?
You just train the reverse model like any other model, until it converges. For the volume of back-translated data you can use, the rule of thumb is to keep it in the same order of magnitude as the available bilingual corpus. If you mix in too much back-translated data relative to the actual true bilingual data, we observed that the quality of the final model starts decreasing.
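As an illustration of the mixing step, here is a minimal, hypothetical shell sketch that concatenates the true and synthetic corpora and shuffles them while keeping source and target lines aligned (file names are placeholders):

```bash
cat train.en mono.backtranslated.en > mixed.en
cat train.hi mono.hi > mixed.hi
# paste/shuf keeps each source line paired with its target line.
paste mixed.en mixed.hi | shuf > mixed.tsv
cut -f1 mixed.tsv > mixed.train.en
cut -f2 mixed.tsv > mixed.train.hi
```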
With OpenNMT-lua and OpenNMT-tf, you can update the vocabulary dynamically, but for the point discussed in this thread you do not actually need to do that, do you?
Thank you very much.
I think we don’t need it, as the model will be trained using the new vocabulary.
So, the process of back-translation in OpenNMT can be implemented in the following way (see the command sketch after the list):

1. Train a reverse model.
2. Translate the monolingual data with it to generate synthetic pairs.
3. Pre-process the combined bilingual + synthetic data using preprocess.py, which creates new demo.train.pt, demo.valid.pt and demo.vocab.pt files.
4. Train the model on the newly created PyTorch data files, resuming from the pre-trained model with “-train_from demo-model_xx.xx.xx.pt”.
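For concreteness, here is a minimal sketch of steps 3 and 4 with OpenNMT-py's classic scripts; the file names are placeholders and the checkpoint name follows the pattern above:

```bash
# Step 3: build new demo.*.pt files from the combined bilingual + synthetic corpus.
python preprocess.py -train_src mixed.train.en -train_tgt mixed.train.hi \
    -valid_src valid.en -valid_tgt valid.hi -save_data data/demo

# Step 4: continue training from the existing checkpoint on the new data.
# Note: -train_from generally expects the new data's vocabulary to be
# compatible with the checkpoint's vocabulary.
python train.py -data data/demo -save_model demo-model \
    -train_from demo-model_xx.xx.xx.pt
```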
Am I correct?
I am getting an issue during the re-training: the model never re-trains. After showing “Number of examples: 31456” it stops, and it doesn’t even display an error message.