I recently trained the basic model on the Hindi-English IITB parallel corpus (1.5M+ sentence pairs, http://www.cfilt.iitb.ac.in/iitb_parallel/). I achieved a BLEU score of about 13.72, which is one of the highest reported for this dataset. Is there any functionality in OpenNMT to re-train the model with monolingual data (English and Hindi monolingual corpora) to improve the score?
Hello, the paper presents two methods: the first is to add fake pairs, i.e. monolingual target sentences paired with a dummy source, and the second is to add back-translated sentences. The former requires a bit of development and is not very effective; the latter just requires that you train another engine on the same bilingual data but in the other direction (Hindi>English if you are interested in English>Hindi). You can then translate your Hindi monolingual corpus with it, generating a large back-translated Hindi-English corpus that you reverse to use for your English-Hindi model. This is a very common practice when data is a bit scarce, or for domain specialization, and it works very well.
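For concreteness, here is a minimal sketch of that workflow with OpenNMT-py's classic scripts. All file names are placeholders, and the exact flags may vary slightly between versions:

```bash
# 1. Train the reverse (Hindi>English) engine on the same bilingual data,
#    with source and target swapped.
python preprocess.py -train_src train.hi -train_tgt train.en \
    -valid_src valid.hi -valid_tgt valid.en -save_data data/hi-en
python train.py -data data/hi-en -save_model hi-en-model

# 2. Back-translate the Hindi monolingual corpus into English.
python translate.py -model hi-en-model_xx.xx.xx.pt \
    -src mono.hi -output mono.backtranslated.en

# 3. (mono.backtranslated.en, mono.hi) now forms a synthetic English-Hindi
#    corpus to mix with the true bilingual data.
```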
You don’t need any special code adaptation for that.
Is there any implementation in OpenNMT through which we can add more words to the vocabulary? Or shall we keep the vocabulary as it is and only change the training data?
You just train the reverse model like any other model, until it converges. For the volume of back-translated data you can use, the rule of thumb is to keep it in the same order of magnitude as the available bilingual corpus. If you mix in too much back-translated data relative to the actual true bilingual data, we observed that the quality of the final model starts decreasing.
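As an illustration of the mixing step, here is a minimal, hypothetical shell sketch that concatenates the true and synthetic corpora and shuffles them while keeping source and target lines aligned (file names are placeholders):

```bash
cat train.en mono.backtranslated.en > mixed.en
cat train.hi mono.hi > mixed.hi
# paste/shuf keeps each source line paired with its target line.
paste mixed.en mixed.hi | shuf > mixed.tsv
cut -f1 mixed.tsv > mixed.train.en
cut -f2 mixed.tsv > mixed.train.hi
```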
With OpenNMT-lua and OpenNMT-tf, you can update the vocabulary dynamically, but for the point discussed in this thread you do not actually need to do that, do you?
Thank you very much.
I think we don’t need it, as the model will be trained using the new vocabulary.
So, the process of back-translation in OpenNMT can be implemented in the following way (see the command sketch after the list):

1. Train a reverse model.
2. Translate the monolingual data with it to generate synthetic pairs.
3. Pre-process the combined bilingual + synthetic data using preprocess.py, which creates new demo.train.pt, demo.valid.pt and demo.vocab.pt files.
4. Train the model on the newly created PyTorch data files, resuming from the pre-trained model with “-train_from demo-model_xx.xx.xx.pt”.
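For concreteness, here is a minimal sketch of steps 3 and 4 with OpenNMT-py's classic scripts; the file names are placeholders and the checkpoint name follows the pattern above:

```bash
# Step 3: build new demo.*.pt files from the combined bilingual + synthetic corpus.
python preprocess.py -train_src mixed.train.en -train_tgt mixed.train.hi \
    -valid_src valid.en -valid_tgt valid.hi -save_data data/demo

# Step 4: continue training from the existing checkpoint on the new data.
# Note: -train_from generally expects the new data's vocabulary to be
# compatible with the checkpoint's vocabulary.
python train.py -data data/demo -save_model demo-model \
    -train_from demo-model_xx.xx.xx.pt
```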
Am I correct?
I am getting an issue during the re-training: the model never re-trains. After showing “Number of examples: 31456” it stops, and it doesn’t even display an error message.