Domain Adaptation can be useful when you want to train a Machine Translation model, but you only have limited in-domain data.
There are several approaches to Domain Adaptation, including:
Incremental Training / Re-training: You have a big model pre-trained on a large corpus, and you continue training it with the new data from the small in-domain corpus (a minimal command sketch follows after this list).
Ensemble Decoding (of two models): You have two models and you use both models during translation.
Combining Training Data: You merge the two corpora and train one model on the whole combined data.
Data Weighting: You give higher weights to specialized segments than to generic segments.
In this tutorial, I explain how to apply these techniques, along with best practices.
Also check this detailed explanation of the effective approach, Mixed Fine-Tuning (Chu et al., 2017).
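As a quick illustration of the first option, here is a minimal OpenNMT-py sketch, assuming placeholder config and checkpoint names; you would keep the same vocabulary and sub-word model as the baseline:

    # Incremental Training sketch -- the config and checkpoint names are placeholders.
    # config_indomain.yaml lists the small in-domain corpus and reuses the
    # baseline vocabulary and sub-word model.
    onmt_train -config config_indomain.yaml -train_from generic_step_200000.pt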
If you have questions or suggestions, please let me know.
Thank you for this useful information on Domain Adaptation. I have a question about the “Incremental Training” and “Ensemble Decoding (of two models)” methods in the provided tutorial. I am confused about the practical differences between them.
Incremental Training: “Incremental Training means to train a model on a corpus and then continue training the same model on a new corpus.”
Ensemble Decoding: “Note here that you do not train the two models independently; however, your second model is actually incrementally trained on the last checkpoint of the first model.”
In practice, in both methods the model continues training on the in-domain data.
What is the difference between these methods?
Could you please elaborate on how to implement the Ensemble Decoding method in OpenNMT-py?
Ensemble Decoding per se is not related to Incremental Training. You can ensemble any two checkpoints of your training. You just translate as usual, but you add two models instead of one.
Here is an example OpenNMT-py command, in which you use two checkpoints, model_step_900.pt and model_step_1000.pt, to translate the same input file (the source and output file names below are placeholders):
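    # passing more than one checkpoint to -model triggers ensemble decoding;
    # test.src and test.ensemble.pred are placeholder file names
    onmt_translate -model model_step_900.pt model_step_1000.pt \
                   -src test.src -output test.ensemble.pred -gpu 0 -verbose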
So, in Ensemble Decoding we use both the baseline and the fine-tuned model for prediction, while in the Incremental Training method we use only the fine-tuned model to predict. The fine-tuning step itself is the same in both methods. Is that right?
I have another question about sharing the vocab between these two models. How can we handle it in OpenNMT-py? Must I build a common vocab list on both corpora and use it? Or is it possible to update the vocabulary in the middle of training?
OpenNMT-py does not have an update vocab feature (see here). OpenNMT-tf has it.
However, this depends on the purpose of fine-tuning:
1- Unless you fine-tune for a domain with very distinctive terms (say Latin medical terms, or proper names), you already have most of the vocab you need as sub-words;
2- What fine-tuning is really good at is teaching the model to use the vocabulary differently from the baseline.
Example: chairman vs. president; or vice-president vs. deputy chairperson. All these words (or their sub-words) are expected to be in your vocabulary; what fine-tuning will do is to favour one over the other based on the common usage in the new corpus.
If you already have the two corpora before training, you need to build vocabulary (as well as the sub-wording model) on both of them first.
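For example, a rough sketch of that workflow with SentencePiece and OpenNMT-py could look like this (all file names and the vocabulary size are placeholders; repeat the sub-word step for the target side, or train a joint model):

    # 1) train one sub-word model on the concatenation of both source corpora
    cat generic.train.src in_domain.train.src > all.src
    spm_train --input=all.src --model_prefix=spm.src --vocab_size=32000

    # 2) tokenize both corpora with this model, list both of them under
    #    "data:" in config.yaml, then build the OpenNMT-py vocabulary once
    onmt_build_vocab -config config.yaml -n_sample -1

This way, the baseline and the fine-tuned model share exactly the same vocabulary, which also keeps ensembling or averaging their checkpoints possible later.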
Among all the approaches that you explain in the article, which one is the most promising? I mean, do you know of any study or paper that compares these approaches in terms of BLEU scores (or other metrics)?
My in-domain data is not exactly the same as the data I want to apply the trained model to; however, it is the most similar data I have, since they share common features like the vocabulary used. Is there any recommended technique for this scenario?
1- Train a baseline model A (or use one you already have);
2- Mix your in-domain data with some data from the baseline corpus, like 50/50 or 70/30, and fine-tune the baseline on the mixed data to get model B;
3- Translate with Ensemble Decoding, i.e. using both models A and B.
Alternatively (instead of #3), you can try averaging the last or best model checkpoints and see which gives the best result (a rough command sketch follows below).
Ensemble Decoding: You just translate as usual, but add two or more model files instead of one. Note that this will be slower than regular translation as it uses multiple models.
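As a rough sketch of the fine-tuning step and of the averaging alternative in OpenNMT-py (the config and checkpoint names are placeholders, and the averaging script path may differ depending on your version and installation):

    # 2) fine-tune the baseline on the mixed data (placeholder names)
    onmt_train -config config_mixed.yaml -train_from baseline_step_100000.pt

    # alternative to ensembling: average two checkpoints into a single model,
    # e.g. the baseline and the fine-tuned one (script ships under tools/ in
    # the OpenNMT-py repository; the path may differ in your setup)
    python OpenNMT-py/tools/average_models.py \
           -models baseline_step_100000.pt finetuned_step_105000.pt \
           -output model_avg.pt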
I guess it is better to have a mix of both. If you originally used sub-wording, this should help as well. Anyhow, domain adaptation is an experimental process, and you will probably have to run a couple of tests until you reach your desired result.
I have updated the original post to reflect a new tutorial on a production-level Domain Adaptation approach that I highly recommend, Mixed Fine-tuning (Chu et al., 2017).
Hi Yasmin, I’ve read your tutorial on domain adaptation and I found it really helpful, but I have a question. You stated in your post that if the baseline dataset significantly surpasses the in-domain dataset, it is advisable to obtain a sample that is ten times larger than the in-domain dataset. Why is this necessary? If I understood dataset weights correctly, the number of sentences per dataset should not matter, because data will be sampled with the probability each dataset has according to the weights.
The main idea of mixed fine-tuning is to retain the quality of the generic model after fine-tuning. If the in-domain data is very small, you want to add more generic data to avoid overfitting. Then you apply oversampling to increase the effect of the in-domain data. You could apply 1:1 data size instead, and compare the results; it depends on your datasets. There is a paper that had some comparisons.
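For instance, with OpenNMT-py the oversampling can be expressed through per-corpus weights in the data configuration; here is a minimal sketch, assuming placeholder paths and a 1:9 ratio:

    # excerpt from the training config -- paths and weights are placeholders
    data:
        generic:
            path_src: data/generic.sample.src
            path_tgt: data/generic.sample.tgt
            weight: 1
        in_domain:
            path_src: data/in_domain.src
            path_tgt: data/in_domain.tgt
            weight: 9

With these weights, the batch producer takes one example from the generic corpus for every nine examples from the in-domain corpus.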
Moreover, in our paper, we averaged the fine-tuned model with the generic model. This retained the quality of the generic model while improving the translation quality of the fine-tuned model when tested with the in-domain data. There are figures on the last page.
Thank you for your answer and the pointers Yasmin. I’ll try averaging both models to see if the results improve in my use case.
One thing I’m still confused about is the proportions you are mentioning. According to the documentation, when building batches we’ll sequentially take “weight” examples from each corpus. So if we have a generic dataset with 10M sentences and an in-domain one with 500k, and we use a 1:9 ratio, each bucket will be loaded with 10% sentences from the generic dataset and 90% from the in-domain one. Then the size of the datasets themselves shouldn’t matter, because the weights do all the work, and therefore there is no need to randomly sample a subset of the generic dataset ten times the size of the in-domain one. Am I missing something?
I was explaining the approach regardless of the tool. As for OpenNMT-py, assuming that is somehow true for the first epoch of the in-domain dataset, what happens in the next epochs? It is possible that different sentences will be sampled from the generic dataset than those sampled in the first epoch, right? I do not know how this affects the process.
If you are fine-tuning on the in-domain for one epoch, probably it does not matter. However, if you are training for multiple epochs, you can try both with and without splitting the generic dataset, and see how this impacts the results.
As @vince62s and @guillaumekln know more about OpenNMT-py and OpenNMT-tf, they might have more insights.