I’m trying to do in-domain adaptation by first training a generic out-of-domain model (4.5M pairs) and then fine-tuning it on in-domain data (400K pairs) after updating the vocabulary.
The scores I’m getting for the general model and for the in-domain model are pretty similar (about 55 BLEU). I thought the in-domain metrics would be much higher, but for some reason they aren’t.
I’m using opennmt-tf (TransformerFP16) with a 64K BPE vocabulary for both models (generic and in-domain).
If you haven’t already done it, I think you should evaluate both models on in-domain validation/test data to see whether the fine-tuned model actually performs better on that domain.
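For example, something like this (a minimal sketch, assuming you have already translated the in-domain test set with both models, the outputs are detokenized, sacrebleu is installed, and the file names are placeholders):

```python
# Compare the generic and fine-tuned models on the same in-domain test set.
# File names are placeholders; hypotheses are assumed to be detokenized.
import sacrebleu

with open("test.indomain.ref.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

for name, hyp_file in [("generic", "generic.hyp.txt"),
                       ("fine-tuned", "indomain.hyp.txt")]:
    with open(hyp_file, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    bleu = sacrebleu.corpus_bleu(hyps, [refs])
    print(f"{name}: BLEU = {bleu.score:.2f}")
```

If the gap between the two scores on in-domain data is small, then the fine-tuning itself (and not just the test set you chose) is what needs a closer look.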
Would you say the problem could be the proportion of out-of-domain vs. in-domain data used in the fine-tuning stage? So far I’m using only in-domain data for fine-tuning.
If you think I should use mixed-domain data:
What would you say is a good proportion of out-of-domain to in-domain data?
How much data would you say is needed for the fine-tuning stage? (Right now I have 4.5M out-of-domain pairs and 400K in-domain pairs.)
It depends on how specialized you want your in-domain model to be. The general advice is to mix some generic data with your in-domain data so the new model doesn’t completely forget how to translate more general content. To find the right proportion, you’ll probably have to experiment a bit. The amounts of out-of-domain and in-domain data you have are fine.
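If you want to try a mixed fine-tuning corpus, a script along these lines could build it (a minimal sketch; the file paths and the 1:1 ratio are just starting assumptions to experiment with, not a recommended value):

```python
# Build a mixed fine-tuning corpus: all in-domain pairs plus a random
# sample of generic pairs at a chosen ratio (here 1 generic per in-domain pair).
# File paths and GENERIC_RATIO are placeholders to experiment with.
import random

GENERIC_RATIO = 1.0  # generic pairs per in-domain pair

def read_pairs(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return list(zip(fs, ft))

indomain = read_pairs("indomain.train.src", "indomain.train.tgt")
generic = read_pairs("generic.train.src", "generic.train.tgt")

random.seed(1234)
n_generic = min(len(generic), int(len(indomain) * GENERIC_RATIO))
sampled = random.sample(generic, n_generic)

mixed = indomain + sampled
random.shuffle(mixed)

with open("mixed.train.src", "w", encoding="utf-8") as fs, \
     open("mixed.train.tgt", "w", encoding="utf-8") as ft:
    for src, tgt in mixed:
        fs.write(src)
        ft.write(tgt)
```

You can then fine-tune on mixed.train.* the same way you do now, applying the same 64K BPE model to both parts beforehand so the vocabulary stays consistent.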