Thanks, that was really clarifying. I suspected as much, but I wanted to be sure.
When fine-tuning on a new domain, my approach is to train on weighted corpora and to evaluate on a mix of in-domain and out-of-domain corpora:
- Training data: out-of-domain (60%) and in-domain (40%)
- Evaluation data: out-of-domain (50%) and in-domain (50%)
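
To make the training mix concrete, here is a rough sketch of what I mean by weighted corpora. It is only an illustration: `weighted_mix` and the toy sentences are hypothetical names I made up, not the toolkit's actual API, and I am assuming each training example is drawn by first picking a corpus according to its weight.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def weighted_mix(
    corpora: Sequence[Sequence[str]],
    weights: Sequence[float],
    seed: int = 0,
) -> Iterator[Tuple[int, str]]:
    """Yield (corpus_index, example) pairs forever.

    Hypothetical helper: each draw picks a corpus with probability
    proportional to its weight, then a random example from that corpus.
    """
    rng = random.Random(seed)
    indices = list(range(len(corpora)))
    while True:
        i = rng.choices(indices, weights=weights, k=1)[0]
        yield i, rng.choice(corpora[i])


# Toy placeholder data, not my real corpora.
out_of_domain = ["generic sentence 1", "generic sentence 2"]
in_domain = ["domain sentence 1", "domain sentence 2"]

# 60% out-of-domain, 40% in-domain, as in the training setup above.
stream = weighted_mix([out_of_domain, in_domain], weights=[0.6, 0.4])
batch = list(islice(stream, 8))
```

Over many draws, roughly 40% of the examples should come from the in-domain corpus, whatever the relative sizes of the two corpora.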
During the training process, BLEU keeps improving on the evaluation set:

Just to see what’s going on, I periodically test the model on in-domain data. What I noticed is that the model seems to gain in-domain knowledge up to some point in training, which improves on the baseline. After that point, however, even though BLEU keeps improving on the development data, the model starts performing worse on the in-domain test set. Any clues why I get this behaviour? Is it because I have out-of-domain data in the evaluation set? It happens with different percentages of weighted training data…
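
One check I could run is to score the in-domain and out-of-domain parts of the evaluation set separately at each checkpoint, and look for the step where the in-domain score peaks. A minimal sketch, assuming I already have the two BLEU scores per checkpoint; `best_in_domain_step` is a hypothetical helper and the numbers are placeholders, not my actual results:

```python
from typing import Dict, Tuple


def best_in_domain_step(scores: Dict[int, Tuple[float, float]]) -> int:
    """Return the training step with the highest in-domain BLEU.

    `scores` maps step -> (in_domain_bleu, out_of_domain_bleu).
    """
    return max(scores, key=lambda step: scores[step][0])


# Placeholder values only, chosen to mimic the pattern described above:
# out-of-domain keeps improving while in-domain peaks and then drops.
example_scores = {
    1000: (20.0, 30.0),
    2000: (22.0, 31.0),
    3000: (21.0, 33.0),
}

peak_step = best_in_domain_step(example_scores)
```

If the in-domain score peaks early while the out-of-domain score keeps rising, that would suggest the mixed evaluation set is hiding the in-domain drop.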
Thanks