It’s a run with a new vocabulary, right? How different is it?
If the new vocabulary is very different from the base one (i.e. many additions and deletions), then I would say this behavior is expected, as the optimization may move the parameters to a very different point in the search space.
It seems to me like it’s not very different:
Merged Source Vocabulary (English) - added only 3,524 new terms (50,000 --> 53,524)
Merged Target Vocabulary (Spanish) - added only 6,106 new terms (50,000 --> 56,106)
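For reference, the merge itself was simple, roughly along these lines (a minimal sketch, not the exact script; the file names are placeholders and it assumes plain one-token-per-line vocab files, with the base tokens kept first and new in-domain tokens appended at the end):

```python
# Minimal sketch of the vocabulary merge (file names are placeholders,
# vocab files assumed to be one token per line).
def merge_vocab(base_path, domain_path, out_path):
    with open(base_path, encoding="utf-8") as f:
        base = [line.strip() for line in f if line.strip()]
    seen = set(base)

    added = []
    with open(domain_path, encoding="utf-8") as f:
        for line in f:
            tok = line.strip()
            # keep only tokens the base vocabulary does not already contain
            if tok and tok not in seen:
                seen.add(tok)
                added.append(tok)

    # base tokens first, new in-domain tokens appended at the end
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(base + added) + "\n")
    return len(base), len(added)

# e.g. merge_vocab("vocab.base.en", "vocab.domain.en", "vocab.merged.en")
#      -> (50000, 3524) for the source side in my case
```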
My assumption was that the BLEU score would drop a little initially, then stabilize and improve over the base model. Is that not a reasonable expectation?
Also, any thoughts on the increasing number of empty translations in the predictions file? I'm not sure why that's happening.
Sure. FYI, the base model is built with the UN parallel corpus.
Details of the fine-tuning data -
This is data related to fire standards that we maintain internally within our organization. The language is very similar to the legal language used in patents, with measurements and metrics for building structures and equipment across various domains. We had historical translations of these standards into Spanish, and they have been curated to match each English line with its Spanish equivalent.
Sentence counts:
Train (both En and Es) - 37,000 Parallel Sentences
Validation - 1,000 Parallel Sentences
Vocabulary -
Pre-processed the data (tokenized and BPE-encoded) and built the vocabulary on the training set.
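Roughly, the preprocessing step looked like this (a sketch only, assuming subword-nmt is used for the BPE step; the file names and number of merge operations are placeholders, not my actual settings, and the Spanish side is handled the same way):

```python
# Sketch of tokenized-data BPE encoding plus vocabulary building
# with subword-nmt. File names and num_symbols are placeholders.
from collections import Counter
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn BPE codes on the already tokenized training data.
with open("train.tok.en", encoding="utf-8") as infile, \
     open("codes.en", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# 2. Apply the codes and count subword frequencies.
with open("codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)

counts = Counter()
with open("train.tok.en", encoding="utf-8") as src, \
     open("train.bpe.en", "w", encoding="utf-8") as out:
    for line in src:
        segmented = bpe.process_line(line)
        out.write(segmented)
        counts.update(segmented.split())

# 3. Write the domain vocabulary, most frequent tokens first.
with open("vocab.domain.en", "w", encoding="utf-8") as f:
    for token, _ in counts.most_common():
        f.write(token + "\n")
```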
Yes, I confirmed that the validation set consists of in-domain sentences. There might be some overlap in the language used, but it is mostly in-domain.
Also, for some of the words (from sentences that were giving blank/empty predictions) I manually traced them back and confirmed that they exist in the merged vocabulary but are never predicted for the evaluation set.
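In case it helps, the check was basically this (a rough sketch of what I did by hand; file names are placeholders and the merged vocab is assumed to be one token per line):

```python
# Which tokens from the evaluation reference exist in the merged
# vocabulary but never appear in the predictions?
def load_tokens(path):
    with open(path, encoding="utf-8") as f:
        return set(tok for line in f for tok in line.split())

with open("vocab.merged.es", encoding="utf-8") as f:
    vocab = set(line.strip() for line in f if line.strip())

reference_tokens = load_tokens("valid.bpe.es")
predicted_tokens = load_tokens("predictions.es")

in_vocab_but_never_predicted = (reference_tokens & vocab) - predicted_tokens
print(len(in_vocab_but_never_predicted),
      "reference tokens are in the merged vocabulary "
      "but never appear in the predictions")
```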
After reading a couple of blog posts about similar issues, it looks like there could be several reasons for this behaviour; the most prominent one seems to be a base model that is over-fit and fails to generalize to other data sets. (Other possible factors are data shuffling during training, regularization, and other hyperparameters.)
Fine-tuning on top of other, partially trained base models seemed to solve the issue for me, for now.