OpenNMT-tf: Fine-tuning a base model gives worse, steadily decreasing BLEU scores


After fine-tuning the base model, my BLEU scores keep dropping during evaluation:

Step: 31010

Step: 33010

Step: 35010

Looking further, I see that some of my predictions are empty (blank translations).

Has anybody else experienced this before? I'm not sure why it would do that.
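To put a number on the blank translations, it may help to count them per saved predictions file and see whether the share grows with the training step. The following is only a minimal sketch: it assumes one prediction per line in a UTF-8 text file, and the file names in the usage comment are hypothetical placeholders, not paths taken from this run.

```python
def count_empty_lines(path):
    """Return (empty, total) line counts for a predictions file.

    Assumes one prediction per line; a line with only whitespace
    is counted as an empty translation.
    """
    empty = 0
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if not line.strip():
                empty += 1
    return empty, total


def empty_ratio(path):
    """Fraction of empty predictions in the file (0.0 for an empty file)."""
    empty, total = count_empty_lines(path)
    return empty / total if total else 0.0


# Hypothetical usage: compare the ratio across evaluation steps.
# for step in (31010, 33010, 35010):
#     print(step, empty_ratio(f"predictions_{step}.txt"))
```

If the ratio rises from one evaluation to the next, the BLEU drop is probably driven at least partly by the empty outputs rather than by lower quality on the non-empty ones.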

Mohammed Ayub


It’s a run with a new vocabulary, right? How different is it?

If the new vocabulary is very different from the base one (i.e. many additions and deletions), then I would say this behavior is expected, as the optimization may move the parameters to a very different point in the search space.

It seems to me that it's not very different:
Merged source vocabulary (English): only 3,524 new terms added (50,000 --> 53,524)
Merged target vocabulary (Spanish): only 6,106 new terms added (50,000 --> 56,106)
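For reference, those additions amount to roughly 6.6% of the merged source vocabulary (3,524 / 53,524) and 10.9% of the merged target vocabulary (6,106 / 56,106). Since the earlier reply asks about deletions as well as additions, a small comparison like the one below could report both. This is a sketch only: it assumes plain-text vocabulary files with one token per line, and the file names in the usage comment are hypothetical.

```python
def load_vocab(path):
    """Read a vocabulary file, assuming one token per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def compare_vocabs(base_tokens, merged_tokens):
    """Summarize how a merged vocabulary differs from the base one."""
    base = set(base_tokens)
    merged = set(merged_tokens)
    added = merged - base      # tokens only in the merged vocabulary
    removed = base - merged    # tokens dropped from the base vocabulary
    return {
        "base_size": len(base),
        "merged_size": len(merged),
        "added": len(added),
        "removed": len(removed),
        "added_fraction": len(added) / len(merged) if merged else 0.0,
    }


# Hypothetical usage:
# stats = compare_vocabs(load_vocab("base.vocab"), load_vocab("merged.vocab"))
# print(stats)
```

A non-zero "removed" count would mean the merge also dropped base tokens, which is a different situation from a purely additive update.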

My assumption was that the BLEU score would drop a little initially, then stabilize and eventually improve on the base model. Is that a reasonable expectation?

Also, any thoughts on the increasing number of empty translations in the predictions file? I'm not sure why that’s happening.

Thanks!

Mohammed Ayub

Is there any way you could try to replicate this if I send the checkpoint and vocab?

Mohammed Ayub

Can you give more information on the training and validation sets you used for the fine-tuning?

Sure. FYI, the base model was built with the UN parallel corpus.

Details of the fine-tuning data:
This is data related to fire standards that we maintain internally within our organization. The language is very similar to the legal language used in patents, and includes measurements and metrics for building structures and equipment across various domains. We had historical translations of these standards into Spanish, which have been curated so that each English line matches its Spanish equivalent.

Sentence Sizes:
Train (both En and Es) - 37,000 Parallel Sentences
Validation - 1,000 Parallel Sentences

Vocabulary:
I pre-processed the training data (tokenized and BPE-encoded) and built the vocabulary on it.

Let me know if you need any specific information.

Thanks!
Mohammed Ayub

Can you confirm that the validation set is in-domain and not generic data?

Yes, I confirmed that the validation set consists of in-domain sentences. There might be minor overlap in the language used, but it is mostly in-domain.
Also, for some of the words that were giving blank/empty predictions, I manually traced them back and checked that they exist in the merged vocabulary but are not predicted for the evaluation set.
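The manual tracing described above could also be automated along the following lines. This is a rough sketch under several assumptions: the reference and prediction files are tokenized the same way as the vocabulary (whitespace-separated BPE tokens), and all names here are illustrative rather than part of OpenNMT-tf.

```python
from collections import Counter


def never_predicted(reference_lines, prediction_lines, vocab):
    """Find in-vocabulary reference tokens that never appear in the predictions.

    Returns a dict mapping each such token to its count in the references.
    Tokens are assumed to be whitespace-separated.
    """
    ref_counts = Counter(
        tok for line in reference_lines for tok in line.split()
    )
    predicted = {tok for line in prediction_lines for tok in line.split()}
    vocab = set(vocab)
    return {
        tok: count
        for tok, count in ref_counts.items()
        if tok in vocab and tok not in predicted
    }


# Hypothetical usage with lists of lines read from the validation
# references, the saved predictions, and the merged target vocabulary:
# missing = never_predicted(ref_lines, pred_lines, target_vocab)
# print(sorted(missing.items(), key=lambda kv: -kv[1])[:20])
```

If most of the tokens it reports are among the newly added vocabulary entries, that would point at the new embeddings not being learned well enough during fine-tuning.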

Hi @guillaumekln

After reading a couple of blog posts about similar issues, it looks like there could be several reasons for this behaviour. The most prominent one seems to be that the base model is overfit and fails to generalize to other data sets. (Other possible reasons are data shuffling during training, regularization, and other hyperparameters.) Does that sound plausible?

Fine-tuning from other, partially trained base models seemed to solve the issue for me, for now.