Scorer - Test set vs Validation set


I was curious to know why, when we use the “scorer” option, the BLEU score is computed on the validation set instead of the test set?

Am I wrong to believe there is a bias in using the BLEU score from the validation set?

From my understanding, the model could start overfitting while the BLEU score keeps increasing.

Thank you,


I believe the training should not know anything about the test sets. The test sets are similar to real data, in that they are natural language without further processing (so what you compare are predicted texts after any post-processing). Validation sets, on the other hand, are part of the training process, and they are already preprocessed and tokenised (at least in my case, since I use offline tokenisation). So scores on validation are usually higher, but this does not mean the model overfits.

However, I did notice that BLEU is less informative about overfitting than loss or perplexity. It seems to take longer to reflect possible overfitting, while loss might provide feedback on errors earlier. So it is not uncommon that BLEU keeps improving while loss even starts to degrade.

If you don’t have much data, or you think your model might overfit, I would try using validation loss instead of BLEU as the metric to stop the training, then compare the models on the test sets to see which metric turns out to be more reliable for you.
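As an illustration of the stopping criterion I mean (a minimal sketch, not OpenNMT code — the function name, `patience`, and `min_improvement` values are hypothetical), early stopping on validation loss can look like this:

```python
def should_stop(val_losses, patience=3, min_improvement=0.0):
    """Return True when validation loss has not improved for `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    # Stop only if none of the last `patience` evaluations beat the previous best
    return all(loss > best_before - min_improvement for loss in recent)

# Loss improved until 1.7, then plateaued for 3 evaluations -> stop
print(should_stop([2.0, 1.8, 1.7, 1.72, 1.71, 1.73]))  # True
```

The same idea applied to BLEU (with the comparison reversed, since higher is better) would typically fire later, which is why loss tends to be the safer stopping signal.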

Hope this helps.


You answered all my questions really well, even the one I didn’t explicitly ask!

I still have one question based on your answer. In OpenNMT-tf, there are 2 loss graphs:


  1. I assume the one to use is the one from “metrics”, and my model starts overfitting at around 10k steps.

Exactly, the first loss is the training loss. If you don’t have much data, it will keep improving until the model “learns” all the data. The second is the validation loss, which is the one to use as the main criterion to stop the training. From the graphs, I would say your model starts to overfit at around 10k steps, even if BLEU still improves until 15k. As I said, you can save both models, one at 10k and the other at 15k, then predict the test sets with both models and see which metric (BLEU, loss or perplexity) is more accurate in detecting the overfitting.
