Best checkpoint selection for NMT

When selecting the best checkpoint, what is the recommended approach?

  1. The validation set scores are produced based on accuracy and perplexity. If early stopping is not used, could you advise how to select the best model? Do we need to check the BLEU score on the validation set as well, in addition to accuracy and perplexity?

  2. When early stopping is used with both accuracy and perplexity, the checkpoint reported as the best model is not the one with the highest accuracy. Is this acceptable? Or should we select based on accuracy alone?

  3. What is the impact if we rely only on accuracy for early stopping?

  1. Perplexity is a good candidate, but if you can also compute the BLEU score, that is usually better since it is more specific to the MT task (see the sketch after this list). Note that if you are training with lots of data and default Transformer parameters, just selecting the latest checkpoint is very often a solid choice.

  2. As far as I know, accuracy is rarely used as a metric for NMT. I don’t think you should rely on it for early stopping.

  3. As said in 2., you should probably ignore accuracy for early stopping.
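
For instance, a minimal sketch for scoring each checkpoint on the validation set, assuming OpenNMT-py’s onmt_translate entry point and the sacrebleu tool are available (valid.src and valid.ref are placeholder file names):

for ckpt in model_step_*.pt; do
    # Translate the validation source with this checkpoint
    onmt_translate -model "$ckpt" -src valid.src -output "hyp_${ckpt%.pt}.txt"
    # Print the checkpoint name and its BLEU score (-b prints the score only)
    echo -n "$ckpt " && sacrebleu valid.ref -i "hyp_${ckpt%.pt}.txt" -b
done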

Thank you for your reply.

In my training, the perplexity is lowest at step 30K.
But if I train for a fixed number of iterations, say up to 100K steps, the BLEU score is highest around step 70K.

What would be the implications of selecting the best checkpoint at 70K steps, going by the BLEU score?

Dear Aloka,

As we are both waiting for Guillaume’s answer, I want to share a tip about using multiple checkpoints.

In this case, I would try two things:

Averaging

For example, you can average from checkpoint 70K to the end. I would also try multiple combinations.

python3 OpenNMT-py/tools/average_models.py -models model01.pt model02.pt model03.pt -output model_avg.pt

Replace model01.pt model02.pt model03.pt with the checkpoints you want to average. You can even use bash brace expansion like model{7..10}.pt.

Then, translate with the output averaged model and see if BLEU is better than what you got from individual models.
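
As a rough sketch of trying multiple combinations (the 70K–100K step range, the 10K checkpoint interval, and the file names are assumptions; adjust them to your run), you could average from several start steps to the last checkpoint and score each result on the validation set:

for start in 70000 80000 90000; do
    # Build the list of checkpoints from $start to the last saved step
    models=$(seq -f "model_step_%.0f.pt" "$start" 10000 100000)
    python3 OpenNMT-py/tools/average_models.py -models $models -output "avg_from_${start}.pt"
    # Translate the validation set with the averaged model and score it
    onmt_translate -model "avg_from_${start}.pt" -src valid.src -output "hyp_avg_${start}.txt"
    echo -n "avg_from_${start} " && sacrebleu valid.ref -i "hyp_avg_${start}.txt" -b
done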

Ensemble Decoding

You translate as usual, but pass two (or more) model files instead of one.
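
For example, with OpenNMT-py you can pass several checkpoints to the -model option of onmt_translate and they are ensembled at decoding time (the checkpoint and file names here are placeholders):

onmt_translate -model model_step_70000.pt model_step_80000.pt -src test.src -output pred_ensemble.txt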

Note that ensemble decoding takes more translation time, though.


Using averaging and/or ensemble decoding can give you up to 2 extra BLEU points, while in some cases it makes no difference. So it is worth a try.

Kind regards,
Yasmin


I’m not sure I understand the question. Do you mean to ask what the difference is between selecting the best BLEU vs. the best perplexity, in terms of human evaluation?

Thank you for the details on ensemble decoding. It would be worth checking this out.

Which is the more common approach: ensembling or averaging the last checkpoints? Also, is the number of models a choice left to the individual?

Let me explain: I wanted to know whether it is acceptable to select the checkpoint with the highest BLEU score (70000.pt) over the checkpoint with the best perplexity (30000.pt).

The main consideration is that the averaged model does not take more time at inference, while translating with ensemble decoding takes more time because it uses multiple models at the same time.

This requires trying different combinations, but one thing to try first is averaging from the best-BLEU checkpoint to the end.

Kind regards,
Yasmin

@ymoslem, many thanks.

Sure, it’s fine to select the model with the highest BLEU score on the validation set.

@guillaumekln, many thanks.

You are welcome! Please report your results here when you try this.