Training to keep only best saved checkpoints


I’m trying to improve the way I save and average checkpoints during training, and maybe you have some ideas.

What I would like to achieve is the following: save a checkpoint every x steps (cf. save_checkpoints_steps), but only keep the n best checkpoints by BLEU (similar to what export_on_best does, but saving a regular checkpoint instead), bounded by the maximum number of checkpoints to keep (cf. keep_checkpoint_max), in order to eventually average them.

I would like to do so because I observed that the latest n checkpoints are not necessarily the best n checkpoints of the entire training. In fact, it is quite common to see a slight degradation in the latest checkpoints, which I would like to avoid when generating the averaged model.

I understand that evaluation runs independently of checkpoint saving by design, but it would be nice to have something like keep_best in addition to save_checkpoints_steps, so that training only keeps a checkpoint if its BLEU from evaluation is among the n best.
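Just to make the idea concrete, here is a rough Python sketch of the keep_best logic I have in mind (the function name, the checkpoint layout, and the keep_n parameter are all my own illustration, not an existing OpenNMT-tf API):

```python
import heapq
import os
import shutil

def update_best_checkpoints(best, ckpt_dir, step, bleu, keep_n=5):
    """Keep only the keep_n checkpoints with the highest BLEU.

    `best` is a list of (bleu, path) tuples maintained as a min-heap,
    so the worst of the kept checkpoints is always at index 0.
    Called once per evaluation, after the BLEU score is known.
    """
    path = os.path.join(ckpt_dir, "ckpt-%d" % step)
    heapq.heappush(best, (bleu, path))
    if len(best) > keep_n:
        # The new checkpoint did not make the top n, or it displaced
        # a previous one: delete whichever now has the lowest BLEU.
        _, worst_path = heapq.heappop(best)
        if os.path.isdir(worst_path):
            shutil.rmtree(worst_path)
    return best
```

The checkpoints remaining in `best` at the end of training would then be the ones fed to averaging.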

What do you think?


Did you try export_on_best with export_format: checkpoint and max_exports_to_keep?

There may be additional steps to gather the exported checkpoints for averaging, but I think what you want to do is already possible with existing options.
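Something like this in the eval block of the YAML configuration (the step value is just an example; check the user guide for the exact option names and defaults):

```yaml
eval:
  steps: 5000                # run evaluation every 5k steps
  scorers: bleu              # compute BLEU during evaluation
  export_on_best: bleu       # export only when BLEU improves
  export_format: checkpoint  # export as a checkpoint rather than a SavedModel
  max_exports_to_keep: 5     # keep at most the 5 best exports
```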

Many thanks for the hint, Guillaume.

The only issue I see there is that I would need to rely on evaluation to export the checkpoints, and I only run eval every 5k steps, while I would like to keep saving checkpoints every 1k. I guess running eval every 1k steps would be the solution, but it would definitely impact the training time.

You can have both checkpoint saving during training and checkpoint export during evaluation. They are independent and can run with different frequencies. But the “export on best BLEU” logic can obviously only apply after an evaluation.
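For example, the two frequencies are configured separately (the values below are illustrative):

```yaml
train:
  save_checkpoints_steps: 1000  # regular checkpoint every 1k steps
  keep_checkpoint_max: 8        # rolling window of recent checkpoints

eval:
  steps: 5000                   # evaluation (and best-BLEU export) every 5k steps
```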

Yes, it’s clear. Thanks for your help!