Hi,
I’m trying to improve the way I save and average checkpoints during training, and maybe you have some ideas.
What I would like to achieve is the following: save a checkpoint every x steps (cf. save_checkpoints_steps), but only keep the n checkpoints with the best BLEU (similar to what export_on_best does, but saving a checkpoint instead of an export), up to the maximum number of checkpoints to keep (cf. keep_checkpoint_max), so I can eventually average them.
I would like to do this because I have observed that the latest n checkpoints are not necessarily the best n checkpoints of the entire training. In fact, it is quite normal to see a slight degradation in the latest checkpoints, which I would like to avoid when generating the averaged model.
I understand that the evaluation runs independently from the checkpoint saving by design, but it would be nice to have something like keep_best in addition to save_checkpoints_steps, so that training only keeps a checkpoint if its evaluation BLEU is among the n best seen so far.
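To make the idea concrete, here is a minimal sketch of the selection logic I have in mind. The keep_best function and the heap-based bookkeeping are just my illustration, not an existing option in the toolkit; the idea is simply to drop the worst-scoring checkpoint whenever more than n are retained:

```python
import heapq

def keep_best(kept, new_path, new_bleu, n=5):
    """Track at most n (bleu, path) pairs; return the path to delete, if any.

    `kept` is a min-heap, so kept[0] is always the worst retained checkpoint.
    """
    heapq.heappush(kept, (new_bleu, new_path))
    if len(kept) > n:
        _, worst_path = heapq.heappop(kept)
        return worst_path  # caller would delete this checkpoint from disk
    return None

# Simulated eval BLEU after each save_checkpoints_steps interval:
scores = [20.1, 23.4, 25.0, 24.8, 25.3, 24.9, 24.2]
kept = []
for step, bleu in enumerate(scores, start=1):
    keep_best(kept, f"ckpt-{step}", bleu, n=3)

# The 3 best checkpoints survive, regardless of recency:
print(sorted(kept, reverse=True))
```

With a score sequence that degrades at the end, the retained set would be ckpt-5, ckpt-3 and ckpt-6 rather than the last three, which is exactly the set I would want to average.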
What do you think?