How long to train a model without overfitting?

Blues · December 29, 2018, 9:11pm

I am unsure which loss to look at when deciding how long to leave the model training. I am using TensorBoard to track the training and validation accuracy (see training accuracy and validation accuracy) for the machine translation data and tutorial. Is it possible to plot these two accuracies/losses in the same graph? And is there an early stopping criterion implemented to stop the model from overfitting? I am also confused about the number of epochs versus the number of training steps: the x-axis of the plot currently shows training steps, not epochs. Finally, how often is the model validated, and why do the x-axes of the training and validation plots not have the same number of steps? Thank you!

vince62s · December 30, 2018, 8:22am

cc from issues:
There is no stopping criterion. You need to watch the validation PPL/ACC and stop once it has converged.
Someone started a PR for early stopping, but it was never finished.
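
Until something like that is merged, a manual check along these lines may help. This is only a minimal sketch of a patience-based stopping rule, not the unfinished PR and not part of OpenNMT-py: the function name `should_stop` and the parameters `patience` and `min_delta` are made up for illustration, and it assumes you collect the validation perplexities yourself (for example from the training log).

```python
def should_stop(valid_ppls, patience=3, min_delta=0.0):
    """Hypothetical helper: decide whether validation perplexity has stalled.

    valid_ppls: validation perplexities in the order they were reported.
    patience:   number of most recent validations that may pass without
                improvement before we suggest stopping.
    min_delta:  minimum decrease in perplexity that counts as an improvement.
    """
    # Not enough history yet to judge convergence.
    if len(valid_ppls) <= patience:
        return False
    # Best (lowest) perplexity seen before the recent window.
    best_before = min(valid_ppls[:-patience])
    # Best perplexity inside the recent window.
    recent_best = min(valid_ppls[-patience:])
    # Stop if the recent window did not beat the earlier best by min_delta.
    return recent_best > best_before - min_delta
```

The same idea works for validation accuracy if you flip the comparison (higher is better). The checkpoint to keep would then be the one saved closest to the best validation score.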

From Ben Peters:
At the moment, there is no option for training by epochs or recording how many epochs have elapsed as the model trains. There used to be one, but it was removed earlier this year. So if you want to validate once per epoch (as is common), you’ll need to calculate how many steps constitute an epoch for yourself. If you’re training on CPU or a single GPU, this is the train set size divided by the batch size (plus one if the numbers are not divisible). It should be possible to reimplement validating by epochs without too much difficulty. If you would like to work on it yourself, the relevant code is probably mostly in onmt/Trainer.py. (I may put together a pull request about this myself)
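
The steps-per-epoch calculation described above can be sketched as follows. It is only an illustration under a few assumptions: single GPU or CPU, a batch size counted in sentences rather than tokens, and a hypothetical helper name `steps_per_epoch` that does not exist in OpenNMT-py.

```python
def steps_per_epoch(train_size: int, batch_size: int) -> int:
    """Hypothetical helper: number of training steps in one pass over the data.

    Assumes one step consumes exactly one batch of `batch_size` sentences,
    with a final smaller batch when the sizes do not divide evenly.
    """
    if train_size <= 0 or batch_size <= 0:
        raise ValueError("train_size and batch_size must be positive")
    # Integer ceiling division: rounds up when there is a remainder.
    return (train_size + batch_size - 1) // batch_size


# Example with made-up numbers: 1,000,000 sentence pairs, batches of 64.
print(steps_per_epoch(1_000_000, 64))  # 15625
```

The result can then be used as the validation interval (in steps) so that validation runs roughly once per epoch.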

vince62s · December 30, 2018, 8:24am

If you set a verbose level > 0, there is an option


OpenNMT/OpenNMT-py/blob/master/onmt/trainer.py#L204


                            logger.info('GpuRank %d: report stat step %d'
                                        % (self.gpu_rank, step))
                        self._report_step(self.optim.learning_rate,
                                          step, valid_stats=valid_stats)


                    if self.gpu_rank == 0:
                        self._maybe_save(step)
                    step += 1
                    if step > train_steps:
                        break
        if self.gpu_verbose_level > 0:
            logger.info('GpuRank %d: we completed an epoch \
                        at step %d' % (self.gpu_rank, step))


    return total_stats


def validate(self, valid_iter):
    """ Validate model.
        valid_iter: validate data iterator
    Returns:
        :obj:`nmt.Statistics`: validation loss statistics

to log a message when an epoch is completed.