I am unsure which loss to look at to decide how long to keep training the model. I am using TensorBoard to track the training and validation accuracy (see training accuracy and validation accuracy) for the machine translation data and tutorial. Is it possible to plot these two accuracies/losses in the same graph? And is there an early stopping criterion implemented to stop the model from overfitting? I am also confused about the number of epochs versus the number of training steps: the x-axis of the plot currently shows training steps, not epochs. Finally, how often is the model validated, and why do the x-axes of the training and validation plots not have the same number of steps? Thank you!
cc from issues:
There is no stopping criterion. You need to watch the validation PPL/ACC yourself and stop training once it has converged.
Someone started a PR for early stopping, but it was never finished.
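Since there is no built-in criterion, one way to make "stop when the validation PPL has converged" concrete is a patience rule applied to the validation perplexities you read off the logs. The sketch below is only an illustration and not part of OpenNMT: the function name `should_stop` and the parameters `patience` and `min_delta` are hypothetical, and you would have to collect the perplexity values yourself.

```python
def should_stop(val_ppl_history, patience=3, min_delta=0.0):
    """Return True if validation perplexity has stopped improving.

    val_ppl_history: validation perplexities, oldest first (lower is better).
    patience: number of most recent validations allowed without improvement.
    min_delta: minimum decrease in perplexity that counts as an improvement.
    All names here are hypothetical; this is not an OpenNMT API.
    """
    # Not enough validations yet to judge convergence.
    if len(val_ppl_history) <= patience:
        return False

    # Best value seen before the last `patience` validations.
    best_before = min(val_ppl_history[:-patience])
    # Best value among the last `patience` validations.
    recent_best = min(val_ppl_history[-patience:])

    # Stop if none of the recent validations beat the earlier best by min_delta.
    return recent_best > best_before - min_delta


if __name__ == "__main__":
    still_improving = [10.0, 8.0, 7.0, 6.5, 6.0, 5.5]
    plateaued = [10.0, 8.0, 7.0, 7.1, 7.2, 7.3]
    print(should_stop(still_improving))
    print(should_stop(plateaued))
```

The same check works for accuracy if you negate the values, since the rule assumes lower is better.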
From Ben Peters:
At the moment, there is no option for training by epochs or recording how many epochs have elapsed as the model trains. There used to be one, but it was removed earlier this year. So if you want to validate once per epoch (as is common), you’ll need to calculate for yourself how many steps constitute an epoch. If you’re training on CPU or a single GPU, this is the train set size divided by the batch size, rounded up (that is, plus one if the numbers are not divisible). It should be possible to reimplement validating by epochs without too much difficulty. If you would like to work on it yourself, the relevant code is probably mostly in onmt/Trainer.py. (I may put together a pull request about this myself.)
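The calculation described above can be sketched as follows. This is a minimal illustration for the CPU or single-GPU case, and it assumes the batch size is counted in sentences; if your batches are measured in tokens, the division does not apply directly. The function name `steps_per_epoch` is hypothetical.

```python
def steps_per_epoch(train_size, batch_size):
    """Number of training steps needed to see every example once.

    train_size: number of training examples (sentence pairs).
    batch_size: number of examples per step (assumed to be in sentences).
    Equivalent to train_size / batch_size, plus one if not evenly divisible.
    """
    if train_size <= 0 or batch_size <= 0:
        raise ValueError("train_size and batch_size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (train_size + batch_size - 1) // batch_size


if __name__ == "__main__":
    # 100000 / 64 = 1562.5, so an epoch takes 1563 steps.
    print(steps_per_epoch(100000, 64))
    # 128000 / 64 divides evenly, so no extra step is needed.
    print(steps_per_epoch(128000, 64))
```

Setting the validation interval to this number of steps gives roughly one validation per epoch.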
If you set the verbose level to a value > 0, there is an option to log an info message each time one epoch is completed.