I have a fairly small vocabulary (<100) and a solid number of samples (10k). How can I estimate how many iterations I need for a basic input -> output mapping (no word features)?
I think you should just set a high iteration number, monitor the metric you care about (e.g. the evaluation loss), and manually stop the training once this metric meets your expectations.
Then for future training, you can use this knowledge to set a more precise value.
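Once you know roughly how many epochs the model needs, converting that into an iteration count is simple arithmetic. Here is a minimal sketch, assuming a plain mini-batch setup; the sample count, batch size, and epoch count below are purely illustrative:

```python
import math

def epochs_to_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps needed to pass over the data num_epochs times."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

# Illustrative numbers: 10k samples, batch size 64, 20 epochs.
print(epochs_to_steps(10_000, 64, 20))  # 157 steps/epoch -> 3140 steps
```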
Until recently I used an earlier version of OpenNMT(-py) that worked with epochs instead of training steps.
While I can calculate from the data (more or less) how many training steps I would require, I find it rather cumbersome, and I don’t see the benefit of steps over epochs.
Furthermore, validation also runs after a predetermined number of steps and scores a model at that point. For some models (trained on large data sets) that may be quite early in the training process, but for others (trained on small data sets) it may come many iterations after the model has already reached its maximum performance. This means that saving and validation are two extra parameters that need to be explored empirically.
Also, while it is easy to compare two (or more) models trained for the same number of steps, such a comparison is not realistic, given that one may have iterated over the data many more times than the other.
Does anyone have suggestions on how to improve the training setup?
One way to fix this might be to validate after each epoch. You could also use early stopping on either training or validation perplexity (where validation is run after each epoch).
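As a sketch of the patience-based early stopping suggested above (the class, its names, and the thresholds are my own for illustration, not an OpenNMT API):

```python
class EarlyStopping:
    """Stop when validation perplexity has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_ppl: float) -> bool:
        """Record one epoch's validation perplexity; return True to stop training."""
        if val_ppl < self.best - self.min_delta:
            self.best = val_ppl
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Example: validation perplexity measured after each epoch.
stopper = EarlyStopping(patience=2)
for ppl in [30.0, 20.0, 15.0, 15.5, 15.4]:
    if stopper.step(ppl):
        print("stopping early")  # triggered after 2 epochs without improvement
        break
```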
It seems to me that the issues you raised are actually drawbacks of epoch-based training, not step-based.
The issue with epoch-based training is specifically that it is dependent on the training data size, and thus model saving and validation may occur at different stages of the training.
In contrast, step-based training is independent of the dataset size and you can define a configuration that works and is comparable whether the training data is large or small.
In your message, you could literally swap “number of steps” for “epochs” and the criticism would be accurate.
Thanks for the clarification. I think these are different perspectives on the same issue. For me, comparing two models with respect to how many times the complete data set has been used is more important than how many forward/backward passes have been run.
And do you know if there is any work on early stopping for a next version of OpenNMT?
Yes, I’m fairly confident it will happen in both OpenNMT-py and -tf soon.
Great, thanks a lot.