I am using the Dynamic Dataset feature to train an MT model from a large parallel corpus. During training I hit an OutOfMemory error and had to resume the training (I think I avoided the error on the second run by reducing the maximum sentence size). Anyway, two questions arose when resuming the training:
As Dynamic Dataset takes random samples from the training data, I guess that it is completely safe to resume a training, right?
But then, even if the training does not crash, is it possible that, due to the random sampling, some sentences of the training set are never explored during training while others are overrepresented? Should I expect the same translation quality from a system trained with “partition sampling” and from a system trained with “uniform sampling” (keeping all other parameters unchanged, of course)?
Yes, it is safe to resume a training: there is no difference between a single training run and multiple resumed runs, even if each run lasts only one iteration.
Yes, there is a chance that some sentences are never explored. However, if your corpus is relatively small (let's say 10M sentences), the probability that the same sentence is never picked across multiple epochs is tiny, and if your corpus is very large (100M+ sentences), the sampler has to pick a subset anyway. If you want to make sure that one specific corpus is always taken at each epoch, just use “*” as its weight.
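To give a rough sense of how quickly that probability shrinks, here is a toy back-of-the-envelope sketch in Python (the corpus size, per-epoch sample size, and epoch count are made-up illustration numbers, and it assumes each epoch draws its sample uniformly and independently; this is not OpenNMT code):

```python
# Toy estimate (not OpenNMT code): probability that one specific sentence
# is never drawn when each epoch samples k sentences uniformly (without
# replacement) from a corpus of N sentences, over E independent epochs.
def prob_never_sampled(corpus_size, sample_per_epoch, epochs):
    miss_one_epoch = 1.0 - sample_per_epoch / corpus_size
    return miss_one_epoch ** epochs

# Example: 10M-sentence corpus, 5M sentences sampled per epoch, 20 epochs.
print(prob_never_sampled(10_000_000, 5_000_000, 20))  # 0.5**20 ~ 9.5e-07
```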
Moreover, I empirically confirmed that training with Dynamic Dataset and training with the classical preprocess.lua + train.lua pipeline produce similar results on a 17M-sentence training corpus.
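For what it's worth, you can also get a feel for the coverage question with a tiny simulation. This is only a toy sketch: it assumes “uniform sampling” means each epoch draws a fixed-size random sample from the corpus (a full pass, or a corpus weighted with “*”, would see every sentence each epoch), and the numbers are arbitrary and far smaller than a real corpus.

```python
import random

def coverage_after_epochs(corpus_size, sample_per_epoch, epochs, seed=0):
    """Fraction of a toy corpus seen at least once under per-epoch uniform sampling."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(epochs):
        # Each epoch independently draws sample_per_epoch distinct sentence ids.
        seen.update(rng.sample(range(corpus_size), sample_per_epoch))
    return len(seen) / corpus_size

# Toy numbers: 100k-sentence corpus, 50k sentences sampled per epoch.
# A full pass over the data would give 1.0 coverage after any single epoch.
for epochs in (1, 5, 10, 20):
    print(epochs, round(coverage_after_epochs(100_000, 50_000, epochs), 6))
```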