Training steps and resuming training: how does it work?

OK, this may sound dumb, but I just want to make sure that I’m thinking in the right direction here. Please correct me if I’m wrong.

Let’s say I have 1,000,000 parallel sentences and I’m training a translation model with batch_size = 32. From the training logs, my train/eval loss stops decreasing after, say, 18,000 steps, so I stopped the training at that point.

Is my understanding correct?

- An epoch usually means one pass over all of the training data.
- For instance, if I have 1,000,000 sentences and a batch size of 32, then
  one epoch should contain 1,000,000 / 32 = 31,250 steps.
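
The arithmetic above can be checked with a short sketch. It assumes the batch size is counted in sentences (not tokens) and that every step consumes exactly one full batch; the variable names are only illustrative.

```python
import math

num_sentences = 1_000_000   # size of the parallel corpus
batch_size = 32             # assumed to be counted in sentences, not tokens
steps_done = 18_000         # step at which training was stopped

# Steps needed for one full pass over the data (rounded up for a partial last batch).
steps_per_epoch = math.ceil(num_sentences / batch_size)

# Number of examples drawn so far, and the fraction of one epoch this represents.
examples_consumed = steps_done * batch_size
fraction_of_epoch = examples_consumed / num_sentences

print(steps_per_epoch)     # 31250
print(examples_consumed)   # 576000
print(fraction_of_epoch)   # 0.576
```

So 18,000 steps correspond to a bit more than half an epoch, and 1,000,000 - 576,000 = 424,000 is the count that a strictly sequential pass would not have reached yet.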

If so, then my model has not even seen the entire dataset in the first place. About 424,000 sentences were never used, because I stopped the training after 18,000 steps. Now I want to continue training on the same data, but I don’t want the model to go through the already-seen first part of the training corpus again (i.e. I only want it to continue training on the remaining 424,000 sentences). When I resume the training, I can see that the model picks up the last checkpoint and resumes at step 18,000.

At this point, I’m just not sure whether the model is training on all of the training data from the beginning, or only on the remaining 1,000,000 - (18,000 x 32) = 424,000 sentences.


Yes, your epoch calculation is correct.

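Examples are randomly sampled from the full dataset, so the first 18,000 training steps saw examples from across the whole corpus, not just the first 18,000 x 32 = 576,000 examples. There is therefore no fixed block of 424,000 unseen sentences left over at the end of the corpus.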
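
To illustrate the point, here is a rough simulation. It is only a sketch: it assumes uniform sampling with replacement over example indices, which may differ from how a given framework actually shuffles (for example, a shuffle buffer or a per-epoch permutation). The function name `simulate_seen_indices` is hypothetical, and the corpus is scaled down by 10x so it runs quickly.

```python
import random

def simulate_seen_indices(num_examples, batch_size, steps, seed=0):
    """Return the set of example indices drawn during `steps` random batches.

    Assumption: each batch is drawn uniformly at random, with replacement,
    from the whole corpus. Real data pipelines may shuffle differently.
    """
    rng = random.Random(seed)
    seen = set()
    for _ in range(steps):
        batch = [rng.randrange(num_examples) for _ in range(batch_size)]
        seen.update(batch)
    return seen

# Scaled-down version of the numbers in the question (1/10 of the corpus and steps).
num_examples = 100_000
batch_size = 32
steps = 1_800

seen = simulate_seen_indices(num_examples, batch_size, steps)

# A strictly sequential reader would only have touched indices below this bound.
first_block = steps * batch_size

# Count distinct seen examples that lie beyond that sequential "first part".
outside_first_block = sum(1 for i in seen if i >= first_block)

print("distinct examples seen:", len(seen))
print("of which beyond the first", first_block, "indices:", outside_first_block)
```

Under this assumption, a large share of the examples seen lie in the part of the corpus that a sequential pass would not have reached, and some examples in the "first part" have not been seen at all. Resuming from the checkpoint simply continues this kind of sampling.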
