How do I work out how many times each of my sentences is seen during training (or whether each one is seen at least once)?
I really have no idea which of these parameters (below) to use to work that out.
sentence count (per language) = 4.5 million (eg en-de)
steps = 200,000
The easiest way is probably to check in your logs how many times each shard has been loaded at any point.
Loading dataset from ... your_dataset.X.pt --> X being the shard id.
If shard X has been loaded Y times, then its contents have been seen Y times (or maybe Y-1, if you account for the batch still sitting in the queue).
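A quick sketch of counting shard loads from the log. The regex assumes the "Loading dataset from ... your_dataset.X.pt" line format mentioned above; the log excerpt below is hypothetical, so adjust both to your actual output:

```python
import re
from collections import Counter

def count_shard_loads(log_lines):
    """Count how many times each shard id X appears in
    'Loading dataset from ... your_dataset.X.pt' lines.
    Assumes shard files end in '.<id>.pt'; adjust for your logs."""
    pattern = re.compile(r"Loading dataset from .*\.(\d+)\.pt")
    loads = Counter()
    for line in log_lines:
        m = pattern.search(line)
        if m:
            loads[int(m.group(1))] += 1
    return loads

# Hypothetical log excerpt:
log = [
    "[INFO] Loading dataset from data/your_dataset.0.pt",
    "[INFO] Loading dataset from data/your_dataset.1.pt",
    "[INFO] Loading dataset from data/your_dataset.0.pt",
]
print(count_shard_loads(log))  # shard 0 loaded twice, shard 1 once
```

Any shard whose count is zero (or missing from the Counter) has not been trained on at all.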
You can also have a look at this topic.
OK, so that’s about 1000 sentences per training step in my runs on en-de WMT14.
And each sentence is seen about 40 times in 200K steps.
Interesting round number.
But it’s not in any of the parameters above!
I could play numerology: batch_size / accum_count = ~1000...
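The back-of-the-envelope arithmetic above can be written out explicitly. The sentences_per_step value is my rough estimate from the thread (~1000 sentences per step), not a parameter read from any config, and with these numbers the answer lands near the ~40 passes mentioned:

```python
# Rough pass count: how many times each training sentence is seen.
# sentence_count and steps come from the thread; sentences_per_step
# (~1000) is an estimate of the effective batch size in sentences.
sentence_count = 4_500_000   # en-de corpus size
steps = 200_000              # training steps
sentences_per_step = 1000    # estimated sentences consumed per step

sentences_seen = steps * sentences_per_step
passes = sentences_seen / sentence_count
print(f"each sentence seen ~{passes:.1f} times")  # ~44.4 with these numbers
```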
BTW This is all interesting.
I found that recall on the late part of my training set is 4 BLEU above the test set (whereas recall on the early part is no better than on the test set).
Surprising given the data has been seen 39 other times!
But I guess it makes sense for the most recent data to have a larger memory effect on the parameters, since each batch effectively partially erases previous training.