How many times is my data being trained on?

paulkp · May 4, 2020, 1:18am

@guillaumekln @francoishernandez
Hi everyone:
How do I work out how many times each of my sentences is seen during training (or whether it is seen at least once)?
I really have no idea which of the parameters below to use to work that out.

e.g.
sentence count (per language) = 4.5 million (e.g. en-de)
steps = 200,000
batch_type tokens
batch_size 4096
train_steps 200000
max_generator_batches 32
normalization tokens
accum_count 4

francoishernandez · May 4, 2020, 12:37pm

Hey Paul,
The easiest way is probably to check your logs for how many times each shard has been loaded so far.
(Loading dataset from.... your_dataset.X.pt) --> X being the shard id.
If shard X has been loaded Y times, then what it contains has been seen Y times (or maybe Y-1 times, if you take into account that batches from the latest load may still be in the queue).
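
A minimal Python sketch of that count, in case it helps. It assumes each load is logged on a single line containing "Loading dataset from" followed by a path ending in .X.pt, as in the message above; the function name count_shard_loads and the file name train.log are placeholders, so adjust the pattern to whatever your logs actually print.

```python
import re
from collections import Counter

# Assumed log line shape (taken from the message quoted above):
#   "... Loading dataset from <path>/your_dataset.<X>.pt ..."
# The greedy \S* makes the pattern pick the last ".<digits>.pt" in the path.
SHARD_RE = re.compile(r"Loading dataset from\s+\S*\.(\d+)\.pt\b")


def count_shard_loads(log_lines):
    """Return a Counter mapping shard id -> number of times it was loaded."""
    counts = Counter()
    for line in log_lines:
        match = SHARD_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Usage (train.log is a hypothetical file name):
#   with open("train.log", encoding="utf-8") as f:
#       for shard, n in sorted(count_shard_loads(f).items()):
#           print(f"shard {shard}: loaded {n} times")
```

If every shard shows roughly the same count Y, the whole corpus has been seen about Y times (or Y-1, as noted above).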

You can also have a look at this topic.

paulkp · May 5, 2020, 12:57am

Thanks.
OK, so that’s about 1000 sentences per training step in my runs on en-de WMT14.
And each sentence is seen about 40 times in 200K steps.
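
For anyone wanting to redo the arithmetic, here is a rough sketch using only the numbers in this thread (about 1000 sentences per step, 200K steps, 4.5 million sentence pairs). The function names are made up for illustration, and the per-step figure is an estimate, so the result is approximate.

```python
def passes_over_data(sentences_per_step, train_steps, corpus_sentences):
    """Approximate number of times each sentence is seen during training."""
    return sentences_per_step * train_steps / corpus_sentences


def sentences_per_step_for(passes, train_steps, corpus_sentences):
    """Inverse: sentences per step implied by a given number of passes."""
    return passes * corpus_sentences / train_steps


passes = passes_over_data(1000, 200_000, 4_500_000)
print(f"{passes:.1f} passes")  # 44.4 passes

implied = sentences_per_step_for(40, 200_000, 4_500_000)
print(f"{implied:.0f} sentences per step")  # 900 sentences per step
```

So "about 1000 per step" and "about 40 times" agree to within rounding: exactly 40 passes would mean 900 sentences per step.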

Interesting round number.
But it’s not in any of the parameters above!
I could play numerology...

batch_size / accum_count = ~1000...
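
A less numerological sketch, under assumptions I have not verified against my own run: that with batch_type tokens each batch holds up to batch_size tokens, that accum_count batches are accumulated per training step, and that the number of GPUs multiplies this again. The n_gpus parameter is my own addition, since the GPU count is not given above.

```python
def tokens_per_step(batch_size, accum_count, n_gpus=1):
    """Upper bound on tokens per training step, assuming gradient
    accumulation over accum_count token-based batches on each GPU."""
    return batch_size * accum_count * n_gpus


tokens = tokens_per_step(batch_size=4096, accum_count=4)
print(tokens)  # 16384

# Average sentence length implied by ~1000 sentences per step
# (single-GPU assumption; more GPUs would scale this up).
print(tokens / 1000)  # 16.384
```

Under these assumptions the sentences per step depend on sentence length and GPU count as well as on batch_size and accum_count, which would explain why the ~1000 figure does not appear directly in the parameters.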

BTW, this is all interesting.
I found that the recall of the late part of my training set is 4 BLEU above the test set, whereas the recall of the early part is no better than the test set.

Surprising, given that the data has been seen 39 other times!
But I guess it makes sense for the most recently seen data to leave a stronger trace in the parameters, given that each batch effectively partially erases previous training.