How many samples does one training step cover when gradient accumulation is used?
For example, if the log output shows training step 1/1000 and you have 2 GPUs, a batch size of 6, and an accum_count of 2, does that one training step include 2 x 6 x 2 samples? Or does it include only 2 x 6 samples, with the actual optimization update not shown in the log?
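To make the arithmetic in the question concrete, here is a minimal sketch. It assumes that one logged step corresponds to one optimizer update after all accumulated batches on all GPUs have been processed, and that the batch size counts examples per GPU; neither assumption is confirmed in this thread. The function name `samples_per_step` is made up for illustration.

```python
def samples_per_step(num_gpus: int, batch_size: int, accum_count: int) -> int:
    """Samples contributing to one optimizer update.

    Assumption: batch_size counts examples per GPU, and gradients from
    accum_count batches on each GPU are summed before a single update.
    """
    return num_gpus * batch_size * accum_count


# The numbers from the question: 2 GPUs, batch size 6, accum_count 2.
print(samples_per_step(num_gpus=2, batch_size=6, accum_count=2))  # 24
```

Under the other reading in the question (accumulation not counted in a step), the figure would be 2 x 6 = 12 instead.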
Ah, alright. That was also one of my questions, since the printout about loading data seemed off. Thanks!
I found this topic really interesting.
If a dataset has 4500 examples and we have a batch_size of 4096, 1 GPU, and an accum_count of 4, I guess the dataset is loaded 4 times in each step. Is that correct?
A batch_size of 4096 is probably measured in tokens, not examples?
But if batch_type were examples, then yes, that would be the idea.
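As a rough check of the "loaded 4 times" estimate, the sketch below works through the example-based case. It assumes batch_size counts examples and that one step consumes accum_count batches per GPU; the helper name `dataset_passes_per_step` is hypothetical.

```python
def dataset_passes_per_step(dataset_size: int, batch_size: int,
                            accum_count: int, num_gpus: int = 1) -> float:
    """How many times the dataset is traversed to fill one training step.

    Assumption: batch_size is in examples, and each step consumes
    batch_size * accum_count * num_gpus examples.
    """
    examples_per_step = batch_size * accum_count * num_gpus
    return examples_per_step / dataset_size


# 4500 examples, batch_size 4096, accum_count 4, 1 GPU.
passes = dataset_passes_per_step(4500, 4096, 4)
print(round(passes, 2))  # 3.64
```

So under these assumptions a step would need 16384 examples, which is a little under four passes over 4500 examples.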
I see. Are examples taken until 4096 tokens are filled?
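One common way to build token-based batches is to keep adding examples until the next one would exceed the token budget. The sketch below shows that greedy idea only; it is not taken from any particular toolkit, the name `batch_by_tokens` is invented here, and real implementations often count padded length (number of examples times the longest example) rather than the raw token sum.

```python
from typing import Iterable, Iterator, List


def batch_by_tokens(examples: Iterable[List[str]],
                    max_tokens: int) -> Iterator[List[List[str]]]:
    """Greedily group tokenized examples so each batch holds at most max_tokens.

    Simplification: counts raw tokens, ignoring padding. An example longer
    than max_tokens ends up in a batch of its own.
    """
    batch: List[List[str]] = []
    tokens_in_batch = 0
    for example in examples:
        n = len(example)
        # Close the current batch if adding this example would overflow it.
        if batch and tokens_in_batch + n > max_tokens:
            yield batch
            batch, tokens_in_batch = [], 0
        batch.append(example)
        tokens_in_batch += n
    if batch:
        yield batch


# Four toy "sentences" of 3, 4, 5, and 2 tokens with a budget of 8 tokens.
sentences = [["a"] * 3, ["b"] * 4, ["c"] * 5, ["d"] * 2]
batches = list(batch_by_tokens(sentences, max_tokens=8))
print([sum(len(s) for s in b) for b in batches])  # [7, 7]
```

With this scheme the number of examples per batch varies with sentence length, so a 4096-token batch would usually hold far fewer than 4096 examples.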