How many samples does one training step cover when gradient accumulation is used?
For example, if the log output shows training step 1/1000 and you have 2 GPUs, a batch size of 6, and an accum_count of 2, does that one training step include 2 x 6 x 2 samples? Or does it include only 2 x 6 samples, with the actual optimization happening somewhere you do not see in the log?
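To make the two readings concrete, here is the arithmetic as a small sketch. It assumes batch_size is counted in samples per GPU and that accumulation means gradients from accum_count batches are summed before a single optimizer update; the function names are made up for illustration and are not part of any toolkit. It does not settle which of the two numbers the logged step corresponds to.

```python
def samples_per_batch_round(num_gpus: int, batch_size: int) -> int:
    """Samples seen when every GPU processes one batch (no accumulation)."""
    return num_gpus * batch_size


def samples_per_optimizer_update(num_gpus: int, batch_size: int, accum_count: int) -> int:
    """Samples contributing to one optimizer update when gradients are
    accumulated over accum_count batches on each GPU (assumption)."""
    return samples_per_batch_round(num_gpus, batch_size) * accum_count


# Values from the question: 2 GPUs, batch size 6, accum_count 2.
without_accum = samples_per_batch_round(2, 6)
with_accum = samples_per_optimizer_update(2, 6, 2)
print(f"2 x 6 = {without_accum}, 2 x 6 x 2 = {with_accum}")
```

If a logged step is one optimizer update, the second figure is the one that applies; if it is one batch round, the first one is.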
Hi,
I found this topic really interesting.
If a dataset has 4500 examples and we have a batch_size of 4096, 1 GPU, and an accum_count of 4, I guess the dataset is loaded 4 times in each step. Is that correct?
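As a rough check on that guess, the numbers can be worked through as below. This is only a sketch under two assumptions: that batch_size is counted in examples (if it is counted in tokens, the calculation does not apply), and that one step consumes num_gpus x batch_size x accum_count examples. The helper name is hypothetical.

```python
def dataset_passes_per_step(dataset_size: int, num_gpus: int,
                            batch_size: int, accum_count: int) -> float:
    """How many times the dataset would have to be traversed to supply
    one step, assuming batch_size is measured in examples."""
    examples_per_step = num_gpus * batch_size * accum_count
    return examples_per_step / dataset_size


# Values from the post: 4500 examples, 1 GPU, batch_size 4096, accum_count 4.
passes = dataset_passes_per_step(4500, 1, 4096, 4)
print(f"examples per step: {1 * 4096 * 4}, dataset passes per step: {passes:.2f}")
```

Under these assumptions one step needs 16384 examples, which is somewhat fewer than 4 full passes over 4500 examples.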