When we use e.g. world_size 2, i.e. two GPUs in total, the model sees twice as much data per step as it would on a single GPU. Do the parameters we pass (e.g. warmup_steps) scale/normalize with world_size? If not, should they, and should we adjust them manually?
So if we run an experiment with warmup_steps 8000 on a single GPU and want to replicate that behaviour as closely as possible on two GPUs, should we reduce warmup_steps to 4000?
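To make my reasoning concrete, this is the back-of-the-envelope calculation I have in mind for keeping the amount of data consumed during warmup constant (a rough sketch; the numbers and variable names are just illustrative, not actual toolkit options):

```python
# Hypothetical sanity check: keep the data volume seen during warmup constant
# when moving from 1 GPU to world_size GPUs. Names and numbers are illustrative.

single_gpu_warmup_steps = 8000
batch_size = 4096          # tokens (or examples) per GPU per step, made up
world_size = 2

# Data consumed during warmup on a single GPU:
warmup_data_single = single_gpu_warmup_steps * batch_size             # 32,768,000

# With world_size GPUs, each optimizer step consumes world_size * batch_size,
# so seeing the same amount of data during warmup would mean halving the steps:
scaled_warmup_steps = single_gpu_warmup_steps // world_size           # 4000
warmup_data_multi = scaled_warmup_steps * world_size * batch_size     # 32,768,000

assert warmup_data_single == warmup_data_multi
```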
I haven't experimented much with this, and some behaviours are not necessarily straightforward with such models.
Maybe have a look at this paper (section 4.8 for instance): https://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
I guess this is more of a theoretical than an experimental question, though. (So perhaps I should have labeled it as usage instead of research.)
If we run the model on two GPUs, each replica processes its own batch, so optimizer.step() works with a larger effective sample. In other words, more data has been processed per step, which means the (averaged) gradients are a better estimate than with one GPU. That is already a big difference.
Because of this, settings like warmup_steps also give a quite different result compared to one GPU. With one GPU, 8000 steps means the optimizer has only seen 8000 batches of size batch_size, i.e. 8000 * batch_size samples. With two GPUs, 8000 steps means the optimizer has seen 8000 * 2 * batch_size samples.
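Here is a tiny single-process sketch of the equivalence I mean (plain PyTorch, no actual DDP; the toy model and data are made up): averaging the two per-GPU gradients gives the same gradient as one GPU computing it over the combined, twice-as-large batch.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)                   # toy "model": a single weight vector
x_gpu0, x_gpu1 = torch.randn(8, 4), torch.randn(8, 4)    # per-"GPU" batches of 8 samples each

def grad_of_mean_loss(batch):
    # per-sample loss (x_i . w)^2, averaged over the batch
    loss = (batch @ w).pow(2).mean()
    return torch.autograd.grad(loss, w)[0]

# What distributed data parallelism effectively does: each replica computes
# its local gradient, then the gradients are averaged across replicas.
g_averaged = (grad_of_mean_loss(x_gpu0) + grad_of_mean_loss(x_gpu1)) / 2

# What a single GPU would compute on the combined batch of 16 samples.
g_combined = grad_of_mean_loss(torch.cat([x_gpu0, x_gpu1]))

print(torch.allclose(g_averaged, g_combined))  # True: same gradient, but based on 2x the data per step
```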
Or am I wrong?