When we use e.g. world_size=2, i.e. two GPUs in total, at each optimizer step the model sees twice as much data as it would on a single GPU. Do the parameters we pass (e.g. warmup_steps) scale/normalize with world_size? If not, should they, and should we adjust them manually?
So if we run an experiment with warmup_steps=8000 on a single GPU and want to replicate that behaviour as closely as possible on two GPUs, should we reduce warmup_steps to 4000?
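To make the question concrete, here is a minimal sketch of the scaling I have in mind. It assumes the per-GPU batch size is held fixed, so the effective batch size grows linearly with world_size; the helper name `scale_steps` is illustrative, not from any library:

```python
def scale_steps(steps_single_gpu: int, world_size: int) -> int:
    """Return the step count that covers the same amount of data
    when the effective batch size grows by a factor of world_size."""
    # Each optimizer step now consumes world_size times more samples,
    # so the same data is covered in proportionally fewer steps.
    return max(1, round(steps_single_gpu / world_size))

# E.g. 8000 warmup steps on 1 GPU would correspond to 4000 on 2 GPUs:
print(scale_steps(8000, 2))  # → 4000
```

Is this division by world_size the right way to think about it, or does the framework already account for it internally?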