Thanks, Guillaume! The reason I asked is that when there is an out-of-memory error caused by a too-large vocab_size, batch_size, or valid_batch_size, the message usually refers to the memory of a single GPU, which made me wonder whether each GPU must have enough memory on its own or whether multiple GPUs can provide the required memory cumulatively.
Just a note in case someone comes across this thread: batch_size can be a tricky parameter here. When training on multiple GPUs, say 2, the effective batch size is roughly multiplied by 2, whereas on a single GPU it stays the same as the value in the config.
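A rough sketch of what I mean (the helper name is just for illustration, not an actual OpenNMT function):

```python
# Hypothetical helper: the config value is the per-GPU batch size,
# so the effective (global) batch size scales with the number of GPUs.
def effective_batch_size(batch_size: int, num_gpus: int) -> int:
    return batch_size * num_gpus

print(effective_batch_size(4096, 1))  # 4096 -> same as in the config
print(effective_batch_size(4096, 2))  # 8192 -> roughly doubled on 2 GPUs
```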
@guillaumekln Now, my question is: if we use one GPU with double the memory (e.g. 20GB) and increase the batch_size to make use of it, will the quality be worse than using 2 GPUs with the same cumulative memory (e.g. 10GB each), since the gradient will be averaged over too many examples? Thanks, Guillaume!
For the record, I find this discussion interesting, as I wonder how this differs from or resembles the way OpenNMT-py, for example, uses multiple GPUs.
Yes, the batch_size parameter corresponds to the batch size on each GPU.
It will be the same if you set the batch_size to N when training on 2 GPUs, and set it to 2N on the single GPU. In both cases the gradients are averaged over 2N examples.
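To make that concrete, here is a minimal NumPy sketch (a toy linear model, not OpenNMT code) showing that averaging the gradients of two per-GPU batches of N examples gives the same result as one gradient over a single batch of 2N examples, as long as each per-batch loss is a mean:

```python
import numpy as np

# Toy linear model: loss = mean((x @ w - y) ** 2); gradient w.r.t. w.
def grad(w, x, y):
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=(8, 3))  # a batch of 2N = 8 examples
y = rng.normal(size=8)

# Single GPU with batch_size = 2N: one gradient over all 8 examples.
g_single = grad(w, x, y)

# Two GPUs with batch_size = N each: average the two per-GPU gradients.
g_multi = (grad(w, x[:4], y[:4]) + grad(w, x[4:], y[4:])) / 2

print(np.allclose(g_single, g_multi))  # True: the update is identical
```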