Thanks, Guillaume! The reason I asked is that when there is an out-of-memory error caused by a too-large vocab_size, batch_size, or valid_batch_size, the message usually refers to the memory of a single GPU, which made me wonder whether each GPU must have enough memory on its own or whether multiple GPUs can provide the required memory cumulatively.
Just a note in case someone comes across this thread: batch_size can be a tricky parameter here. When training on multiple GPUs, say 2, the effective batch size is roughly multiplied by 2, whereas on a single GPU it stays the same as the value in the config.
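A rough sketch of what I mean (the helper name is just for illustration, not an actual OpenNMT function):

```python
# Hypothetical helper: the config value is the per-GPU batch size,
# so the effective (global) batch size scales with the number of GPUs.
def effective_batch_size(batch_size: int, num_gpus: int) -> int:
    return batch_size * num_gpus

print(effective_batch_size(4096, 1))  # 4096 -> same as in the config
print(effective_batch_size(4096, 2))  # 8192 -> roughly doubled on 2 GPUs
```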
@guillaumekln Now, my question is: if we use one GPU with double the memory (e.g. 20GB) and increase the batch_size to make use of it, will the quality be worse than using 2 GPUs with the same cumulative memory (e.g. 10GB each), since the gradient will be averaged over too many examples? Thanks, Guillaume!
For the record, I find this discussion interesting, as I wonder how this differs from or resembles the way OpenNMT-py, for example, uses multiple GPUs.
Yes, the batch_size parameter corresponds to the batch size on each GPU.
It will be the same if you set the batch_size to N when training on 2 GPUs, and set it to 2N on the single GPU. In both cases the gradients are averaged over 2N examples.
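To make that concrete, here is a minimal NumPy sketch (a toy linear model, not OpenNMT code) showing that averaging the gradients of two per-GPU batches of N examples gives the same result as one gradient over a single batch of 2N examples, as long as each per-batch loss is a mean:

```python
import numpy as np

# Toy linear model: loss = mean((x @ w - y) ** 2); gradient w.r.t. w.
def grad(w, x, y):
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=(8, 3))  # a batch of 2N = 8 examples
y = rng.normal(size=8)

# Single GPU with batch_size = 2N: one gradient over all 8 examples.
g_single = grad(w, x, y)

# Two GPUs with batch_size = N each: average the two per-GPU gradients.
g_multi = (grad(w, x[:4], y[:4]) + grad(w, x[4:], y[4:])) / 2

print(np.allclose(g_single, g_multi))  # True: the update is identical
```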