Just a note in case someone comes across this thread:
batch_size can be a tricky parameter in this case. The value in the config is per GPU: when training on multiple GPUs, say 2, the effective batch size is in effect multiplied by 2, while on a single GPU it stays the same as in the config.
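To make that concrete, here is a minimal sketch of the scaling (the variable names and values are just illustrative, not part of any OpenNMT API):

```python
# Minimal sketch of how the effective batch size scales with
# data-parallel training. Names and values are illustrative only.

config_batch_size = 3072   # value taken from the training config (per replica)
num_gpus = 2               # number of data-parallel replicas

# Each GPU builds its own batch of config_batch_size examples,
# so one training step consumes num_gpus * config_batch_size examples.
effective_batch_size = config_batch_size * num_gpus
print(effective_batch_size)  # 6144
```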
@guillaumekln Now, my question is: if we use one GPU with double the memory (e.g. 20GB) and increase the batch_size to make use of this memory, will the quality be worse than using 2 GPUs with the same cumulative memory (e.g. 10GB each), since the gradient will be averaged over too many examples? Thanks, Guillaume!
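To spell out what I mean, here is a minimal NumPy sketch (my own toy example, not OpenNMT code) of the case I am asking about. It assumes simple data-parallel gradient averaging, under which the two setups would be mathematically equivalent: averaging the per-replica mean gradients of two half-batches gives the same result as the mean gradient over the full batch.

```python
import numpy as np

# Toy linear model: loss_i = 0.5 * (w @ x_i - y_i)**2,
# so grad_i = (w @ x_i - y_i) * x_i. Purely illustrative.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
x = rng.normal(size=(8, 4))   # a "large" batch of 8 examples
y = rng.normal(size=8)

def mean_grad(xb, yb):
    err = xb @ w - yb
    return (err[:, None] * xb).mean(axis=0)

# One GPU with double the memory: mean gradient over the full batch of 8.
g_single = mean_grad(x, y)

# Two GPUs: each replica averages over its half-batch of 4,
# then the per-replica gradients are averaged together.
g_multi = (mean_grad(x[:4], y[:4]) + mean_grad(x[4:], y[4:])) / 2

print(np.allclose(g_single, g_multi))  # True
```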
For the record, I find this discussion interesting, as I wonder how this behavior differs from, or is similar to, how OpenNMT-py, for example, uses multiple GPUs.