Hello to everyone reading this.
I have been playing with the batch size, looking to optimize memory usage and reduce training time.
(Running OpenNMT 0.3)
What I noticed is that doubling the default batch size makes each iteration about 1.5x slower, but of course it also halves the number of iterations, so the total time comes out to roughly 0.75x of what the default batch size needs. That is all good, but here comes the issue: the per-epoch perplexity for the larger batch size was much higher than the perplexity for the default batch size.
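Just to spell out that arithmetic (the per-iteration time and iteration count below are made-up placeholders, not measurements):

```python
# Back-of-the-envelope check of the timing trade-off described above.
default_iter_time = 1.0          # seconds per iteration at the default batch size (placeholder)
default_iters = 10000            # iterations per epoch at the default batch size (placeholder)

large_iter_time = 1.5 * default_iter_time  # 2x batch -> ~1.5x slower per iteration
large_iters = default_iters / 2            # 2x batch -> half as many iterations

ratio = (large_iter_time * large_iters) / (default_iter_time * default_iters)
print(ratio)  # 0.75, i.e. ~25% less wall-clock time per epoch
```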
And my question: are there any guidelines on how to choose the batch size? Can the higher perplexity be explained, and can it be reduced?
This seems very empirical to me. A larger batch size means better usage of GPU resources, but fuzzier gradients and therefore slower convergence.
The default value seems like a sweet spot, but ideally we should plot learning curves for different batch sizes. Previous experiments also showed that a maximum batch size larger than roughly 128 can alter the training convergence.
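For example, a minimal sketch of what plotting those learning curves could look like, assuming you have collected the per-epoch validation perplexity from each run's log (the numbers below are placeholders, not real results):

```python
# Compare per-epoch validation perplexity for several batch sizes.
# The perplexity values are placeholders; in practice you would parse them
# out of each run's training log.
import matplotlib.pyplot as plt

runs = {
    64:  [52.1, 31.4, 22.8, 18.9, 16.7],   # default batch size (placeholder numbers)
    128: [60.3, 36.9, 27.5, 23.1, 20.4],   # larger batch size (placeholder numbers)
}

for batch_size, ppl in runs.items():
    epochs = range(1, len(ppl) + 1)
    plt.plot(epochs, ppl, marker="o", label="batch size %d" % batch_size)

plt.xlabel("epoch")
plt.ylabel("validation perplexity")
plt.legend()
plt.savefig("batch_size_learning_curves.png")
```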
Around 128, give or take. Increasing the batch size can improve training speed, but the convergence result may be worse. If memory allows, you can consider starting from a relatively large value, because a batch size that is too large generally does not hurt the results much, whereas a batch size that is too small may give poor results.
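If you want to explore this systematically, one option is to sweep a few batch sizes with otherwise identical settings. A rough sketch, assuming the Lua `train.lua` script and its `-max_batch_size` option (the data and model paths are hypothetical; adjust the command to whatever your setup actually uses):

```python
# Sketch of a simple batch-size sweep. The training command and the
# -max_batch_size flag are assumptions about the OpenNMT (Lua) CLI; replace
# them with whatever your installation actually uses.
import subprocess

for batch_size in (32, 64, 128, 256):
    cmd = [
        "th", "train.lua",
        "-data", "data/demo-train.t7",               # hypothetical preprocessed data file
        "-save_model", "models/bs%d" % batch_size,   # keep each run's checkpoints separate
        "-max_batch_size", str(batch_size),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```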
Indeed - the convergence is not as good as for the default setting.
I am using two GPUs – one with 12GB of RAM and one with 4GB. I noticed that during training the memory of the 12GB card is not fully utilized, and it would be great if it were, since leaving it partly idle is wasted capacity. That is why I increased the batch size, and, as I mentioned earlier, the training time decreases as memory utilization goes up.
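For reference, this is roughly how I check how much memory each card is actually using while training runs; it only assumes `nvidia-smi` is available on the PATH:

```python
# Quick check of per-GPU memory usage while training is running.
# Assumes nvidia-smi is installed and on the PATH.
import subprocess

out = subprocess.check_output([
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader",
])
print(out.decode())
# Example output (values depend entirely on your machine):
# 0, 8231 MiB, 12189 MiB
# 1, 3540 MiB, 4036 MiB
```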
I will play a bit more and try to see if I can think of some way to improve.
Any suggestions are more than welcome.