I am doing multi-GPU parallel training on a machine with four Titan X (Pascal) GPUs. Training with 3 GPUs succeeds, but when I increase to 4 GPUs I get a "FATAL THREAD PANIC: (pcall) not enough memory" error. After some searching, I found it is probably due to LuaJIT's 1–2 GB memory limit. My model is quite big: the encoder and decoder are both 8-layer residual LSTMs with rnn_size 500, the training set contains ~20M sentences, the source and target vocab sizes are 20k and 50k respectively, and word_vec_size is 500.
But as far as I know, this limit applies only to Lua objects/tables; Torch tensors are allocated outside the LuaJIT heap and are not subject to it. What's more, OpenNMT has already replaced some large Lua tables with tds vectors to mitigate this issue.
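To illustrate the distinction (a minimal sketch, not OpenNMT's actual code): tds containers keep their storage in C-allocated memory, so large collections do not count against the LuaJIT heap limit the way plain Lua tables do.

```lua
local tds = require('tds')

-- A plain Lua table lives on the LuaJIT heap and counts
-- toward the 1-2 GB limit:
local plain = {}
for i = 1, 5 do plain[i] = 'sentence ' .. i end

-- The tds.Vec equivalent keeps its entries in C-allocated
-- memory outside the LuaJIT heap:
local vec = tds.Vec()
for i = 1, 5 do vec:insert('sentence ' .. i) end

print(#vec, vec[1])  -- length and indexing work like a table
```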
So what is the main source of non-tensor memory consumption in OpenNMT that scales with the number of parallel GPUs? Can I reduce or free unnecessary memory to support training on more cards?
The only solution that comes to mind is to reduce max_batch_size, but that would slow down training.
BTW, switching from LuaJIT to Lua 5.2 does indeed solve the problem, but I found it slows down training by ~20%, so I don't consider it an ideal solution.