I am running multi-GPU parallel training on a machine with 4 Titan X Pascal GPUs. Training with 3 GPUs works, but when I increase to 4 GPUs I get a "FATAL THREAD PANIC: (pcall) not enough memory" error. After some googling, I found that it is probably due to the 1~2GB memory limit of LuaJIT. My model is quite big. Roughly, the encoders and decoders are all 8-layer residual LSTMs with rnn_size 500. The training set contains ~20M sentences. The source and target vocabulary sizes are 20k and 50k respectively, and word_vec_size is 500.
However, I know that this limit applies only to Lua objects/tables; torch tensors are not subject to it. What's more, OpenNMT has already replaced some large Lua tables with tds vectors to mitigate this issue.
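To see how much of the memory actually sits on the Lua side, the Lua-managed heap can be logged with the standard `collectgarbage("count")` call, which returns kilobytes in use by Lua objects and does not include tensor storage that torch allocates in C. A minimal sketch, assuming it is called from inside each training thread; `luaHeapMB` is a hypothetical helper and not part of OpenNMT:

```lua
-- Hypothetical helper (not part of OpenNMT): report the Lua-managed heap in MB.
-- collectgarbage("count") returns kilobytes used by Lua objects only;
-- tensor storage allocated by torch in C is not included.
local function luaHeapMB()
  return collectgarbage("count") / 1024
end

local before = luaHeapMB()

-- Allocate a plain Lua table to show that it counts against the Lua heap.
local t = {}
for i = 1, 1e6 do
  t[i] = i
end

local after = luaHeapMB()
print(string.format("Lua heap: %.1f MB -> %.1f MB", before, after))
```

Logging this value per thread at a few points during training (for example after loading the data and after building the model replica) should show which step pushes a thread toward the limit.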
What is the main source of non-tensor memory consumption in OpenNMT that scales with the number of GPUs used in parallel? Can I reduce or free unnecessary memory usage to support training on more cards?
The only solution that comes to mind is to reduce max_batch_size, but that will slow down training.
BTW, switching from LuaJIT to Lua 5.2 does solve the problem, but I found that it slows down training by ~20%, so I don't consider it an ideal solution.