Attempting to start synchronous multi-GPU training (-no_nccl) using all 8 V100 cards on a p3.16xlarge EC2 instance. I’m using LuaJIT with the latest version (7db30090a7ff4055a6062a03b2637ffb8b32374a) of OpenNMT on an Ubuntu CUDA 9 image.
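Roughly the launch command (the data and model names here are placeholders):

    th train.lua -data data/demo-train.t7 -save_model demo-model \
        -gpuid 1 2 3 4 5 6 7 8 -no_nccl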
I’m seeing the following error as it’s preparing the GPUs for training: FATAL THREAD PANIC: (write) /root/torch/install/share/lua/5.1/torch/File.lua:141: Unwritable object <userdata> at <?>.callback.closure.data.DynamicDataset.ddr.DynamicDataRepository.preprocessor.Preprocessor.pool.__gc__
Just before it died, the nvidia-smi output was:
GPU  PID    Type  Process name                    GPU Memory
0    24715  C     /root/torch/install/bin/luajit  3838MiB
1    24715  C     /root/torch/install/bin/luajit  2368MiB
2    24715  C     /root/torch/install/bin/luajit  2368MiB
3    24715  C     /root/torch/install/bin/luajit  2368MiB
4    24715  C     /root/torch/install/bin/luajit  2368MiB
5    24715  C     /root/torch/install/bin/luajit  2368MiB
6    24715  C     /root/torch/install/bin/luajit   638MiB
7    24715  C     /root/torch/install/bin/luajit   618MiB
Yes, a training thread is trying to serialize the preprocessing thread pool, which is not allowed. You should try disabling the preprocessing multithreading for now.
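For example, something along these lines when launching training (I’m quoting the flag from memory, so verify it against th train.lua -h on your build):

    # run preprocessing in the calling thread so no thread pool is created
    th train.lua <your existing options> -preprocess_pthreads 1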
OK, I tried that last night. Different error, but still crashed:
FATAL THREAD PANIC: (pcall) not enough memory
FATAL THREAD PANIC: (pcall) not enough memory
PANIC: unprotected error in call to Lua API (not enough memory)
THCudaCheck FAIL file=/root/torch/extra/cunn/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=8 : invalid device function
PANIC: unprotected error in call to Lua API (not enough memory)
When I run this on a single GPU, the model uses around 6.5GB.
This instance has 488GB of RAM and each V100 card has 16GB of VRAM, so “not enough memory” is a bit humorous here.
Are you using LuaJIT? Torch threads serialize a lot of context when handing work to worker threads, and all of that lands in LuaJIT’s garbage-collected heap, which on 64-bit builds is capped at roughly 1–2GB regardless of how much RAM the machine has, so the limit is reached quickly. Could you try with Lua 5.2 instead?
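To illustrate the ceiling, here is a standalone sketch (the exact limit depends on the platform and build):

    -- keep allocating unique ~1MB strings until the allocator gives up
    local chunks, i = {}, 0
    local ok, err = pcall(function()
      while true do
        i = i + 1
        chunks[i] = ("x"):rep(2^20) .. i
      end
    end)
    chunks = nil; collectgarbage()  -- release the memory so print can run
    -- under LuaJIT this prints false / "not enough memory" long before
    -- system RAM is exhausted; plain Lua 5.2 can keep going much further
    print(ok, err, i)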
Also worth noting: Torch built against Lua 5.2 doesn’t compile out of the box with CUDA 9 and cuDNN 7. Rather than go down the rabbit hole of trying (likely in vain) to get that configuration to work, I switched to CUDA 8/cuDNN 6, and the install went smoothly.
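For anyone else going this route, the rebuild is roughly (assuming the stock torch/distro checkout in ~/torch):

    cd ~/torch && ./clean.sh
    # build the distro against plain Lua 5.2 instead of LuaJIT
    TORCH_LUA_VERSION=LUA52 ./install.sh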
Just launched train.lua on the new image; fingers crossed…