8-GPU training error during garbage collection

Attempting to start sync multi-GPU training (-no_nccl) using all 8 V100 cards on a p3.16xlarge EC2 instance. I’m using LuaJIT with the latest version (7db30090a7ff4055a6062a03b2637ffb8b32374a) of OpenNMT on an Ubuntu CUDA 9 image.
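
For reference, the launch looks roughly like this (the config file name is illustrative; all other options come from the config file shown below):

th train.lua -config exp01a.cfg -no_nccl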

I’m seeing the following error as it’s preparing the GPUs for training:
FATAL THREAD PANIC: (write) /root/torch/install/share/lua/5.1/torch/File.lua:141: Unwritable object <userdata> at <?>.callback.closure.data.DynamicDataset.ddr.DynamicDataRepository.preprocessor.Preprocessor.pool.__gc__

Just before it died, the nvidia-smi output was:
GPU  PID    Type  Process name                      Usage
0    24715  C     /root/torch/install/bin/luajit    3838MiB
1    24715  C     /root/torch/install/bin/luajit    2368MiB
2    24715  C     /root/torch/install/bin/luajit    2368MiB
3    24715  C     /root/torch/install/bin/luajit    2368MiB
4    24715  C     /root/torch/install/bin/luajit    2368MiB
5    24715  C     /root/torch/install/bin/luajit    2368MiB
6    24715  C     /root/torch/install/bin/luajit     638MiB
7    24715  C     /root/torch/install/bin/luajit     618MiB

config file contents:
train_dir = /data1/clientData/dell/jpn/exp01a/data
src_suffix = .en
tgt_suffix = .ja
valid_src = /data1/clientData/dell/jpn/exp01a/data/dev/dev.en
valid_tgt = /data1/clientData/dell/jpn/exp01a/data/dev/dev.ja
src_vocab = /data1/clientData/dell/jpn/exp01a/vocab_en.dict
tgt_vocab = /data1/clientData/dell/jpn/exp01a/vocab_ja.dict
src_vocab_size = 45000
tgt_vocab_size = 45000
tok_src_bpe_model = /data1/clientData/dell/jpn/exp01a/bpe_en.model
tok_tgt_bpe_model = /data1/clientData/dell/jpn/exp01a/bpe_ja.model
src_seq_length = 50
tgt_seq_length = 80
layers = 4
rnn_size = 1000
encoder_type = brnn
gsample = 4356295
gsample_dist = /data1/clientData/dell/jpn/sample.dist
end_epoch = 50
start_decay_at = 15
decay_method = restart
save_model = /data1/clientData/dell/jpn/exp01a/models/dell_exp01a_model
gpuid = 1,2,3,4,5,6,7,8
log_file = /data1/clientData/dell/jpn/exp01a/train.log

Could you try setting -preprocess_pthreads 1?
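
For example, on the command line or as a line in the config file (config keys mirror the command-line option names; the config file name is illustrative):

th train.lua -config exp01a.cfg -no_nccl -preprocess_pthreads 1

or, in the config file:

preprocess_pthreads = 1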

Training and preprocessing use two different, nested thread pools. That is most likely the issue.

It’s not happening during preprocessing…

[11/02/17 13:24:57 INFO]    - word embeddings size: 500
[11/02/17 13:24:58 INFO]    - attention: global (general)
[11/02/17 13:24:58 INFO]    - structure: cell = LSTM; layers = 4; rnn_size = 1000; dropout = 0.3 (naive)
[11/02/17 13:24:58 INFO]  * Bridge: copy
[11/02/17 13:29:35 INFO] Initializing parameters...
[11/02/17 13:29:39 INFO]  * number of parameters: 191706281
[11/02/17 13:29:39 INFO] Preparing memory optimization...
[11/02/17 13:30:28 INFO]  * sharing 72% of output/gradInput tensors memory between clones
[11/02/17 13:48:10 INFO] Start training from epoch 1 to 50...
[11/02/17 13:48:10 INFO]

Given that the crash happens after training has started, will setting -preprocess_pthreads 1 still help?

Yes. A training thread is trying to serialize the preprocessing thread pool, which is not allowed. You should try disabling preprocessing multithreading for now.
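
For anyone curious, here is a minimal sketch of the failure mode using only the threads package (illustrative, not OpenNMT's actual code): a job closure that captures a thread pool cannot be shipped to another thread, because serializing the pool runs into its internal userdata, which is essentially what the __gc__ panic above is reporting.

-- Illustrative only: reproduces the class of error, not OpenNMT's code.
local threads = require('threads')

local preprocPool = threads.Threads(2)  -- stands in for the preprocessing pool
local trainPool   = threads.Threads(2)  -- stands in for the training pool

-- The closure below captures preprocPool as an upvalue. When trainPool
-- serializes the closure to hand it to a worker, the serializer reaches the
-- pool's internal userdata and fails with "Unwritable object <userdata>".
trainPool:addjob(function()
  return preprocPool ~= nil
end)
trainPool:synchronize()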

OK, I tried that last night. Different error, but still crashed:

FATAL THREAD PANIC: (pcall) not enough memory
FATAL THREAD PANIC: (pcall) not enough memory
PANIC: unprotected error in call to Lua API (not enough memory)
THCudaCheck FAIL file=/root/torch/extra/cunn/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=8 : invalid device function
PANIC: unprotected error in call to Lua API (not enough memory)

When I run this on a single GPU, the model uses around 6.5GB.

This instance has 480GB of RAM and each V100 card has 16GB of VRAM, so “not enough memory” is a bit humorous here.

Are you using LuaJIT? As Torch threads involve a lot of context serialization, I think the LuaJIT memory limit is rapidly reached. Could you try with Lua 5.2 instead?
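
If it helps, switching an existing torch/distro checkout to Lua 5.2 is a clean and rebuild (assuming the install lives in /root/torch, as the log paths above suggest), after which OpenNMT's rocks such as tds need reinstalling:

cd /root/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh
luarocks install tds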

Yes. Sorry, I thought I had seen in another thread that there was a workaround that allowed us to use LuaJIT for multi-GPU training.

Will I be able to increase -preprocess_pthreads while using 5.2?

No, you still need to disable the preprocessing threads unfortunately.

Also worth noting: Torch with Lua 5.2 doesn’t compile out of the box with CUDA 9 and cuDNN 8. Rather than go down the rabbit hole of trying (likely in vain) to get that configuration to work, I switched to CUDA 8 / cuDNN 6, and the install went smoothly.
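
A quick sanity check after the reinstall (an illustrative one-liner, assuming cutorch and cudnn were installed by the distro) is just to confirm the CUDA/cuDNN bindings load and see all the cards:

th -e "require('cutorch'); require('cudnn'); print('GPUs:', cutorch.getDeviceCount())"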

Just launched train.lua on the new image; fingers crossed… :slight_smile: