Training gets stuck with GPU at 100% utilization at step 257100 or 507100

Hi everyone, I have run into a very strange problem.
I am training on a large dataset with this script:

python train.py -data ez/ze/ze -save_model ez/model/ze-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 1500000 -max_generator_batches 2 -dropout 0.1 -batch_size 3008 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -log_file ze3log.train -keep_checkpoint 200 -world_size 2 -gpu_ranks 0 1

This training gets stuck at step 257100.
I then quit with Ctrl+C and resume with -train_from; the run then gets stuck at step 507100.
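
For reference, the resume command is essentially the original one with -train_from appended (the checkpoint name follows OpenNMT-py's <save_model>_step_<N>.pt pattern; step 250000 below is simply the last checkpoint saved before the hang):

python train.py -data ez/ze/ze -save_model ez/model/ze-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 1500000 -max_generator_batches 2 -dropout 0.1 -batch_size 3008 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -log_file ze3log.train -keep_checkpoint 200 -world_size 2 -gpu_ranks 0 1 -train_from ez/model/ze-model_step_250000.pt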

When it is stuck, nvidia-smi shows GPU-Util at 100% on all GPUs.

The log stops at this line:

[2019-06-11 22:16:00,046 INFO] Step 507100/1500000; acc: 64.91; ppl: 4.31; xent: 1.46; lr: 0.00012; 11713/13772 tok/s; 223196 sec

This problem happens consistently at step 257100 or 507100.
I hope someone can help. Thanks.

It's annoying, but across many runs we experienced something similar on a specific language pair without finding the reason or a solution.

I’ve dealt with this as well, but have not found a solution. Some info in case someone has time:

  1. Using the same bitext data, it does not happen with the OpenNMT-py/PyTorch versions from February. It does occur with the current stable PyTorch plus master OpenNMT-py.
  2. It is not caused by a particular training example. After the first time getting stuck, I rearranged my train.xx.pt files and the training got stuck on the same step but using different data.
  3. It seems to be deadlocked in clock_gettime() (see the stack trace below; a sketch of how I captured it follows the trace). The closest issue I can find is https://github.com/NVIDIA/nccl/issues/48, which attributes it to the NCCL version. If it's the same issue, it should be fixed in NCCL 2.1. I have a lower version but work on a cluster, so I cannot test whether that works.
(gdb) bt
#0  0x00007ffc215056c2 in clock_gettime ()
#1  0x00007f57f3bbe96d in clock_gettime () from /lib64/libc.so.6
#2  0x00007f57d6033b1e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007f57d60c8b93 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007f57d60e7c09 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#5  0x00007f57d60113a6 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#6  0x00007f57d5f2a48e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#7  0x00007f57d5f2cfe6 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#8  0x00007f57d6079e02 in cuMemcpyDtoHAsync_v2 () from /usr/lib64/nvidia/libcuda.so.1
#9  0x00007f57dae8b4bf in ?? () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f57dae68573 in ?? () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f57daea1d86 in cudaMemcpyAsync () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f573d3b9233 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
   from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f573d3bbbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f573c3f0902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f57db6d8685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
   from /path/to/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f57ddaef92a in at::native::item(at::Tensor const&) () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f57dddf9e15 in at::TypeDefault::item(at::Tensor const&) const () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f57db8d2418 in torch::autograd::VariableType::item(at::Tensor const&) const () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f57e7c701f6 in torch::autograd::dispatch_to_CDouble(at::Tensor const&) () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f57e7c70646 in torch::autograd::THPVariable_item(_object*, _object*) () from /path/to/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055f1b7bd279a in _PyCFunction_FastCallDict ()
#22 0x000055f1b7c61acc in call_function ()
#23 0x000055f1b7c844ba in _PyEval_EvalFrameDefault ()
...
#58 0x000055f1b7c85279 in _PyEval_EvalFrameDefault ()
#59 0x000055f1b7c5ca39 in PyEval_EvalCodeEx ()
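
In case it helps someone reproduce the diagnosis, the trace above came from attaching gdb to the stuck trainer process, roughly like this (the PID is whatever ps shows for the hung python process):

gdb -p <trainer PID>
(gdb) bt                    # backtrace of the current thread (shown above)
(gdb) thread apply all bt   # backtraces of all threads, handy for spotting the deadlock
(gdb) detach
(gdb) quit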

Interesting: on NCCL 2.4.2 I don't seem to have the issue.

We will check the other servers.
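
One quick way to see which NCCL build each server's PyTorch is actually using (assuming the NCCL bundled with torch; on older releases torch.cuda.nccl.version() may print an encoded integer such as 2402 rather than a tuple):

python -c "import torch; print(torch.cuda.nccl.version())"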

For what it's worth, I found that the latest version of PyTorch fixes this. When installing PyTorch via pip or conda, a lot of the CUDA code comes precompiled. It must be a bug in the earlier versions of that CUDA code which gets exercised by OpenNMT-py's newer producer/consumer strategy.

So basically update torch if this happens.
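
For example, something along these lines (exact package names and channels depend on your install; this is just a sketch):

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # check current torch and its bundled CUDA
pip install --upgrade torch   # or: conda update pytorch -c pytorch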
