Training stuck (multi-GPU, transformer)

Hello,

I am running a transformer on multiple GPUs (4 in total).
I use the following command/setup:

python $OPENNMT/train.py
-data $ENGINEDIR/data/ready_to_train -save_model $MODELDIR/model
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8
-encoder_type transformer -decoder_type transformer -position_encoding
-train_steps 9750 -max_generator_batches 2 -dropout 0.1
-batch_size 4069 -batch_type tokens -normalization tokens -accum_count 2
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 2
-max_grad_norm 0 -param_init 0 -param_init_glorot
-label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 250
-report_every 100
-world_size 4 -gpu_ranks 0 1 2 3
-log_file $MODELDIR/train.log

What happens is that it halts (gets stuck) at ‘Start training loop without validation…’:

[2019-08-28 11:08:26,181 INFO] encoder: 42993664
[2019-08-28 11:08:26,181 INFO] decoder: 75336441
[2019-08-28 11:08:26,181 INFO] * number of parameters: 118330105
[2019-08-28 11:08:26,183 INFO] Starting training on GPU: [0, 1, 2, 3]
[2019-08-28 11:08:26,183 INFO] Start training loop without validation…

Any ideas?

Kind regards,
Dimitar

The interesting thing is that, actually, it sometimes works and sometimes doesn’t.

Thanks for any suggestions or ideas.

By the way, I am running the version of OpenNMT-py I pulled on 26 Aug.

Did you try with the latest PyTorch version?

Hi Guillaume,

I tried with PyTorch 1.1 and 1.2, and with Python 3.7 and 3.6…
My current setup is torch 1.1, Python 3.7.3, and CUDA 10.0 on the 435.21 NVIDIA driver.
Sometimes it still works as it is supposed to, and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory of all GPUs is consumed and nothing progresses.

But, again, sometimes it just works.

A clarification: I replicated the conda environment and updated torch to 1.2. The training gets stuck every time.

[2019-09-13 08:45:58,532 INFO] Start training loop without validation...

And it stays there forever.

Maybe @vince62s knows more about this issue.

This does not look like the same issue where it hangs at a given step in the training.

Yours looks more like an issue with the initial distributed steps.
I would set the verbose level to 2 to print what is happening.

I’m having the same issue. I set verbose to level 2; the only additional information I get is:

[2019-11-05 03:29:13,333 INFO] Start training loop without validation...
[2019-11-05 03:29:21,467 INFO] number of examples: 100000
/home/rawkintrevo/.local/lib/python3.6/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)
[2019-11-05 03:29:28,189 INFO] GpuRank 0: index: 0
[2019-11-05 03:29:28,189 INFO] GpuRank 0: reduce_counter: 1                             n_minibatch 1

So far I have not been able to get it to work even occasionally (as OP does).
I’m on Torch 1.2 / Python 3.6 / CUDA 10.1 / NVIDIA driver 430.26, FWIW. I’m also seeing shockingly low GPU utilization.

Thoughts?

(EDIT: updated the original post, removing the GPU Rank 1 output, which was a result of me monkeying around in the source code per https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/utils/distributed.py#L29)

What command line are you running?

Sorry, I just fixed it. My issue was with torch; fairseq would also hang.

Some diagnostics I did to solve my problem:

NCCL is what is used in the multi-GPU (vs. single-GPU) case, so I started there. I got it squawking with:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

From there we see it’s actually hanging on AllGather operations. So as a quick nuclear option I did:

export NCCL_P2P_DISABLE=1

Which worked (but is not ideal).
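
Before settling on that, a bare torch.distributed run can help confirm the hang is in NCCL itself rather than in OpenNMT. This is a minimal sketch of my own (not OpenNMT-py code), assuming a single node with one process per GPU; run it with the NCCL_DEBUG variables above exported:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous over localhost; 29500 is just an arbitrary free port.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # One all_gather is enough: if NCCL P2P transport is broken, this call never returns.
    tensor = torch.full((1024,), float(rank), device="cuda")
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    print(f"rank {rank}: all_gather OK, got ranks {[int(t[0]) for t in gathered]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)

If this script also hangs on the all_gather, the problem is below OpenNMT (NCCL, driver, or PCIe topology), which is what the debug output pointed to in my case.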

A bit of monkeying around later, I found I was able to turn NCCL_P2P back on and just restrict the level to 2 (basically, don’t let traffic go through the CPU).

export NCCL_P2P_LEVEL=2
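
Whether a given pair of GPUs can use peer-to-peer at all can also be checked from PyTorch before fiddling with the NCCL variables; another small sketch of mine, not something from OpenNMT or the NCCL docs:

import torch

# Print the peer-access matrix: pairs reporting False cannot use direct
# GPU-to-GPU copies and have to fall back to going through the host.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'NO'}")

In my understanding NCCL can still hang even when peer access is reported as available (e.g. when IOMMU/ACS settings silently break the actual transfers), which is why limiting NCCL_P2P_LEVEL or disabling P2P is the more direct workaround.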

I’m still having OpenNMT issues, but fairseq works, and I am fairly sure the new OpenNMT issues are unrelated to this thread. @dimitarsh1, that might help you(?).

I’m having the same problem. Training works well with OpenNMT models (RNN, Transformer, etc.), but when I try to include a Fairseq model in the OpenNMT pipeline this problem occurs. My node setup is PyTorch 1.2/1.4 and CUDA 10.0. Fiddling with NCCL settings didn’t help.
I ran the same code on another node with PyTorch 1.3.1 and it works.
I’m still not quite sure what the key factor is, but I hope my info helps.

Has anyone definitively solved this issue? I am experiencing the same problem. I am on a shared server, so I would hazard a guess that something was updated without my knowledge which broke OpenNMT-py. Any suggestions? It seems to work fine when I train on a single GPU, but it gets stuck in the same position as OP when I use multi-GPU. Thanks for any help you can provide!

CUDA = 10.2
PyTorch = 1.3.1

CUDA_VISIBLE_DEVICES=2,3,4 python3 train.py -data $DATA_DIR -save_model $SAVE_DIR -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 2048 -batch_type tokens -normalization tokens -accum_count 1 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -valid_batch_size 16 -save_checkpoint_steps 10000 -early_stopping 4 -world_size 3 -gpu_ranks 0 1 2