OpenNMT Forum

Training stuck (multi GPU, transformer)


I am running a transformer on multiple GPUs (4 in total).
I use the following command/setup:

python $OPENNMT/
-data $ENGINEDIR/data/ready_to_train -save_model $MODELDIR/model
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8
-encoder_type transformer -decoder_type transformer -position_encoding
-train_steps 9750 -max_generator_batches 2 -dropout 0.1
-batch_size 4069 -batch_type tokens -normalization tokens -accum_count 2
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 2
-max_grad_norm 0 -param_init 0 -param_init_glorot
-label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 250
-report_every 100
-world_size 4 -gpu_ranks 0 1 2 3
-log_file $MODELDIR/train.log

What happens is that it halts (gets stuck) at ‘Start training loop without validation…’

[2019-08-28 11:08:26,181 INFO] encoder: 42993664
[2019-08-28 11:08:26,181 INFO] decoder: 75336441
[2019-08-28 11:08:26,181 INFO] * number of parameters: 118330105
[2019-08-28 11:08:26,183 INFO] Starting training on GPU: [0, 1, 2, 3]
[2019-08-28 11:08:26,183 INFO] Start training loop without validation…

Any ideas?

Kind regards,

The interesting thing is that sometimes it works and sometimes it doesn’t.

Thanks for any suggestions or ideas.

By the way, I am running the version of OpenNMT-py I pulled on 26th Aug.

Did you try with the latest PyTorch version?

Hi Guillaume,

I tried with 1.1 and 1.2; python 3.7 and python 3.6;…
My current setup is: torch 1.1 and python 3.7.3 and cuda 10.0 on 435.21 nvidia driver.
Still, sometimes it works just as it is supposed to, and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory of all GPUs is consumed and nothing progresses.

But, again, sometimes it just works.

A clarification: I replicated the conda environment and updated torch to 1.2. In that environment, the training gets stuck every time.

[2019-09-13 08:45:58,532 INFO] Start training loop without validation...

And it stays there forever.

Maybe @vince62s knows more about this issue.

This does not look like the same issue where it hangs at a given step in the training.

Yours looks more like an issue with the initial distributed steps.
I would set the verbose level to 2 to print what is happening.

I’m having the same issue. I set verbose to level 2, and the only additional information I get is:

[2019-11-05 03:29:13,333 INFO] Start training loop without validation...
[2019-11-05 03:29:21,467 INFO] number of examples: 100000
/home/rawkintrevo/.local/lib/python3.6/site-packages/torchtext/data/ UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)
[2019-11-05 03:29:28,189 INFO] GpuRank 0: index: 0
[2019-11-05 03:29:28,189 INFO] GpuRank 0: reduce_counter: 1                             n_minibatch 1

So far I have not been able to get it to work even occasionally (as OP does).
I’m on Torch 1.2 / Python 3.6 / CUDA 10.1 / NVIDIA driver 430.26, FWIW. I’m also seeing shockingly low GPU utilization.


(EDIT: updated original post removing GPU Rank 1: output which was a result of me monkeying in the source code per

What command line are you running?

Sorry, I just fixed it. My issue was with torch; fairseq would also hang.

Some diagnostics I did to solve my problem:

NCCL is what multi-GPU training uses and single-GPU training does not, so I started there. I got it squawking with
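(The exact command is missing from this post. As a minimal sketch, and an assumption rather than the poster's confirmed command, NCCL's documented `NCCL_DEBUG` environment variable is one way to make it log what it is doing. `NCCL_DEBUG_SUBSYS` is optional and only narrows the output; the subsystem list chosen here is illustrative.)

```shell
# Assumption: this is one common way to get verbose NCCL output,
# not necessarily the command used in the original post.
export NCCL_DEBUG=INFO

# Optional: restrict logging to initialization and collective calls,
# which is where operations such as AllGather would show up.
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Confirm what the training process will inherit, then rerun the
# same training command from the same shell.
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

The variables have to be exported in the shell that launches training so that every spawned GPU worker process inherits them.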


From there we see it’s actually hanging on AllGather operations. So as a quick nuclear option I did:


Which worked (but is not ideal).
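(The command for the nuclear option is also missing from this post. One setting that fits the description, offered as an assumption and not as the poster's confirmed command, is NCCL's documented `NCCL_P2P_DISABLE` variable, which turns off direct GPU-to-GPU transfers altogether.)

```shell
# Assumption: a blunt workaround that matches the "nuclear option"
# described above; the original command did not survive in the post.
# Disables NCCL peer-to-peer transport, so GPU-to-GPU traffic falls
# back to slower paths such as shared memory.
export NCCL_P2P_DISABLE=1

# Confirm the value the training process will inherit.
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

This would explain the "not ideal" caveat: with peer-to-peer off, communication between GPUs is usually slower, so it is better treated as a diagnostic step than as a permanent setting.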

A bit of monkeying around later, I found I was able to turn NCCL P2P back on and set the level to 2 (basically don’t let it go through the CPU).

export NCCL_P2P_LEVEL=2
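(A small sketch of applying this setting cleanly. It assumes a blunter switch such as `NCCL_P2P_DISABLE` may still be exported from earlier experiments in the same shell; that variable is an assumption here, since the post does not show which one was used. If it were left set, peer-to-peer would presumably stay off regardless of the level.)

```shell
# Clear the blunt switch first, in case it is still exported from an
# earlier experiment (assumption: it may not have been set at all).
unset NCCL_P2P_DISABLE

# The setting from the post: limit how far apart two GPUs may be in
# the PCIe topology and still use peer-to-peer transfers.
export NCCL_P2P_LEVEL=2

# Show what a child process, such as the training job, would inherit.
sh -c 'echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE:-unset} NCCL_P2P_LEVEL=${NCCL_P2P_LEVEL:-unset}"'
```

Checking the inherited values from a child shell is a cheap way to rule out a stale variable before starting a long multi-GPU run.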

I’m still having OpenNMT issues, but fairseq works, and the new OpenNMT issues I am fairly sure are unrelated to this thread. @dimitarsh1 that might help you(?).