OpenNMT Forum

Training stuck (multi GPU, transformer)


I am training a Transformer on multiple GPUs (4 in total) with the following command:

python $OPENNMT/
-data $ENGINEDIR/data/ready_to_train -save_model $MODELDIR/model
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8
-encoder_type transformer -decoder_type transformer -position_encoding
-train_steps 9750 -max_generator_batches 2 -dropout 0.1
-batch_size 4069 -batch_type tokens -normalization tokens -accum_count 2
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 2
-max_grad_norm 0 -param_init 0 -param_init_glorot
-label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 250
-report_every 100
-world_size 4 -gpu_ranks 0 1 2 3
-log_file $MODELDIR/train.log

Training halts (gets stuck) right after the ‘Start training loop without validation…’ message:

[2019-08-28 11:08:26,181 INFO] encoder: 42993664
[2019-08-28 11:08:26,181 INFO] decoder: 75336441
[2019-08-28 11:08:26,181 INFO] * number of parameters: 118330105
[2019-08-28 11:08:26,183 INFO] Starting training on GPU: [0, 1, 2, 3]
[2019-08-28 11:08:26,183 INFO] Start training loop without validation…

Any ideas?

Kind regards,

The interesting thing is that sometimes it works and sometimes it doesn’t.

Thanks for any suggestions or ideas.

By the way, I am running the version of OpenNMT-py I pulled on 26 August.

Did you try with the latest PyTorch version?

Hi Guillaume,

I tried with PyTorch 1.1 and 1.2; Python 3.7 and Python 3.6;…
My current setup is torch 1.1, Python 3.7.3 and CUDA 10.0 on NVIDIA driver 435.21.
Still, sometimes it works as expected and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory on each GPU is in use and nothing progresses.

But, again, sometimes it just works.

A clarification: I replicated the conda environment and updated torch to 1.2. With that setup, training gets stuck every time.

[2019-09-13 08:45:58,532 INFO] Start training loop without validation...

And it stays there forever.
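
To tell a genuinely hung run from a slow start, one option is to check how old the last log line is. Below is a minimal sketch, assuming the log keeps the `[YYYY-MM-DD HH:MM:SS,mmm INFO]` prefix shown above. The names `last_log_time` and `is_stalled` and the 10-minute limit are hypothetical choices, not part of OpenNMT-py.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix of lines such as
# "[2019-09-13 08:45:58,532 INFO] Start training loop without validation..."
LOG_TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) INFO\]")


def last_log_time(lines):
    """Return the timestamp of the last line that carries one, or None."""
    last = None
    for line in lines:
        match = LOG_TS.match(line)
        if match:
            base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            last = base + timedelta(milliseconds=int(match.group(2)))
    return last


def is_stalled(lines, now, limit=timedelta(minutes=10)):
    """True if the newest timestamped line is older than `limit` at time `now`."""
    last = last_log_time(lines)
    return last is not None and now - last > limit
```

In practice `lines` would come from reading `$MODELDIR/train.log` and `now` from `datetime.now()`. With `-report_every 100`, a healthy run should add new lines regularly, so a long silence after the "Start training loop" message points to a hang rather than slow progress.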

Maybe @vince62s knows more about this issue.

This does not look like the same issue as the one where training hangs at a given step.

Yours looks more like an issue with the initial distributed steps.
I would set the verbose level to 2 to print what is happening.
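
A rough sketch of how one might collect more output during the distributed start-up. None of the following comes from this thread, so treat it as assumptions: `NCCL_DEBUG=INFO` is NCCL's own logging switch and only helps if NCCL is the backend in use. `-gpu_verbose_level` is my reading of which OpenNMT-py option "verbose level" refers to. `debug_env` and `with_verbose` are hypothetical helper names.

```python
import os


def debug_env(base=None):
    """Copy an environment and switch on NCCL's own logging."""
    env = dict(os.environ if base is None else base)
    # Assumption: the multi-GPU communication goes through NCCL.
    env["NCCL_DEBUG"] = "INFO"
    return env


def with_verbose(train_args, level=2):
    """Return a copy of the argument list with the verbosity option appended."""
    # Assumption: -gpu_verbose_level is the option meant by "verbose level".
    return list(train_args) + ["-gpu_verbose_level", str(level)]


# Example: extend part of the original argument list and a toy environment.
args = with_verbose(["-world_size", "4", "-gpu_ranks", "0", "1", "2", "3"])
env = debug_env({"PATH": "/usr/bin"})
```

The resulting `args` and `env` could then be passed to `subprocess.run` together with the usual training command and a `timeout`, so that a hung start-up is cut off and the extra output can be inspected.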