OpenNMT Forum

Training stuck (multi GPU, transformer)

Hello

I am running a transformer on multiple GPUs (4 in total)
I use the following command/setup:

python $OPENNMT/train.py \
-data $ENGINEDIR/data/ready_to_train -save_model $MODELDIR/model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 9750 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4069 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 250 \
-report_every 100 \
-world_size 4 -gpu_ranks 0 1 2 3 \
-log_file $MODELDIR/train.log

What happens is that it gets stuck at 'Start training loop without validation…':

[2019-08-28 11:08:26,181 INFO] encoder: 42993664
[2019-08-28 11:08:26,181 INFO] decoder: 75336441
[2019-08-28 11:08:26,181 INFO] * number of parameters: 118330105
[2019-08-28 11:08:26,183 INFO] Starting training on GPU: [0, 1, 2, 3]
[2019-08-28 11:08:26,183 INFO] Start training loop without validation…

Any ideas?

Kind regards,
Dimitar

The interesting thing is that sometimes it works and sometimes it doesn't.

Thanks for any suggestions or ideas.

By the way, I am running the version of OpenNMT-py I pulled on 26 Aug.

Did you try with the latest PyTorch version?

Hi Guillaume,

I tried with torch 1.1 and 1.2, and with Python 3.7 and Python 3.6…
My current setup is: torch 1.1, Python 3.7.3, and CUDA 10.0 on NVIDIA driver 435.21.
Still, sometimes it works as it is supposed to and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory on each GPU is consumed and nothing progresses.

But, again, sometimes it just works.

A clarification: I replicated the conda environment and updated torch to 1.2. With that setup the training gets stuck every time at:

[2019-09-13 08:45:58,532 INFO] Start training loop without validation...

And it stays there forever.
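For what it's worth, one way to check whether the hang is in PyTorch's NCCL setup rather than in OpenNMT itself is a minimal all_reduce test across the four GPUs, independent of train.py. This is just a sketch under that assumption (4 visible GPUs, NCCL backend, the script name and the port 29500 are arbitrary):

# nccl_check.py -- hypothetical minimal sanity check for multi-GPU NCCL communication
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # rendezvous on localhost; the port is an arbitrary free port
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # each rank contributes its own rank; the all_reduce sum should be 0+1+2+3 = 6
    t = torch.full((1,), float(rank), device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: sum of ranks = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(4,), nprocs=4, join=True)

If this also hangs, the problem is more likely in NCCL or the driver stack than in OpenNMT-py; if it completes, the hang is probably on the OpenNMT side.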

Maybe @vince62s knows more about this issue.

This does not look like the same issue where it hangs at a given step in the training.

Yours looks more like an issue with the initial distributed setup steps.
I would set the verbose level to 2 to print what is happening.
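Besides the log verbosity, NCCL itself can be made verbose through environment variables, which often shows where the initial distributed handshake stalls. For example (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL settings, not OpenNMT options):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
python $OPENNMT/train.py ...   # same options as above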