Training stuck (multi-GPU, transformer)

Hello

I am running a transformer on multiple GPUs (4 in total).
I use the following command/setup:

python $OPENNMT/train.py \
-data $ENGINEDIR/data/ready_to_train -save_model $MODELDIR/model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 9750 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4069 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 250 \
-report_every 100 \
-world_size 4 -gpu_ranks 0 1 2 3 \
-log_file $MODELDIR/train.log

What happens is that it hangs (gets stuck) at ‘Start training loop without validation…’:

[2019-08-28 11:08:26,181 INFO] encoder: 42993664
[2019-08-28 11:08:26,181 INFO] decoder: 75336441
[2019-08-28 11:08:26,181 INFO] * number of parameters: 118330105
[2019-08-28 11:08:26,183 INFO] Starting training on GPU: [0, 1, 2, 3]
[2019-08-28 11:08:26,183 INFO] Start training loop without validation…

Any ideas?

Kind regards,
Dimitar

The interesting thing is that sometimes it works and sometimes it doesn’t.

Thanks for any suggestions or ideas.

By the way, I am running the version of OpenNMT-py I pulled on 26 August.

Did you try with the latest PyTorch version?

Hi Guillaume,

I tried with PyTorch 1.1 and 1.2, and with Python 3.7 and 3.6…
My current setup is torch 1.1, Python 3.7.3, and CUDA 10.0 on NVIDIA driver 435.21.
Sometimes it still works as expected, and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory on each GPU is consumed and nothing progresses.

But, again, sometimes it just works.

A clarification: I replicated the conda environment and updated torch to 1.2. Now the training gets stuck every time.

[2019-09-13 08:45:58,532 INFO] Start training loop without validation...

And it stays there forever.

Maybe @vince62s knows more about this issue.

This does not look like the same issue where it hangs at a given step in the training.

Yours looks more like an issue with the initial distributed steps.
I would set the verbose level to 2 to print what is happening.

I’m having the same issue. I set verbose to level 2; the only additional information I get is:

[2019-11-05 03:29:13,333 INFO] Start training loop without validation...
[2019-11-05 03:29:21,467 INFO] number of examples: 100000
/home/rawkintrevo/.local/lib/python3.6/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  var = torch.tensor(arr, dtype=self.dtype, device=device)
[2019-11-05 03:29:28,189 INFO] GpuRank 0: index: 0
[2019-11-05 03:29:28,189 INFO] GpuRank 0: reduce_counter: 1                             n_minibatch 1

So far I have not been able to get it to work even occasionally (as OP does).
I’m on Torch 1.2 / Python 3.6 / CUDA 10.1 / NVIDIA driver 430.26, FWIW. I’m also seeing shockingly low GPU utilization.

Thoughts?

(EDIT: updated the original post, removing the GPU Rank 1 output, which was a result of me monkeying with the source code per https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/utils/distributed.py#L29)

What command line are you running?

Sorry, I just fixed it. My issue was with torch; fairseq would also hang.

Some diagnostics I did to solve my problem:

NCCL is what is used for multi-GPU (vs. single-GPU) training, so I started there. I got it squawking with:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

From there we see it’s actually hanging on AllGather operations. So, as a quick nuclear option, I did:

export NCCL_P2P_DISABLE=1

This worked, but it is not ideal since it disables GPU peer-to-peer transfers entirely.

After a bit more monkeying around, I found I was able to turn NCCL_P2P back on and set the level to 2 (basically, don’t let it go through the CPU).

export NCCL_P2P_LEVEL=2

I’m still having OpenNMT issues, but fairseq works, and I’m fairly sure the new OpenNMT issues are unrelated to this thread. @dimitarsh1, that might help you(?).
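
If it helps anyone else: a quick way to check whether the hang is in NCCL itself rather than in OpenNMT is a bare-bones all_gather test in plain PyTorch. This is only a sketch (it assumes 4 local GPUs and that port 29500 is free); if this also hangs, the problem is in NCCL / the driver / the P2P topology, not in the training code.

# Minimal NCCL all_gather smoke test (independent of OpenNMT-py)
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4  # adjust to your GPU count

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"   # assumed free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.full((1,), float(rank), device="cuda")
    out = [torch.zeros(1, device="cuda") for _ in range(world_size)]
    dist.all_gather(out, t)  # the collective that was hanging above
    print(f"rank {rank}: {[x.item() for x in out]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(WORLD_SIZE,), nprocs=WORLD_SIZE)

Run it with the same NCCL_DEBUG exports as above to see which transport the hang happens on.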

Having the same problem. Training works well with OpenNMT models (RNN, Transformer, etc.), but this problem occurs when I try to include a Fairseq model in the OpenNMT pipeline. My node setup is PyTorch 1.2/1.4 and CUDA 10.0, and fiddling with NCCL settings didn’t help.
I ran the same code on another node with PyTorch 1.3.1 and it works.
I am still not quite sure what the key factor is, but I hope my info helps.

Has anyone definitively solved this issue? I am experiencing the same problem. I am on a shared server, so I would hazard a guess that something was updated without my knowledge and broke OpenNMT-py. Any suggestions? It works fine when I train on a single GPU but gets stuck at the same point as OP when I use multi-GPU. Thanks for any help you can provide!

CUDA = 10.2
PyTorch = 1.3.1

CUDA_VISIBLE_DEVICES=2,3,4 python3 train.py \
-data $DATA_DIR -save_model $SAVE_DIR \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 2048 -batch_type tokens -normalization tokens -accum_count 1 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -valid_batch_size 16 -save_checkpoint_steps 10000 \
-early_stopping 4 -world_size 3 -gpu_ranks 0 1 2

I am currently having the same issue. I am training two quite big models, using two GPUs per model (4 GPUs overall). Everything went smoothly until the training got stuck for hours with the following messages:

[2024-06-01 05:19:15,193 INFO] Step 9900/120000; acc: 36.7; ppl:  91.9; xent: 4.5; lr: 0.00089; sents:  193594; bsz: 3711/1720/121; 108987/50508 tok/s;   5372 sec;
[2024-06-01 05:20:08,797 INFO] Step 10000/120000; acc: 36.9; ppl:  91.1; xent: 4.5; lr: 0.00088; sents:  195604; bsz: 3651/1792/122; 108991/53475 tok/s;   5425 sec;
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7fd2b1f50b11 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)

[2024-06-01 08:19:10,826 INFO] valid stats calculation
                           took: 10742.02825975418 s.
[2024-06-01 08:19:10,828 INFO] Train perplexity: 156.138
[2024-06-01 08:19:10,830 INFO] Train accuracy: 30.8526
[2024-06-01 08:19:10,830 INFO] Sentences processed: 8.77107e+06
[2024-06-01 08:19:10,830 INFO] Average bsz: 3757/1701/110
[2024-06-01 08:19:10,832 INFO] Validation perplexity: 153.026
[2024-06-01 08:19:10,832 INFO] Validation accuracy: 32.316
[2024-06-01 08:19:10,833 INFO] Model is improving ppl: inf --> 153.026.
[2024-06-01 08:19:10,835 INFO] Model is improving acc: -inf --> 32.316.

What could potentially cause this problem, and how can I fix it?
This is my config file in case it helps:

# Configuration 1
save_data: /home/user/sinarech/NMT/NMT/models/config_1
log_file: /home/user/sinarech/NMT/NMT/models/config_1/train.log
save_model: /home/user/sinarech/NMT/NMT/models/config_1/model
## Where the vocab(s) will be written
src_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.src
tgt_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.tgt

src_vocab_size: 8192
tgt_vocab_size: 8192

# Allow overwriting existing files in the folder
overwrite: True

# Corpus opts:
data:
    corpus_1:
        path_src: source_down_train_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_train_downscaled_GPT2_BERT_8192.txt
    valid:
        path_src: source_down_val_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_val_downscaled_GPT2_BERT_8192.txt

save_checkpoint_steps: 1000
keep_checkpoint: 10
seed: 3435
train_steps: 120000
valid_steps: 10000
warmup_steps: 8000
report_every: 100
early_stopping: 4

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
hidden_size: 512
layers: 6
transformer_ff: 2048
heads: 8

model_dtype: "fp16"
accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 2
gpu_ranks: [0,1]
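
For what it’s worth, the Timeout(ms)=60000 in the traceback is the NCCL watchdog limit on a single collective, and the validation above took ~10742 s, so it looks like the other rank’s ALLGATHER simply times out while validation is still running. At the raw PyTorch level that limit is the timeout argument of init_process_group; I am not sure which OpenNMT-py option (if any) maps to it, so the snippet below is only a sketch of the underlying knob, not of OpenNMT-py’s own code:

# Sketch: raising the NCCL collective timeout at the PyTorch level.
# OpenNMT-py calls init_process_group internally, so the equivalent change
# may need to go through its config or source rather than a standalone script.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="env://",        # assumes MASTER_ADDR / MASTER_PORT are set
    world_size=2,
    rank=0,                      # per-process rank; shown for one rank only
    timeout=timedelta(hours=4),  # long enough to cover a slow validation pass
)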