I am currently running into the same issue. I am training two fairly large models with two GPUs per model (four GPUs in total). Everything ran smoothly until training got stuck for hours with the following message:
[2024-06-01 05:19:15,193 INFO] Step 9900/120000; acc: 36.7; ppl: 91.9; xent: 4.5; lr: 0.00089; sents: 193594; bsz: 3711/1720/121; 108987/50508 tok/s; 5372 sec;
[2024-06-01 05:20:08,797 INFO] Step 10000/120000; acc: 36.9; ppl: 91.1; xent: 4.5; lr: 0.00088; sents: 195604; bsz: 3651/1792/122; 108991/53475 tok/s; 5425 sec;
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7fd2b1f50b11 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
[2024-06-01 08:19:10,826 INFO] valid stats calculation
took: 10742.02825975418 s.
[2024-06-01 08:19:10,828 INFO] Train perplexity: 156.138
[2024-06-01 08:19:10,830 INFO] Train accuracy: 30.8526
[2024-06-01 08:19:10,830 INFO] Sentences processed: 8.77107e+06
[2024-06-01 08:19:10,830 INFO] Average bsz: 3757/1701/110
[2024-06-01 08:19:10,832 INFO] Validation perplexity: 153.026
[2024-06-01 08:19:10,832 INFO] Validation accuracy: 32.316
[2024-06-01 08:19:10,833 INFO] Model is improving ppl: inf --> 153.026.
[2024-06-01 08:19:10,835 INFO] Model is improving acc: -inf --> 32.316.
What could be causing this problem, and how can I fix it?
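From what I can tell, the "Timeout(ms)=60000" in the error is the torch.distributed process-group timeout that the NCCL watchdog enforces on every collective. As a point of reference only (OpenNMT-py initializes its own process group internally, so this is just a hedged sketch of that knob in plain torch.distributed, not my actual training code), raising it would look roughly like this:

from datetime import timedelta

import torch
import torch.distributed as dist

def init_with_longer_timeout() -> None:
    # Any collective that waits longer than this timeout is aborted by the
    # NCCL watchdog, which then takes the whole rank down (as in the log).
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(hours=4),  # illustrative value, not a recommendation
    )
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())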
This is my config file in case it helps:
# Configuration 1
save_data: /home/user/sinarech/NMT/NMT/models/config_1
log_file: /home/user/sinarech/NMT/NMT/models/config_1/train.log
save_model: /home/user/sinarech/NMT/NMT/models/config_1/model
## Where the vocab(s) will be written
src_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.src
tgt_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.tgt
src_vocab_size: 8192
tgt_vocab_size: 8192
# Allow overwriting existing files in the folder
overwrite: True
# Corpus opts:
data:
    corpus_1:
        path_src: source_down_train_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_train_downscaled_GPT2_BERT_8192.txt
    valid:
        path_src: source_down_val_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_val_downscaled_GPT2_BERT_8192.txt
save_checkpoint_steps: 1000
keep_checkpoint: 10
seed: 3435
train_steps: 120000
valid_steps: 10000
warmup_steps: 8000
report_every: 100
early_stopping: 4
decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
hidden_size: 512
layers: 6
transformer_ff: 2048
heads: 8
model_dtype: "fp16"
accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'
world_size: 2
gpu_ranks: [0,1]
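For completeness, here is a hypothetical, self-contained two-rank sketch (not my training code; the file name and the 60 s timeout are made up to mirror the error) of the failure mode I believe I am hitting: one rank stalls in a long computation, the other rank sits in an all_gather until the process-group timeout expires, and its NCCL watchdog kills the process exactly as in the log above, where the "valid stats calculation" on one rank took ~10742 s while the timeout was 60000 ms.

# repro_watchdog.py -- hypothetical repro; run with: torchrun --nproc_per_node=2 repro_watchdog.py
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    # Deliberately short timeout so the watchdog fires quickly.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=60))
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    if rank == 0:
        time.sleep(120)  # stand-in for a long single-rank "valid stats calculation"

    # Rank 1 reaches this collective immediately and waits for rank 0; after
    # ~60 s its NCCL watchdog aborts with a collective-operation timeout.
    x = torch.tensor([rank], device="cuda")
    out = [torch.empty_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(out, x)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()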