I am currently running into the same issue. I am training two fairly large models with two GPUs per model (four GPUs in total). Everything ran smoothly until training got stuck for hours with the following message:
[2024-06-01 05:19:15,193 INFO] Step 9900/120000; acc: 36.7; ppl: 91.9; xent: 4.5; lr: 0.00089; sents: 193594; bsz: 3711/1720/121; 108987/50508 tok/s; 5372 sec;
[2024-06-01 05:20:08,797 INFO] Step 10000/120000; acc: 36.9; ppl: 91.1; xent: 4.5; lr: 0.00088; sents: 195604; bsz: 3651/1792/122; 108991/53475 tok/s; 5425 sec;
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=60104, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=60000) ran for 60910 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd2b21f66e6 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd2b21f9c3d in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd2b21fa839 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd2fc581d87 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x7fd2b1f50b11 in /home/user/sinarech/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd44a3 (0x7fd2fccd44a3 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x89134 (0x7fd2ff99d134 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x1097dc (0x7fd2ffa1d7dc in /lib/x86_64-linux-gnu/libc.so.6)
[2024-06-01 08:19:10,826 INFO] valid stats calculation
took: 10742.02825975418 s.
[2024-06-01 08:19:10,828 INFO] Train perplexity: 156.138
[2024-06-01 08:19:10,830 INFO] Train accuracy: 30.8526
[2024-06-01 08:19:10,830 INFO] Sentences processed: 8.77107e+06
[2024-06-01 08:19:10,830 INFO] Average bsz: 3757/1701/110
[2024-06-01 08:19:10,832 INFO] Validation perplexity: 153.026
[2024-06-01 08:19:10,832 INFO] Validation accuracy: 32.316
[2024-06-01 08:19:10,833 INFO] Model is improving ppl: inf --> 153.026.
[2024-06-01 08:19:10,835 INFO] Model is improving acc: -inf --> 32.316.
What could be causing this problem, and how can I fix it?
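From what I can tell, the "Timeout(ms)=60000" in the error is the torch.distributed process-group timeout that the NCCL watchdog enforces on every collective. As a point of reference only (OpenNMT-py initializes its own process group internally, so this is just a hedged sketch of that knob in plain torch.distributed, not my actual training code), raising it would look roughly like this:

from datetime import timedelta

import torch
import torch.distributed as dist

def init_with_longer_timeout() -> None:
    # Any collective that waits longer than this timeout is aborted by the
    # NCCL watchdog, which then takes the whole rank down (as in the log).
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(hours=4),  # illustrative value, not a recommendation
    )
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())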
This is my config file in case it helps:
# Configuration 1
save_data: /home/user/sinarech/NMT/NMT/models/config_1
log_file: /home/user/sinarech/NMT/NMT/models/config_1/train.log
save_model: /home/user/sinarech/NMT/NMT/models/config_1/model
## Where the vocab(s) will be written
src_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.src
tgt_vocab: /home/user/sinarech/NMT/NMT/models/config_1/vocab.tgt
src_vocab_size: 8192
tgt_vocab_size: 8192
# Allow overwriting existing files in the folder
overwrite: True
# Corpus opts:
data:
    corpus_1:
        path_src: source_down_train_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_train_downscaled_GPT2_BERT_8192.txt
    valid:
        path_src: source_down_val_downscaled_GPT2_BERT_8192.txt
        path_tgt: target_down_val_downscaled_GPT2_BERT_8192.txt
save_checkpoint_steps: 1000
keep_checkpoint: 10
seed: 3435
train_steps: 120000
valid_steps: 10000
warmup_steps: 8000
report_every: 100
early_stopping: 4
decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
hidden_size: 512
layers: 6
transformer_ff: 2048
heads: 8
model_dtype: "fp16"
accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'
world_size: 2
gpu_ranks: [0,1]
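For completeness, here is a hypothetical, self-contained two-rank sketch (not my training code; the file name and the 60 s timeout are made up to mirror the error) of the failure mode I believe I am hitting: one rank stalls in a long computation, the other rank sits in an all_gather until the process-group timeout expires, and its NCCL watchdog kills the process exactly as in the log above, where the "valid stats calculation" on one rank took ~10742 s while the timeout was 60000 ms.

# repro_watchdog.py -- hypothetical repro; run with: torchrun --nproc_per_node=2 repro_watchdog.py
import time
from datetime import timedelta

import torch
import torch.distributed as dist

def main() -> None:
    # Deliberately short timeout so the watchdog fires quickly.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=60))
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    if rank == 0:
        time.sleep(120)  # stand-in for a long single-rank "valid stats calculation"

    # Rank 1 reaches this collective immediately and waits for rank 0; after
    # ~60 s its NCCL watchdog aborts with a collective-operation timeout.
    x = torch.tensor([rank], device="cuda")
    out = [torch.empty_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(out, x)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()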