I tried with torch 1.1 and 1.2, and with Python 3.7 and Python 3.6;…
My current setup is torch 1.1, Python 3.7.3, and CUDA 10.0 on NVIDIA driver 435.21.
Still, sometimes it works as expected and sometimes it gets stuck when loading the data. I can see through nvidia-smi that only 10% of the memory on each GPU is in use and nothing progresses.
I’m having the same issue. I set verbosity to level 2; the only additional information I get is:
[2019-11-05 03:29:13,333 INFO] Start training loop without validation...
[2019-11-05 03:29:21,467 INFO] number of examples: 100000
/home/rawkintrevo/.local/lib/python3.6/site-packages/torchtext/data/field.py:359: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
var = torch.tensor(arr, dtype=self.dtype, device=device)
[2019-11-05 03:29:28,189 INFO] GpuRank 0: index: 0
[2019-11-05 03:29:28,189 INFO] GpuRank 0: reduce_counter: 1 n_minibatch 1
So far I have not been able to get it to work even occasionally (as OP does).
I’m on Torch 1.2 / Python 3.6 / CUDA 10.1 / NVIDIA driver 430.26, FWIW. I’m also seeing shockingly low GPU utilization.
From there we see it’s actually hanging on AllGather operations. So as a quick nuclear option I did:
export NCCL_P2P_DISABLE=1
That worked, but it’s not ideal.
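To confirm it’s the collective itself and not OpenNMT, a tiny standalone all-gather smoke test can help: if this hangs on the affected node, the problem is in NCCL/topology. This is only a sketch — it defaults to the gloo backend with a single process so it runs anywhere; on the multi-GPU node you’d launch one process per GPU and pass `backend="nccl"` to exercise the same code path the training loop hangs on.

```python
import torch
import torch.distributed as dist

def allgather_smoke_test(rank=0, world_size=1, backend="gloo"):
    """Run one all-gather; if training hangs on AllGather, this should too.

    Defaults (gloo, single process) run anywhere; on the affected node,
    launch one process per GPU with backend="nccl" instead.
    """
    dist.init_process_group(
        backend=backend,
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    local = torch.full((4,), float(rank))            # each rank contributes its id
    gathered = [torch.zeros(4) for _ in range(world_size)]
    dist.all_gather(gathered, local)                 # the op the training loop hangs on
    dist.destroy_process_group()
    return gathered
```

If the NCCL variant stalls here too, you can iterate on `NCCL_*` environment variables against this script instead of restarting full training runs.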
A bit of monkeying around later, I found I was able to turn NCCL_P2P back on and just cap its level at 2 (basically, don’t let traffic go through the CPU):
export NCCL_P2P_LEVEL=2
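Note these variables have to be in the environment before NCCL initializes. If you’d rather not touch the launch script, a sketch of setting them at the very top of the training entry point (assumption: this runs before any `torch.distributed`/NCCL call is made):

```python
import os

# Must be set before the first NCCL call (i.e., before torch.distributed init).
# Pick one of the two P2P settings:
os.environ["NCCL_P2P_LEVEL"] = "2"      # cap peer-to-peer at topology level 2
# os.environ["NCCL_P2P_DISABLE"] = "1"  # nuclear option: no P2P at all

os.environ["NCCL_DEBUG"] = "INFO"       # make NCCL log its transport choices
```

With `NCCL_DEBUG=INFO` set, NCCL prints which transports it actually selected, which is an easy way to confirm the setting took effect.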
I’m still having OpenNMT issues, but fairseq works, and I’m fairly sure the new OpenNMT issues are unrelated to this thread. @dimitarsh1, that might help you(?).
Having the same problem. Training works fine with OpenNMT models (RNN, Transformer, etc.), but when I tried to include a fairseq model in the OpenNMT pipeline this problem occurred. My node runs PyTorch 1.2/1.4 and CUDA 10.0. Fiddling with the NCCL settings didn’t help.
I ran the same code on another node with PyTorch 1.3.1 and it works.
Still not quite sure what the key difference is, but I hope my info helps.
Has anyone definitively solved this issue? I am experiencing the same problem. I am on a shared server, so I would hazard a guess that something was updated without my knowledge and broke OpenNMT-py. Any suggestions? It works fine when I train on a single GPU but gets stuck in the same position as OP when I use multiple GPUs. Thanks for any help you can provide!