First, I have verified that PyTorch can see the GPU:
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7f8c0a3cec50>
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'NVIDIA Tesla K80'
However, OpenNMT-py is not finding the GPU. What is going on here?
I am using OpenNMT-py from GitHub at tag 1.2.0 (OpenNMT/OpenNMT-py). Here is the full traceback:
Traceback (most recent call last):
  File "../../OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/root/work/huggingface-models/OpenNMT-py/onmt/bin/train.py", line 197, in main
    train(opt)
  File "/root/work/huggingface-models/OpenNMT-py/onmt/bin/train.py", line 91, in train
    p.join()
  File "/root/miniconda3/envs/open-nmt-env/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/root/miniconda3/envs/open-nmt-env/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/root/miniconda3/envs/open-nmt-env/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/root/work/huggingface-models/OpenNMT-py/onmt/bin/train.py", line 181, in signal_handler
    raise Exception(msg)
Exception:

-- Tracebacks above this line can probably
   be ignored --

Traceback (most recent call last):
  File "/root/work/huggingface-models/OpenNMT-py/onmt/bin/train.py", line 135, in run
    gpu_rank = onmt.utils.distributed.multi_init(opt, device_id)
  File "/root/work/huggingface-models/OpenNMT-py/onmt/utils/distributed.py", line 27, in multi_init
    world_size=dist_world_size, rank=opt.gpu_ranks[device_id])
  File "/root/miniconda3/envs/open-nmt-env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/root/miniconda3/envs/open-nmt-env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
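If it helps with diagnosis, the failing call in onmt/utils/distributed.py is essentially an NCCL process-group initialization. Below is a minimal sketch of that call (the MASTER_ADDR/MASTER_PORT rendezvous settings are placeholders for illustration, not the values OpenNMT-py actually uses); PyTorch raises exactly this "ProcessGroupNCCL is only supported with GPUs" error whenever the process making the call cannot see any CUDA device.

# Minimal sketch of the NCCL process-group init that multi_init performs.
# The rendezvous settings below are placeholders, not OpenNMT-py's values.
import os
import torch
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder
os.environ["MASTER_PORT"] = "29500"       # placeholder

print(torch.cuda.is_available())  # True in my interactive session above

# This is the call that fails with "ProcessGroupNCCL is only supported with
# GPUs, no GPUs found!" when the calling process cannot see a CUDA device.
dist.init_process_group(backend="nccl", world_size=1, rank=0)

What puzzles me is that the interactive check at the top succeeds in the parent shell, yet the worker process that train.py spawns apparently ends up with no visible CUDA device.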