Unable to use -save_checkpoint_steps with GPU enabled

add · February 11, 2019, 1:46pm

Hi !

On OpenNMT-py, the `-save_checkpoint_steps` option does not work when the `-gpu_ranks 1` option is enabled. If I switch to plain CPU (by removing the `-gpu_ranks` option), it does work.

Can someone suggest how to make it work with the GPU? Training on the CPU takes a really long time.

Thanks!

guillaumekln · February 11, 2019, 1:56pm

Hi,

You should select the device ID with CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=1 python train.py [...] -gpu_ranks 0

add · February 11, 2019, 2:12pm

Thank you for the super-quick response.

When I do as you suggested, I get the error below:
-----------------------------------------------
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=34 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "train.py", line 120, in <module>
    main(opt)
  File "train.py", line 51, in main
    single_main(opt, 0)
  File "/home/ubuntu/algo/amit/workspace/OpenNMT-py/onmt/train_single.py", line 79, in main
    opt = training_opt_postprocessing(opt, device_id)
  File "/home/ubuntu/algo/amit/workspace/OpenNMT-py/onmt/train_single.py", line 72, in training_opt_postprocessing
    torch.cuda.set_device(device_id)
  File "/home/ubuntu/anaconda3/envs/py36-th/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/torch/csrc/cuda/Module.cpp:34
--------------------------------------------------------
I do have a working GPU, so something seems to be amiss. I can tell because when I run the training in OpenNMT-py with -gpu_ranks 1, it reports that it's using the GPU, and the training goes pretty fast (about 10 times faster).

Thanks!

guillaumekln · February 11, 2019, 2:14pm

Do you have a single GPU? Then just use -gpu_ranks 0.
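
For readers wondering why CUDA_VISIBLE_DEVICES=1 failed on a one-GPU machine, here is a rough sketch of how that variable selects devices. It is a simplified model of the documented CUDA behaviour, not OpenNMT-py or PyTorch code: `visible_devices` and `resolve_device` are hypothetical helper names, only numeric indices are handled (UUID-style entries are ignored), and the error strings are just illustrative. GPU indices start at 0, so a single GPU is device 0, and asking for device 1 leaves the process with no visible GPU at all.

```python
def visible_devices(physical_count, cuda_visible_devices=None):
    """Return the physical GPU ids a process would see, in logical order.

    Simplified model (assumption): numeric indices only. Parsing stops at the
    first invalid index, so only the entries before it stay visible.
    """
    if cuda_visible_devices is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))
    visible = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            break  # invalid index: ignore it and everything after it
        visible.append(int(token))
    return visible


def resolve_device(logical_id, physical_count, cuda_visible_devices=None):
    """Map a logical device id (what the program asks for) to a physical GPU."""
    devices = visible_devices(physical_count, cuda_visible_devices)
    if not devices:
        raise RuntimeError("no CUDA-capable device is detected")
    if logical_id >= len(devices):
        raise RuntimeError("invalid device ordinal")
    return devices[logical_id]


if __name__ == "__main__":
    # One physical GPU, variable unset: logical device 0 is physical GPU 0.
    print(resolve_device(0, physical_count=1))
    # Two physical GPUs, CUDA_VISIBLE_DEVICES=1: logical 0 is physical GPU 1.
    print(resolve_device(0, physical_count=2, cuda_visible_devices="1"))
    # One physical GPU, CUDA_VISIBLE_DEVICES=1: nothing is visible.
    try:
        resolve_device(0, physical_count=1, cuda_visible_devices="1")
    except RuntimeError as err:
        print("error:", err)
```

Under this model, the CUDA_VISIBLE_DEVICES=1 form only makes sense on a machine with at least two GPUs, where it exposes the second GPU to the process as device 0. With a single GPU, leaving the variable unset and passing -gpu_ranks 0 is enough.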

add · February 11, 2019, 2:17pm

Great! Thanks a lot! That worked like a charm.