I am using OpenNMT on a cluster managed by Slurm. Most of the GPUs are NVIDIA Pascal cards; some are Maxwell. Slurm assigns jobs to different devices depending on availability. If I take a model trained on one GPU architecture and test it on another, it throws an “invalid device function” error:
cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9435/cutorch/lib/THC/generic/THCTensorMath.cu:35
I have not faced this issue with other Torch applications or with Theano. Does OpenNMT restrict training and testing to the same GPU architecture? Is there a workaround?
I am using the latest Torch version, pulled on January 15th, and the OpenNMT version pulled on January 16th.
Did you copy the Torch installation from one machine to another?
No. I installed Torch on a head node, which links to a common CUDA installation shared by all the devices. I am not facing this error with other applications that use CUDA, e.g. the Python version of OpenNMT.
It seems to be a problem with torch/cudnn, as @guillaumekln suggests.
You can find a discussion about a related issue here; maybe reinstalling Torch can solve the problem.
However, I would first check the CUDA_VISIBLE_DEVICES variable, which you can set in bash before launching the job.
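As a minimal sketch of that check: the snippet below prints the current value and then restricts the shell to one device. The value 0 is only an illustrative assumption. On a Slurm cluster the scheduler usually sets this variable itself when GPUs are requested, so inspecting it is often more useful than overriding it.

```bash
# Show what the current shell exposes to CUDA (unset normally means "all GPUs").
echo "before: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

# Expose only the first physical GPU to processes started from this shell.
# The value 0 is an assumption for illustration; use the device you were given.
export CUDA_VISIBLE_DEVICES=0
echo "after:  CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```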
This error can also come from a bad identifier used to refer to a GPU. For instance, -gpuid 0 will not work with a GPU model, because onmt interprets 0 as a request to run on the CPU.
Something similar happens when you have 4 GPUs but use -gpuid 5, because onmt won’t be able to find the indicated GPU.
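The two cases above can be sketched as a small pre-flight check. Here check_gpuid is a hypothetical helper written for illustration, not part of OpenNMT, and it assumes the 1-indexed -gpuid convention in which 0 selects the CPU.

```bash
# check_gpuid is a hypothetical helper, not an OpenNMT command.
# $1 = value intended for -gpuid, $2 = number of GPUs visible on the node.
check_gpuid() {
  id=$1
  count=$2
  if [ "$id" -eq 0 ]; then
    echo "cpu"        # 0 is read as "run on the CPU"
  elif [ "$id" -ge 1 ] && [ "$id" -le "$count" ]; then
    echo "gpu"        # a valid 1-indexed GPU identifier
  else
    echo "invalid"    # no such GPU on this node
  fi
}

check_gpuid 0 4
check_gpuid 1 4
check_gpuid 5 4
```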
I hope these ideas help.
If you have a single Torch installation, you should target multiple architectures by setting the following environment variable before compiling Torch:
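The variable in question is most likely TORCH_CUDA_ARCH_LIST, which the cutorch CMake build reads to decide which GPU code to generate. The sketch below assumes that name and assumes the architecture list matches this cluster (Maxwell and Pascal). The rebuild commands in the comment assume a standard torch/distro checkout in ~/torch.

```bash
# Assumption: the Torch/cutorch build reads TORCH_CUDA_ARCH_LIST.
# The value lists the architectures present on this cluster.
export TORCH_CUDA_ARCH_LIST="Maxwell Pascal"
echo "TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}"

# Then rebuild Torch from its source checkout, for example (path is an assumption):
#   cd ~/torch && ./clean.sh && ./install.sh
```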
Otherwise, only the code of the detected architecture will be generated. Other frameworks like PyTorch ship binaries compiled for all common architectures.
I hope this helps solve your issue.