Error while training/testing on different GPU architectures

pdakwale · January 24, 2018, 11:16am

I am using OpenNMT on a cluster managed by Slurm. Most of the GPUs are NVIDIA Pascal cards; some are Maxwell. Slurm assigns jobs to whichever devices are available. If I test a model on a different GPU architecture from the one it was trained on, it throws an “invalid device function” error:

cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9435/cutorch/lib/THC/generic/THCTensorMath.cu:35

I have not faced this issue with other Torch applications or with Theano. Does OpenNMT require training and testing to run on the same GPU architecture? Is there a workaround?
I am using the latest Torch version, pulled on January 15th, and the OpenNMT version pulled on January 16th.

guillaumekln · January 24, 2018, 11:23am

Did you copy the Torch installation from one machine to another?

pdakwale · January 24, 2018, 11:50am

No. I installed Torch on a head node, which links to a common CUDA repository shared across all the devices. I do not get this error with other applications that use CUDA, e.g. the Python version of OpenNMT.

emartinezVic · January 24, 2018, 12:18pm

It seems to be a problem with Torch/cuDNN, as @guillaumekln suggests.
You can find a discussion about a related issue here; maybe reinstalling Torch can solve the problem.

However, I would first check the CUDA_VISIBLE_DEVICES variable.
You can set it in bash:

export CUDA_VISIBLE_DEVICES=0,1,2,3

This error can be related to a bad identifier being used to refer to a GPU. For instance, -gpuid 0 will not work when you want to run on a GPU, because OpenNMT interprets 0 as a request to run on the CPU.
Something similar happens when you have 4 GPUs but pass -gpuid 5, because OpenNMT won’t be able to find the indicated GPU.
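The identifier check described above can be sketched in a few lines of shell. This is only an illustration: `count_visible_gpus` and `check_gpuid` are hypothetical helpers, not part of OpenNMT, and the sketch assumes that -gpuid is 1-indexed (with 0 meaning CPU) and counts only the devices listed in CUDA_VISIBLE_DEVICES.

```shell
# Hypothetical helper: count the entries in a comma-separated
# CUDA_VISIBLE_DEVICES value such as "0,1,2,3".
# The body runs in a subshell, so IFS and positional parameters
# are not changed for the caller.
count_visible_gpus() (
    if [ -z "$1" ]; then
        echo 0
        exit 0
    fi
    set -f          # disable globbing while splitting
    IFS=','
    set -- $1       # split the list on commas
    echo "$#"
)

# Hypothetical helper: classify a -gpuid value against the visible devices.
# Assumption: -gpuid is 1-indexed and 0 selects the CPU.
check_gpuid() {
    gpuid=$1
    visible=$(count_visible_gpus "$2")
    if [ "$gpuid" -eq 0 ]; then
        echo "cpu"
    elif [ "$gpuid" -ge 1 ] && [ "$gpuid" -le "$visible" ]; then
        echo "ok"
    else
        echo "out-of-range"
    fi
}

check_gpuid 0 "0,1,2,3"   # 0 selects the CPU
check_gpuid 5 "0,1,2,3"   # only 4 devices are visible
```

Running a check like this at the start of a Slurm job script, before launching training or translation, would rule out a bad identifier as the cause of the error.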

Hope those ideas can help you.

guillaumekln · January 24, 2018, 2:48pm

If you have a single Torch installation, you should target multiple architectures by setting the following environment variable before compiling Torch:

TORCH_CUDA_ARCH_LIST="Maxwell;Pascal"

Otherwise, code is generated only for the architecture detected at build time. Other frameworks like PyTorch ship binaries compiled for all common architectures.
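As a rough sketch of how this could look in practice, the snippet below sets the variable and shows where a rebuild would go. `join_arch_list` is a hypothetical helper used only for illustration. The commented-out rebuild step assumes Torch was installed from the torch/distro checkout in ~/torch, so adjust it to your own setup.

```shell
# Hypothetical helper: join architecture names with ';', the separator
# used by TORCH_CUDA_ARCH_LIST. Runs in a subshell so IFS is not
# changed for the caller.
join_arch_list() (
    IFS=';'
    printf '%s\n' "$*"
)

# Target both architectures present on the cluster.
TORCH_CUDA_ARCH_LIST="$(join_arch_list Maxwell Pascal)"
export TORCH_CUDA_ARCH_LIST
echo "TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST"

# Assumption: Torch lives in ~/torch (a torch/distro checkout).
# Rebuild from a clean state so the CUDA kernels are recompiled
# with the variable above in the environment:
# cd ~/torch && ./clean.sh && ./install.sh
```

The variable has to be set in the shell that runs the build. Setting it only at run time has no effect on binaries that were already compiled.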

I hope this helps solve your issue.