OpenNMT

CUDA error: no kernel image is available for execution on the device

Hi,

I’m trying to run OpenNMT-py on an RTX 3090 from vast.ai and getting a CUDA error:

Traceback (most recent call last):
  File "/home/argosopentech/env/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')()
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 172, in main
    train(opt)
  File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 157, in train
    train_process(opt, device_id=0)
  File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 109, in main
    trainer.train(
  File "/home/argosopentech/OpenNMT-py/onmt/trainer.py", line 224, in train
    for i, (batches, normalization) in enumerate(
  File "/home/argosopentech/OpenNMT-py/onmt/trainer.py", line 166, in _accum_batches
    num_tokens = batch.tgt[1:, :, 0].ne(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I’m using an nvidia/cuda:11.3.0-devel-ubuntu20.04 Docker container and installing OpenNMT-py from source:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:47:00.0 Off |                  N/A |
|  0%   24C    P8    18W / 375W |      1MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The issue looks related to this PyTorch issue, but I’m using a newer graphics card than the people in that issue. OpenNMT-py requires torch>=1.6.0, and the newest version of torch is 1.9.0. Could that be the issue?
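This error usually means the installed torch wheel was not compiled with kernels for the GPU's compute capability (an RTX 3090 is capability 8.6, i.e. sm_86). You can inspect this at runtime with `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`. Here is a minimal sketch of that comparison; the `build_supports_device` helper is hypothetical, and the two arch lists are illustrative assumptions rather than values read from real wheels:

```python
def build_supports_device(arch_list, capability):
    """Return True if the torch build's compiled arch list covers the device.

    arch_list:  like torch.cuda.get_arch_list(), e.g. ['sm_70', 'sm_75']
    capability: like torch.cuda.get_device_capability(), e.g. (8, 6)
    """
    wanted = "sm_%d%d" % capability
    return wanted in arch_list

# Illustrative assumption: a default (CUDA 10.x) wheel without Ampere kernels
# versus a +cu111 wheel that includes sm_80/sm_86.
default_archs = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
cu111_archs = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]

rtx_3090 = (8, 6)
print(build_supports_device(default_archs, rtx_3090))  # False -> "no kernel image"
print(build_supports_device(cu111_archs, rtx_3090))    # True
```

On a live machine the same check is one line: compare `torch.cuda.get_arch_list()` against `torch.cuda.get_device_capability(0)`.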

I do not know if updating PyTorch will help, but in case it does, here are the versions I have:

torch==1.9.0+cu111
torchaudio==0.9.0
torchtext==0.5.0
torchvision==0.10.0+cu111

In my case, I got a similar error around this, but I do not remember in which file. I solved it by simply changing the problematic line.


That was it thanks!

Looking at the PyTorch version selector, it looks like the default version is built for CUDA 10 while I had CUDA 11. Installing the matching CUDA 11 build of torch fixes the issue:

pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
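After reinstalling, one quick sanity check is the local version suffix in `torch.__version__`: CUDA-specific wheels carry a `+cuXXX` tag (e.g. `1.9.0+cu111`), while the default wheel has none. A small sketch of that check; `cuda_tag` is a hypothetical helper, and in a live environment you would pass it `torch.__version__`:

```python
def cuda_tag(version):
    """Return the build tag after '+', or None for a default wheel.

    e.g. "1.9.0+cu111" -> "cu111", "1.9.0" -> None
    """
    if "+" in version:
        return version.split("+", 1)[1]
    return None

print(cuda_tag("1.9.0+cu111"))  # cu111 -> CUDA 11.1 build, includes sm_86
print(cuda_tag("1.9.0"))        # None  -> default wheel
```

`torch.version.cuda` gives the same information from the other direction: the CUDA toolkit version the wheel was built against.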