OpenNMT-py starts training on GPU, then switches to CPU

OpenNMT-py started training successfully: it reached 300 steps with nvidia-smi showing GPU utilization above 90% and each 100-step interval taking a few minutes. Then, with no error message, GPU utilization drops to 0%, only one CPU core is in use, and training stops making progress.
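As a sanity check (run from a separate Python shell in the same environment, not the training process itself, just to rule out the driver disappearing), PyTorch can still be asked whether it sees the card:

import torch

print(torch.cuda.is_available())      # should print True if the driver/runtime are still healthy
print(torch.cuda.get_device_name(0))  # Tesla K80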

Logs

[2021-01-30 14:43:32,647 INFO] encoder: 44517376
[2021-01-30 14:43:32,647 INFO] decoder: 25275220
[2021-01-30 14:43:32,647 INFO] * number of parameters: 69792596
[2021-01-30 14:43:32,741 INFO] Starting training on GPU: [0]
[2021-01-30 14:43:32,741 INFO] Start training loop and validate every 500 steps...
[2021-01-30 14:43:32,741 INFO] corpus_1's transforms: TransformPipe()
[2021-01-30 14:43:32,741 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-01-30 14:47:43,901 INFO] Step 100/ 1000; acc:  36.75; ppl: 4188.86; xent: 8.34; lr: 0.00001; 3478/1381 tok/s;    251 sec
[2021-01-30 14:52:02,880 INFO] Step 200/ 1000; acc:  55.53; ppl: 866.91; xent: 6.76; lr: 0.00002; 3359/1463 tok/s;    510 sec
[2021-01-30 14:56:09,505 INFO] Step 300/ 1000; acc:  70.45; ppl: 118.07; xent: 4.77; lr: 0.00004; 3557/1413 tok/s;    757 sec

GPU Status

$ nvidia-smi
Sat Jan 30 15:17:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    76W / 149W |   9309MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       981      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A      1065      G   /usr/bin/gnome-shell                3MiB |
|    0   N/A  N/A      1995      C   /usr/bin/python3                 9291MiB |
+-----------------------------------------------------------------------------+
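To confirm utilization really stays at 0% rather than just dipping between batches, a rough polling loop like this (it assumes the pynvml package is installed; watching nvidia-smi by hand shows the same numbers) logs utilization and memory once a second:

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the K80
try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"{time.strftime('%H:%M:%S')}  gpu={util.gpu}%  mem={mem.used // 2**20}MiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()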

CPU

top - 15:18:56 up 41 min,  3 users,  load average: 1.00, 1.00, 1.02
Tasks: 213 total,   2 running, 211 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.5 us,  0.0 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  30100.1 total,  25031.0 free,   3176.8 used,   1892.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  26525.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND            
   1995 argosop+  20   0   57.2g   2.8g 392248 R 100.3   9.6  35:51.38 onmt_train         
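Since one core is pegged at 100%, the next step is getting a Python stack trace out of the stuck process. A hypothetical launcher like the one below (it assumes onmt_train's entry point is onmt.bin.train.main, which is what the console script points to in OpenNMT-py 2.x) would let me send kill -USR1 1995 and get a traceback of every thread on stderr; py-spy dump --pid 1995 should give the same information without changing how training is launched.

# Hypothetical launcher: pass the same command-line arguments as onmt_train,
# but register faulthandler so the process dumps all thread tracebacks to
# stderr when it receives SIGUSR1.
import faulthandler
import signal

from onmt.bin.train import main  # assumption: the onmt_train entry point

faulthandler.register(signal.SIGUSR1, all_threads=True)

if __name__ == "__main__":
    main()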

Code

https://github.com/argosopentech/onmt-models/tree/d048dac8583f86af12cb9cc820ccc7b12a5f816d

I tried downgrading the CUDA version using the nvidia/cuda:10.2-runtime-ubuntu18.04 Docker container and ran into different unexpected behavior. I think this is a PyTorch issue rather than an OpenNMT issue.

Posted on PyTorch forum: https://discuss.pytorch.org/t/errors-running-opennmt-py-with-torch-1-6-0-on-tesla-k80/110636
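For comparing environments (the Docker containers vs. the host), something like this prints what each PyTorch build was compiled against, since the K80 is an older compute capability 3.7 card. It is just a version dump; torch.cuda.get_arch_list only exists in newer PyTorch releases, hence the guard:

import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0),
          torch.cuda.get_device_capability(0))    # the K80 reports (3, 7)
if hasattr(torch.cuda, "get_arch_list"):          # only in newer PyTorch builds
    print("compiled arches:", torch.cuda.get_arch_list())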