I started training with OpenNMT-py and it ran normally through step 300: nvidia-smi showed GPU utilization above 90%, and each 100-step progress report took about four minutes. Then, with no error message, GPU utilization dropped to 0%, only one CPU core stayed busy, and training stopped making progress.
Logs
[2021-01-30 14:43:32,647 INFO] encoder: 44517376
[2021-01-30 14:43:32,647 INFO] decoder: 25275220
[2021-01-30 14:43:32,647 INFO] * number of parameters: 69792596
[2021-01-30 14:43:32,741 INFO] Starting training on GPU: [0]
[2021-01-30 14:43:32,741 INFO] Start training loop and validate every 500 steps...
[2021-01-30 14:43:32,741 INFO] corpus_1's transforms: TransformPipe()
[2021-01-30 14:43:32,741 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-01-30 14:47:43,901 INFO] Step 100/ 1000; acc: 36.75; ppl: 4188.86; xent: 8.34; lr: 0.00001; 3478/1381 tok/s; 251 sec
[2021-01-30 14:52:02,880 INFO] Step 200/ 1000; acc: 55.53; ppl: 866.91; xent: 6.76; lr: 0.00002; 3359/1463 tok/s; 510 sec
[2021-01-30 14:56:09,505 INFO] Step 300/ 1000; acc: 70.45; ppl: 118.07; xent: 4.77; lr: 0.00004; 3557/1413 tok/s; 757 sec
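To put numbers on the stall, here is a minimal Python sketch that reads the progress lines above and reports how long each 100-step interval took and how long the log has been silent. The helper names (`parse_steps`, `step_intervals`, `seconds_since_last_step`) are made up for illustration and are not part of OpenNMT-py; the regular expression only assumes the log format shown in this post.

```python
import re
from datetime import datetime

# Matches progress lines in the format shown in this post, e.g.
# [2021-01-30 14:47:43,901 INFO] Step 100/ 1000; acc: ...
STEP_LINE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO\] Step\s+(\d+)/"
)

# The three progress lines from the log above (milliseconds are ignored).
LOG = """\
[2021-01-30 14:47:43,901 INFO] Step 100/ 1000; acc: 36.75; ppl: 4188.86; xent: 8.34; lr: 0.00001; 3478/1381 tok/s; 251 sec
[2021-01-30 14:52:02,880 INFO] Step 200/ 1000; acc: 55.53; ppl: 866.91; xent: 6.76; lr: 0.00002; 3359/1463 tok/s; 510 sec
[2021-01-30 14:56:09,505 INFO] Step 300/ 1000; acc: 70.45; ppl: 118.07; xent: 4.77; lr: 0.00004; 3557/1413 tok/s; 757 sec
"""


def parse_steps(log_text):
    """Return a list of (step, timestamp) pairs, one per progress line."""
    steps = []
    for match in STEP_LINE.finditer(log_text):
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        steps.append((int(match.group(2)), stamp))
    return steps


def step_intervals(steps):
    """Seconds elapsed between each pair of consecutive progress lines."""
    return [(b[1] - a[1]).total_seconds() for a, b in zip(steps, steps[1:])]


def seconds_since_last_step(log_text, now):
    """Seconds between the last progress line and `now`, or None if no line matched."""
    steps = parse_steps(log_text)
    if not steps:
        return None
    return (now - steps[-1][1]).total_seconds()


if __name__ == "__main__":
    steps = parse_steps(LOG)
    print("interval seconds:", step_intervals(steps))
    # Timestamp taken from the nvidia-smi output in the GPU Status section.
    snapshot = datetime(2021, 1, 30, 15, 17, 50)
    print("silent for:", seconds_since_last_step(LOG, snapshot), "seconds")
```

On this excerpt the intervals come out to 259 and 247 seconds, while the gap between step 300 and the nvidia-smi snapshot is 1301 seconds, roughly five times a normal interval.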
GPU Status
$ nvidia-smi
Sat Jan 30 15:17:50 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    76W / 149W |   9309MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       981      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A      1065      G   /usr/bin/gnome-shell                3MiB |
|    0   N/A  N/A      1995      C   /usr/bin/python3                 9291MiB |
+-----------------------------------------------------------------------------+
CPU
top - 15:18:56 up 41 min, 3 users, load average: 1.00, 1.00, 1.02
Tasks: 213 total, 2 running, 211 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 30100.1 total, 25031.0 free, 3176.8 used, 1892.3 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 26525.9 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1995 argosop+  20   0   57.2g   2.8g 392248 R 100.3   9.6  35:51.38 onmt_train
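Since PID 1995 is in state R at about 100% CPU while the GPU sits idle, it would help to know where in the Python code the process is spending its time. A minimal sketch of one way to find out, using the standard-library `faulthandler` module: it assumes a Unix system and that a few lines can be added near the top of the training entry point. The function name `install_stack_dump` and the log path are hypothetical, and this only shows where the process is stuck; it does not fix anything.

```python
import faulthandler
import signal


def install_stack_dump(path, signum=signal.SIGUSR1):
    """Append the Python stack of every thread to `path` whenever `signum` arrives.

    Returns the open file object; it must stay open for as long as the
    handler is registered, because faulthandler writes to its descriptor.
    """
    out = open(path, "a")
    faulthandler.register(signum, file=out, all_threads=True)
    return out


if __name__ == "__main__":
    # Hypothetical placement: call this once before training starts.
    dump_file = install_stack_dump("/tmp/onmt_stacks.log")
    # ... training would run here ...
```

With that in place, running `kill -USR1 1995` from another shell while the job is hung should append the current tracebacks to the log file without stopping the process.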
Code
https://github.com/argosopentech/onmt-models/tree/d048dac8583f86af12cb9cc820ccc7b12a5f816d