Hi everyone, I have run into a strange problem. I want to train an English-to-Chinese seq2seq model with the Lua version of OpenNMT. At first I trained it with all default parameters and everything worked fine.
However, when encoder_type is set to brnn, the training log stops being written after a while, even though top and nvidia-smi show that the CPU and GPU are still busy.
I have tried to 1) stop and then resume training, 2) pull the latest code from GitHub, and 3) change the brnn_merge parameter from the default sum to concat. The problem persists.
Can anyone help me solve this? Any advice or guidance would be greatly appreciated.
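For reference, the run is launched roughly like this (the data and model paths are placeholders; for item 3 above I simply add -brnn_merge concat, otherwise everything is left at its default):

th train.lua -data demo-train.t7 -save_model demo-brnn \
  -encoder_type brnn -gpuid 1 2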
If brnn_merge is sum, logging stops after about an hour of training.
If brnn_merge is set to concat, logging stops after about 5 hours.
My machine has 32GB of RAM and runs only this training job. I am not sure whether that is enough for brnn training.
However, the machine is running another training job right now. As soon as I can, I will train a new brnn model again, this time using -log_file instead of shell redirection, and monitor the memory usage.
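Concretely, I plan to rerun something along these lines and keep a rough memory log on the side (file names are placeholders; the loop is just a crude once-a-minute snapshot of host memory and GPU state):

th train.lua -data demo-train.t7 -save_model demo-brnn \
  -encoder_type brnn -gpuid 1 2 -log_file train_brnn.log

# in a second terminal: append host memory and GPU status to a file every minute
while true; do date; free -m; nvidia-smi; sleep 60; done >> memory_monitor.log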
$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger
$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger
After those two warnings, the log stops producing any output. The machine still had 22GB of free memory at that point. However, I found the output of the nvidia-smi command abnormal:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 0000:01:00.0 Off | N/A |
| 51% 66C P2 80W / 250W | 2107MiB / 11171MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 0000:02:00.0 Off | N/A |
| 31% 44C P8 9W / 250W | 2107MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14647 C /home/yuchang/torch/install/bin/luajit 2105MiB |
| 1 14647 C /home/yuchang/torch/install/bin/luajit 2105MiB |
+-----------------------------------------------------------------------------+
As you can see, one GPU is at 100% utilization while the other is at 0%. I tried this twice, and both times the GPUs stayed in that state after the logging stopped. I hope this gives you some clue for fixing the problem.
I have observed a similar problem. I get the same warnings, and then the jobs quit after a certain number of epochs without any error messages. I am using the pdbrnn encoder with the -log_file option, not redirecting output to a file.
The jobs generally quit after about 4.5 hours of training. The machine has 128GB of RAM and 11GB of GPU memory. I have been working with the same data for a while, and 11GB of GPU memory has been sufficient in the past, so I am not sure the RAM is the problem.
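For completeness, my run is launched roughly like this (data and model paths are placeholders and other hyperparameters are omitted):

th train.lua -data my-train.t7 -save_model my-model \
  -encoder_type pdbrnn -gpuid 1 -log_file train.log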