Training log stops being written when encoder_type is set to brnn

Hi everyone, I have run into a strange problem. I want to train an English-to-Chinese seq2seq model with the Lua version of OpenNMT. At first, I trained it with all default parameters and everything worked fine.

However, when encoder_type is set to brnn, the training log stops being written after training for a while, even though top and nvidia-smi show that the CPU and GPU are still busy.

I have tried 1) cancelling and then continuing training, 2) pulling the latest code from GitHub, and 3) changing the brnn_merge parameter from the default sum to concat. The problem still exists.
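For reference, I resumed training roughly like this (placeholders in angle brackets; I believe -train_from and -continue are the options for resuming, but please correct me if not):

th train.lua -data <data>-train.t7 -save_model <model_prefix> -train_from <last_checkpoint>.t7 -continue -gpuid 1 2 -encoder_type brnn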

Can anyone help me solve this? Any advice or guidance would be greatly appreciated.

Hello,

Could you share the full command line you ran for this training?

Thanks for your quick reply! This is the command I used for training:

th train.lua -data /data/prepare_data/bpe_piece/en_cn_bpe_piece-train.t7 -save_model /data/models/bpe_piece_birnn/en_cn_bpe_piece_birnn -gpuid 1 2 -encoder_type brnn -brnn_merge concat > /data/logs/bpe_piece_birnn/en_cn_bpe_piece_birnn_train.log

The brnn_merge parameter was not added at first.

How long does it run before this happens, in your experience? Is your RAM full at that point?

Also, did you try without the output redirection to a file?

With brnn_merge set to sum, it stops logging after training for about an hour.

With brnn_merge set to concat, the log stopped after about 5 hours.

My machine has 32GB of RAM and runs only this training. I am not sure whether that is enough for brnn training.

However, the machine is running another training right now. As soon as possible, I will train a new brnn model again, using -log_file instead of redirection, and monitor the memory usage.
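Something like this is what I have in mind (paths are placeholders; I am assuming the -log_file option of train.lua, plus a simple memory watch in a second terminal):

th train.lua -data <data>-train.t7 -save_model <model_prefix> -gpuid 1 2 -encoder_type brnn -log_file <log_dir>/train.log

watch -n 60 free -h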

Thanks for your advice!

Hi, I tried brnn again today. This is the command I used:

th train.lua -data /data/prepare_data/word_bpe/en_cn_word_bpe-train.t7 -save_model /data/models/word_bpe_brnn/en_cn_word_bpe_brnn -gpuid 1 2 -encoder_type brnn > /data/logs/word_bpe_brnn/en_cn_word_bpe_brnn_train.log

I got this warning:

$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger
$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger

This is the logging output:

[10/04/17 17:31:39 INFO] Using GPU(s): 1, 2
[10/04/17 17:31:39 INFO] Training Sequence to Sequence with Attention model...
[10/04/17 17:31:39 INFO] Loading data from '/data/prepare_data/word_bpe/en_cn_word_bpe-train.t7'...
[10/04/17 17:35:23 INFO]  * vocabulary size: source = 30283; target = 50004
[10/04/17 17:35:23 INFO]  * additional features: source = 0; target = 0
[10/04/17 17:35:23 INFO]  * maximum sequence length: source = 50; target = 51
[10/04/17 17:35:23 INFO]  * number of training sentences: 9831070
[10/04/17 17:35:23 INFO]  * number of batches: 153638
[10/04/17 17:35:23 INFO]    - source sequence lengths: equal
[10/04/17 17:35:23 INFO]    - maximum size: 64
[10/04/17 17:35:23 INFO]    - average size: 63.99
[10/04/17 17:35:23 INFO]    - capacity: 100.00%
[10/04/17 17:35:23 INFO] Building model...
[10/04/17 17:35:23 INFO]  * Encoder:
[10/04/17 17:35:24 INFO]    - word embeddings size: 500
[10/04/17 17:35:24 INFO]    - type: bidirectional RNN
[10/04/17 17:35:24 INFO]    - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[10/04/17 17:35:24 INFO]  * Decoder:
[10/04/17 17:35:25 INFO]    - word embeddings size: 500
[10/04/17 17:35:25 INFO]    - attention: global (general)
[10/04/17 17:35:25 INFO]    - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[10/04/17 17:35:25 INFO]  * Bridge: copy
[10/04/17 17:35:27 INFO] Initializing parameters...
[10/04/17 17:35:28 INFO]  * number of parameters: 78969504
[10/04/17 17:35:28 INFO] Preparing memory optimization...
[10/04/17 17:35:28 INFO]  * sharing 70% of output/gradInput tensors memory between clones
[10/04/17 17:35:31 INFO] Start training from epoch 1 to 13...
[10/04/17 17:35:31 INFO] 
[10/04/17 17:36:05 INFO] Epoch 1 ; Iteration 50/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 3147 ; Perplexity 623284.80
[10/04/17 17:36:19 INFO] Epoch 1 ; Iteration 100/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5517 ; Perplexity 193225.77
......
[10/04/17 18:11:53 INFO] Epoch 1 ; Iteration 7200/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5756 ; Perplexity 70.26
[10/04/17 18:12:08 INFO] Epoch 1 ; Iteration 7250/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5502 ; Perplexity 60.40
[10/04/17 18:12:23 INFO] Epoch 1 ; Iteration 7300/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5320 ; Perplexity 57.61
[10/04/17 18:12:38 INFO] Epoch 1 ; Iteration 7350/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5673 ; Perplexity 62.43
[10/04/17 18:12:53 INFO] Epoch 1 ; Iteration 7400/76819 ; Optim SGD LR 1.000000 ; Source tokens/s 5721 ; Perplexity 69.73 

Then it stopped logging. The machine still had 22GB of free memory at that time. However, I found the output of nvidia-smi to be abnormal:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 0000:01:00.0     Off |                  N/A |
| 51%   66C    P2    80W / 250W |   2107MiB / 11171MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 0000:02:00.0     Off |                  N/A |
| 31%   44C    P8     9W / 250W |   2107MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     14647    C   /home/yuchang/torch/install/bin/luajit        2105MiB |
|    1     14647    C   /home/yuchang/torch/install/bin/luajit        2105MiB |
+-----------------------------------------------------------------------------+

As you can see, one GPU is at 100% utilization while the other is at 0%. I tried twice, and both times the GPUs stayed in that state after the logging output stopped. I hope this gives you some clue for fixing the problem.

Hi,

I have observed a similar problem. I get the same warnings, and then the jobs quit after a certain number of epochs without any error message. I am using the pdbrnn encoder with the -log_file option, not redirecting output to a file.

The jobs generally quit after about 4.5 hours of training. The machine has 128GB of RAM and 11GB of GPU memory. I have been working with the same data for a while, and 11GB of GPU memory has been sufficient in the past, so I am not sure memory is the problem.

Are you also training with multiple GPUs?

No, with a single GPU

Strange, I have never heard of or encountered this issue before.

What Torch version are you running?

cd ~/torch
git rev-parse HEAD

CUDA version?
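For CUDA, if nvcc is on your PATH, this should print it:

nvcc --version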

torch version: 9c2ef7f185c682ea333e06654cb6e5b67dfe7cd2
CUDA version: CUDA 8.0 (installed from the local deb package)

Do I need to update to the latest Torch version?

By the way, is there any way to disable the warning when using train.lua with the -log_file parameter? It is a bit annoying.

Hi @huache, there is a way to set the level of logs that are output; you can find it in the logger options.
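For example, something like this should raise the threshold (if I recall the option name correctly it is -log_level; note that it only affects messages that go through OpenNMT's own logger):

th train.lua <your other options> -log_file train.log -log_level ERROR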

Hi Guillaume, my torch version is: 5beb83c46e91abd273c192a3fa782b62217072a6
and CUDA version is 8.0

Let me note that I haven't observed this issue since then. It could have been a one-time memory overflow. I will let you know if it happens again.

Thank you for your advice.

Well, it seems that the warning I want to disable is output by Torch, not OpenNMT, because it goes to the screen rather than to the log_file I set.

However, it did not appear before, and I am not sure which OpenNMT update triggered this warning.