Hi everyone, I have run into a strange problem. I want to train an English-to-Chinese seq2seq model with the Lua version of OpenNMT. At first I trained it with all default parameters and everything worked fine.
However, when encoder_type is set to brnn, the training log stops being written after a while, even though top and nvidia-smi show that the CPU and GPU are still busy.
I have tried to 1) stop and then resume training, 2) pull the latest code from GitHub, and 3) change the brnn_merge parameter from the default sum to concat. The problem persists.
Can anyone help me solve this? Any advice or guidance would be greatly appreciated.
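For reference, the run is launched roughly like this (the data and model paths are placeholders; for item 3 above I simply add -brnn_merge concat, otherwise everything is left at its default):

th train.lua -data demo-train.t7 -save_model demo-brnn \
  -encoder_type brnn -gpuid 1 2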
If brnn_merge is sum, logging stops after about an hour of training.
If brnn_merge is set to concat, logging stops after about 5 hours.
My machine has 32GB of RAM and runs only this training job. I am not sure whether that is enough for brnn training.
However, the machine is running another training job right now. As soon as I can, I will train a new brnn model again, this time using -log_file instead of shell redirection, and monitor the memory usage.
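Concretely, I plan to rerun something along these lines and keep a rough memory log on the side (file names are placeholders; the loop is just a crude once-a-minute snapshot of host memory and GPU state):

th train.lua -data demo-train.t7 -save_model demo-brnn \
  -encoder_type brnn -gpuid 1 2 -log_file train_brnn.log

# in a second terminal: append host memory and GPU status to a file every minute
while true; do date; free -m; nvidia-smi; sleep 60; done >> memory_monitor.log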
$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger
$ Warning: cannot write object field <logFile> of <Logger> <?>.callback.globalLogger
After those two warnings, the log stops producing any output. The machine still had 22GB of free memory at that point. However, I found the output of the nvidia-smi command abnormal:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 0000:01:00.0 Off | N/A |
| 51% 66C P2 80W / 250W | 2107MiB / 11171MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 0000:02:00.0 Off | N/A |
| 31% 44C P8 9W / 250W | 2107MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 14647 C /home/yuchang/torch/install/bin/luajit 2105MiB |
| 1 14647 C /home/yuchang/torch/install/bin/luajit 2105MiB |
+-----------------------------------------------------------------------------+
As you can see, one GPU is at 100% utilization while the other is at 0%. I tried this twice, and both times the GPUs stayed in that state after the logging stopped. I hope this gives you some clue for fixing the problem.
I have observed a similar problem. I get the same warnings, and then the jobs quit after a certain number of epochs without any error messages. I am using the pdbrnn encoder with the -log_file option, not redirecting output to a file.
The jobs generally quit after about 4.5 hours of training. The machine has 128GB of RAM and 11GB of GPU memory. I have been working with the same data for a while, and 11GB of GPU memory has been sufficient in the past, so I am not sure the RAM is the problem.
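For completeness, my run is launched roughly like this (data and model paths are placeholders and other hyperparameters are omitted):

th train.lua -data my-train.t7 -save_model my-model \
  -encoder_type pdbrnn -gpuid 1 -log_file train.log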