Perplexity nan at epoch 5

saz0568 · October 10, 2017, 2:41am

I just run the command as the tutorial. And I set the vocab size to 200004. No more other perticular settings.
The training runs well at epoch 1 to epoch 4. Problem occurs when training epoch 5 at about 100000 iteration. Shown below:

th train.lua -gpuid 1 -data data/deepmega_2/demo-train.t7 -save_model model_20170930 -train_from model_20170930_epoch4_28.41.t7 -continue
[10/09/17 09:11:42 INFO] Using GPU(s): 1
[10/09/17 09:11:42 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[10/09/17 09:11:42 INFO] Training Sequence to Sequence with Attention model…
[10/09/17 09:11:42 INFO] Loading data from ‘data/deepmega_2/demo-train.t7’…
[10/09/17 09:15:48 INFO] * vocabulary size: source = 200008; target = 200008
[10/09/17 09:15:48 INFO] * additional features: source = 0; target = 0
[10/09/17 09:15:48 INFO] * maximum sequence length: source = 100; target = 101
[10/09/17 09:15:48 INFO] * number of training sentences: 10459873
[10/09/17 09:15:48 INFO] * number of batches: 163480
[10/09/17 09:15:48 INFO] - source sequence lengths: equal
[10/09/17 09:15:48 INFO] - maximum size: 64
[10/09/17 09:15:48 INFO] - average size: 63.98
[10/09/17 09:15:48 INFO] - capacity: 100.00%
[10/09/17 09:15:48 INFO] Loading checkpoint ‘model_20170930_epoch4_28.41.t7’…
[10/09/17 09:16:10 INFO] Resuming training from epoch 5 at iteration 1…
[10/09/17 09:16:14 INFO] Preparing memory optimization…
[10/09/17 09:16:14 INFO] * sharing 69% of output/gradInput tensors memory between clones
[10/09/17 09:16:14 INFO] Restoring random number generator states…
[10/09/17 09:16:14 INFO] Start training from epoch 5 to 13…
[10/09/17 09:16:14 INFO]
[10/09/17 09:17:27 INFO] Epoch 5 ; Iteration 50/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 1700 ; Perplexity 5.83
[10/09/17 09:18:04 INFO] Epoch 5 ; Iteration 100/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2917 ; Perplexity 6.71
…
…
[10/10/17 07:58:32 INFO] Epoch 5 ; Iteration 99950/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2918 ; Perplexity 6.39
[10/10/17 07:59:18 INFO] Epoch 5 ; Iteration 100000/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 3080 ; Perplexity 5.81
[10/10/17 07:59:18 INFO] Saving checkpoint to ‘model_20170930_checkpoint.t7’…
[10/10/17 08:00:16 INFO] Epoch 5 ; Iteration 100050/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2017 ; Perplexity 8.50
[10/10/17 08:00:56 INFO] Epoch 5 ; Iteration 100100/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2978 ; Perplexity 208.40
[10/10/17 08:01:42 INFO] Epoch 5 ; Iteration 100150/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2883 ; Perplexity 6707.49
[10/10/17 08:02:31 INFO] Epoch 5 ; Iteration 100200/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 3140 ; Perplexity 156793.42
[10/10/17 08:03:12 INFO] Epoch 5 ; Iteration 100250/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2989 ; Perplexity 321254.26
[10/10/17 08:03:50 INFO] Epoch 5 ; Iteration 100300/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2919 ; Perplexity 8289724.30
[10/10/17 08:04:30 INFO] Epoch 5 ; Iteration 100350/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2880 ; Perplexity 124739329.75
[10/10/17 08:05:11 INFO] Epoch 5 ; Iteration 100400/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2996 ; Perplexity 30809960.90
[10/10/17 08:05:52 INFO] Epoch 5 ; Iteration 100450/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 3027 ; Perplexity 63019598.33
[10/10/17 08:06:32 INFO] Epoch 5 ; Iteration 100500/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2957 ; Perplexity 50974644.64
[10/10/17 08:07:15 INFO] Epoch 5 ; Iteration 100550/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 3197 ; Perplexity nan
[10/10/17 08:07:53 INFO] Epoch 5 ; Iteration 100600/163480 ; Optim SGD LR 1.0000 ; Source tokens/s 2855 ; Perplexity nan

It seems the training crash at iteration 100000 after saving to checkpoint but with no more errror message. Could anyone tell me what the problem it is and how to solve it?

guillaumekln · October 10, 2017, 7:04am

Can you try with a lower learning rate? For example:

th train.lua -gpuid 1 -data data/deepmega_2/demo-train.t7 -save_model model_20170930 -train_from model_20170930_epoch4_28.41.t7 -start_epoch 5 -learning_rate 0.5

saz0568 · October 10, 2017, 7:58am

Sure. I just tried as your quote.
It will take about 23 hours running to iteration 100000. Up to now, it runs OK.
I’ll update the running process tomorrow.
thx.

saz0568 · October 12, 2017, 2:57am

After modifying the learning rate to 0.5, the program runs well.
It seems that the problem has been solved.
Thanks for your support!