NaN perplexity after a few minibatches in epoch 1

Hi, I’m using the pyramidal encoder (pdbrnn) as follows:

th preprocess.lua -data_type 'feattext' -train_src how-to-data/train.ark.txt -train_tgt how-to-data/labels.tr -valid_src how-to-data/cv.ark.txt -valid_tgt how-to-data/labels.cv -save_data how-to-data/training_feats -idx_files -src_seq_length 2435 -tgt_seq_length 245 > log/preprocess_training_feats_src2435_tgt245 &

DATA=how-to-data;MODEL=models;LOGOUT=log;CUDAN=0;LAYER=3;LR=0.1;DRT=naive;
DR=0;MTYPE=pdbrnn;EP=20;DC=0;DCS=10;BS=4;RNN=1024;

NAME=$MTYPE"_"layer$LAYER"_"rnn$RNN"_"sgd$LR"_"$DRT$DR;

CUDA_VISIBLE_DEVICES=$CUDAN THC_CACHING_ALLOCATOR=0 th train.lua \
             -data $DATA/training_feats-train.t7 \
             -save_model $MODEL/$NAME -encoder_type $MTYPE -report_every 1 \
             -rnn_size $RNN -word_vec_size 50 -layers $LAYER -max_batch_size $BS \
             -learning_rate $LR -dropout_type $DRT -dropout $DR \
             -learning_rate_decay $DC -end_epoch $EP -pdbrnn_merge concat \
             -residual false -gpuid 1 -log_file $LOGOUT/$NAME &

Here is the training log:
[07/25/17 19:26:37 INFO] Using GPU(s): 1
[07/25/17 19:26:37 INFO] Training Sequence to Sequence with Attention model...
[07/25/17 19:26:37 INFO] Loading data from 'how-to-data/training_feats-train.t7'...
[07/25/17 19:35:50 INFO]  * vocabulary size: source = *120; target = 74
[07/25/17 19:35:50 INFO]  * additional features: source = -; target = 0
[07/25/17 19:35:50 INFO]  * maximum sequence length: source = 2432; target = 246
[07/25/17 19:35:50 INFO]  * number of training sentences: 55100
[07/25/17 19:35:50 INFO]  * number of batches: 14681
[07/25/17 19:35:50 INFO]    - source sequence lengths: equal
[07/25/17 19:35:50 INFO]    - maximum size: 4
[07/25/17 19:35:50 INFO]    - average size: 3.75
[07/25/17 19:35:50 INFO]    - capacity: 100.00%
[07/25/17 19:35:50 INFO] Building model...
[07/25/17 19:35:50 INFO]  * Encoder:
[07/25/17 19:35:50 INFO]    - type: pyramidal deep bidirectional RNN
[07/25/17 19:35:50 INFO]    - structure: cell = LSTM; layers = 3; rnn_size = 1024; dropout = 0 (naive)
[07/25/17 19:35:51 INFO]  * Decoder:
[07/25/17 19:35:51 INFO]    - word embeddings size: 50
[07/25/17 19:35:51 INFO]    - attention: global (general)
[07/25/17 19:35:51 INFO]    - structure: cell = LSTM; layers = 3; rnn_size = 1024; dropout = 0 (naive)
[07/25/17 19:35:52 INFO]  * Bridge: copy
[07/25/17 19:36:11 INFO] Initializing parameters...
[07/25/17 19:37:01 INFO]  * number of parameters: 88372926
[07/25/17 19:37:01 INFO] Preparing memory optimization...
[07/25/17 19:37:28 INFO]  * sharing 70% of output/gradInput tensors memory between clones
[07/25/17 19:37:28 INFO] Start training from epoch 1 to 20...
[07/25/17 19:37:28 INFO] 
[07/25/17 19:50:35 INFO] Epoch 1 ; Iteration 1/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 497 ; Perplexity 114.86
[07/25/17 19:52:42 INFO] Epoch 1 ; Iteration 2/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 2488 ; Perplexity 109.19
[07/25/17 19:52:44 INFO] Epoch 1 ; Iteration 3/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 71837 ; Perplexity 115.07
[07/25/17 19:52:46 INFO] Epoch 1 ; Iteration 4/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 47624 ; Perplexity 114.02
[07/25/17 19:52:49 INFO] Epoch 1 ; Iteration 5/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 100877 ; Perplexity 122.69
[07/25/17 20:39:01 INFO] Epoch 1 ; Iteration 6/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 164 ; Perplexity 106.74
[07/25/17 20:39:03 INFO] Epoch 1 ; Iteration 7/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 82978 ; Perplexity 114.27
.
.
.
[07/25/17 22:57:53 INFO] Epoch 1 ; Iteration 1834/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 100836 ; Perplexity 44.11
[07/25/17 22:57:56 INFO] Epoch 1 ; Iteration 1835/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 84555 ; Perplexity 47.83
[07/25/17 22:58:05 INFO] Epoch 1 ; Iteration 1836/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 87526 ; Perplexity 44.11
[07/25/17 22:58:11 INFO] Epoch 1 ; Iteration 1837/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 29780 ; Perplexity nan
[07/25/17 22:58:14 INFO] Epoch 1 ; Iteration 1838/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 84150 ; Perplexity nan
[07/25/17 22:58:17 INFO] Epoch 1 ; Iteration 1839/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 75360 ; Perplexity nan
[07/25/17 22:58:19 INFO] Epoch 1 ; Iteration 1840/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 81100 ; Perplexity nan
[07/25/17 22:58:20 INFO] Epoch 1 ; Iteration 1841/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 69073 ; Perplexity nan
[07/25/17 22:58:27 INFO] Epoch 1 ; Iteration 1842/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 85101 ; Perplexity nan
[07/25/17 22:58:33 INFO] Epoch 1 ; Iteration 1843/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 87261 ; Perplexity nan
[07/25/17 22:58:34 INFO] Epoch 1 ; Iteration 1844/14681 ; Optim SGD LR 0.1000 ; Source tokens/s 82721 ; Perplexity nan

Any insights on why this might be happening? How can I troubleshoot this problem?

Thanks!
Shruti

Hi,
Are you training directly from ark features to transcriptions?
Maybe check whether some target entries are empty/blank or have zero length.
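A quick sanity check along these lines, using the labels.tr / labels.cv paths from your preprocess command (assuming one transcription per line), could be:

grep -c '^$' how-to-data/labels.tr how-to-data/labels.cv    # count completely empty lines per file
awk 'NF == 0 { print FILENAME ": line " FNR }' how-to-data/labels.tr how-to-data/labels.cv    # locate blank or whitespace-only lines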

Hi Vince, I am converting the arks to text using Kaldi copy-feats before training. I checked whether any utterances were empty or had length 0, but that's not the case. I have used this data with a different model where it works fine.

Also, this training is taking too long: about 12 hours for just 1 epoch. The total data is about 90 hours of speech.

Hello @Shruti, NaN reflects some issue in the forward or backward calculation path that we might have overlooked. Since the problem seems to be happening quite fast, would you mind trying to narrow it down to a subset of your data that you could share so we can investigate?
Can you also try -pdbrnn_merge sum, just to see if we can pinpoint where the problem is happening? As for the speed, it does indeed seem far too slow: what is your hardware?
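For reference, that is just the same training command as above with the merge option swapped (a sketch reusing the variables from your script; only -pdbrnn_merge changes):

CUDA_VISIBLE_DEVICES=$CUDAN THC_CACHING_ALLOCATOR=0 th train.lua \
             -data $DATA/training_feats-train.t7 \
             -save_model $MODEL/$NAME -encoder_type $MTYPE -report_every 1 \
             -rnn_size $RNN -word_vec_size 50 -layers $LAYER -max_batch_size $BS \
             -learning_rate $LR -dropout_type $DRT -dropout $DR \
             -learning_rate_decay $DC -end_epoch $EP -pdbrnn_merge sum \
             -residual false -gpuid 1 -log_file $LOGOUT/$NAME &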
Thanks
Jean

Hi Jean, thanks for looking into this. What do you mean by "the problem seems to be happening quite fast"? I have tried running on 10% and 50% of my data and it runs fine. How can I see which utterance IDs in a particular minibatch are causing the problem?

My hardware is a Titan X GPU, so these slow speeds are very unexpected. On data of a similar size, training did not take this long. Does parameter tuning also affect training speed?

In the log I shared, please take a look at the timestamps after the 5th minibatch: the model took over 45 minutes to complete minibatch 6. I have run the training again, and it shows exactly the same problems as reported above.

Hi, the NaN perplexity problem was solved by reducing the learning rate.
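(In case it helps others: that just means lowering the LR variable before relaunching train.lua, for example

LR=0.01;   # illustrative value only; the original run used 0.1
NAME=$MTYPE"_"layer$LAYER"_"rnn$RNN"_"sgd$LR"_"$DRT$DR;

and then running the same th train.lua command as before, which passes it via -learning_rate $LR.)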

But the system is still very slow: it takes 20 hours for 1 epoch of ~14,700 iterations (90 hours of speech data, batch size 4). There are periods when training just stalls: memory stays allocated on the GPU but there is no GPU utilization. This happens more often when the data size is large.
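For what it's worth, I can document these stall periods by logging GPU utilization next to the training run, e.g. with something like

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 > log/gpu_usage.csv &

which samples utilization and memory every 5 seconds (the output file name is arbitrary).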

I also encountered a scenario where the training crashed: the log did not update and no models/checkpoints were saved, but the GPU was used continuously (memory + processing).

Please look into optimizing this soon.

Thank you!


Can you increase the batch size?

Hi Shruti,

the learning rate should not be leading to this NaN unless there is an obvious divergence problem, so we need to keep this NaN in mind until we can reproduce it on a small set.

Regarding speed: I see that there are huge variations in throughput from one mini-batch to the next.

This is very odd. I wonder whether there are some totally wrong entries in your data that could explain both problems. Can you also share your preprocess logs?

Also, as Guillaume mentioned, your batch size is 4. Did you try increasing it?
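Concretely that just means raising the BS variable before relaunching, for example:

BS=8;   # illustrative value; the launch script passes it to train.lua via -max_batch_size $BS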

Thanks
Jean

Increasing the batch size beyond 4 causes my system to crash: the GPU runs out of memory, so I need to keep it at 4. Here are my preprocessing logs:
[07/28/17 15:28:20 INFO] Preparing vocabulary…
[07/28/17 15:28:20 INFO] * Building target vocabularies…
[07/28/17 15:28:28 INFO] * Created word dictionary of size 74 (pruned from 74)
[07/28/17 15:28:28 INFO]
[07/28/17 15:28:28 INFO] Preparing training data…
[07/28/17 17:15:02 INFO] … shuffling sentences
[07/28/17 17:15:04 INFO] … sorting sentences by size
[07/28/17 17:15:06 INFO] Prepared 55129 sentences:
[07/28/17 17:15:06 INFO] * 368 sequences not validated (length, other)
[07/28/17 17:15:06 INFO] * average sequence length: source = 554.2, target = 56.7
[07/28/17 17:15:06 INFO] * % of unknown words: source = 0.0%, target = 0.0%
[07/28/17 17:15:06 INFO] * source sentence length (range of 10): [ 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 97% ]
[07/28/17 17:15:06 INFO] * target sentence length (range of 10): [ 1% ; 9% ; 13% ; 13% ; 12% ; 10% ; 8% ; 7% ; 5% ; 16% ]
[07/28/17 17:15:06 INFO]
[07/28/17 17:15:06 INFO] Preparing validation data…
[07/28/17 17:21:09 INFO] … shuffling sentences
[07/28/17 17:21:09 INFO] … sorting sentences by size
[07/28/17 17:21:09 INFO] Prepared 2923 sentences:
[07/28/17 17:21:09 INFO] * 18 sequences not validated (length, other)
[07/28/17 17:21:09 INFO] * average sequence length: source = 564.7, target = 57.6
[07/28/17 17:21:09 INFO] * % of unknown words: source = 0.0%, target = 0.0%
[07/28/17 17:21:09 INFO] * source sentence length (range of 10): [ 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 0% ; 1% ; 97% ]
[07/28/17 17:21:09 INFO] * target sentence length (range of 10): [ 0% ; 11% ; 13% ; 11% ; 13% ; 11% ; 8% ; 6% ; 5% ; 18% ]
[07/28/17 17:21:09 INFO]
[07/28/17 17:21:09 INFO] Saving target vocabulary to 'how-to-data/training_feats_try3.tgt.dict'…
[07/28/17 17:21:09 INFO] Saving data to 'how-to-data/training_feats_try3-train.t7'…

Thank you!