Training slows down within epoch + curriculum + TER

Hi, the training slows down as it progresses through the mini-batches within an epoch. For example, 100 mini-batches took less than 1 minute towards the beginning of training, while towards the end, 20 mini-batches take more than 7 minutes. Is this behavior common? (My dataset is large: ~90 hours of speech and 20,800 mini-batches (BS=8).)

I was using curriculum learning for this experiment. Is the “source tokens” value in the log where I should be able to see the increasing order of source lengths?

Also, I am trying to evaluate this experiment with TER. My log still reports perplexity during training. Is that expected?

Finally, in this experiment, after the first epoch finished training, the evaluation on the dev set entered what looks like an infinite loop, using 100% CPU and 100% memory (not GPU). There are no error logs, so I’m not sure what led to this.

I realize I’ve included a lot of questions in this post, but I would appreciate answers to all of them.

Thanking you,
Shruti

If you used the -curriculum option, then the slowdown is expected. Note the option description:

For this many epochs, order the minibatches based on source length (from smaller to longer).

The training will see input sequences of increasing length, which are increasingly costly to process.
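
To illustrate the idea, here is a rough Python sketch (not the actual OpenNMT code): when minibatches are ordered by source length, the last batches of the epoch contain the longest sequences and are therefore the most expensive to process.

```python
# Minimal sketch (not the OpenNMT implementation) of curriculum ordering:
# group sequences into minibatches sorted short-to-long by source length.
import random

def make_batches(sources, batch_size, curriculum=True):
    if curriculum:
        order = sorted(range(len(sources)), key=lambda i: len(sources[i]))
    else:
        order = random.sample(range(len(sources)), len(sources))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# With curriculum=True the final batches hold the longest utterances,
# so each step near the end of the epoch takes more time and memory.
```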


For the validation issue, there may be a bug. How large is your validation dataset? Does it work with another validation metric?

Hi Guillaume,

Yes, what you described for -curriculum is what I expected, but I did not see it in the logs: there was no increasing order in the “source tokens”. So I don’t understand why training was taking longer towards the end, unless “source tokens” doesn’t represent the result of -curriculum. Here is a sample log:

[08/05/17 13:46:15 INFO] Epoch 1 ; Iteration 20/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 3711 ; Perplexity 29.01
[08/05/17 13:46:21 INFO] Epoch 1 ; Iteration 40/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 44684 ; Perplexity 20.91
[08/05/17 13:46:25 INFO] Epoch 1 ; Iteration 60/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 66664 ; Perplexity 18.81
[08/05/17 13:46:33 INFO] Epoch 1 ; Iteration 80/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 48040 ; Perplexity 20.20
[08/05/17 13:46:39 INFO] Epoch 1 ; Iteration 100/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 61129 ; Perplexity 19.01
[08/05/17 13:46:44 INFO] Epoch 1 ; Iteration 120/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 74471 ; Perplexity 19.40
[08/05/17 13:46:51 INFO] Epoch 1 ; Iteration 140/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 57274 ; Perplexity 17.73
...
[08/06/17 00:37:22 INFO] Epoch 1 ; Iteration 20700/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 12357 ; Perplexity 2.46
[08/06/17 00:45:18 INFO] Epoch 1 ; Iteration 20720/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 10046 ; Perplexity 2.68
[08/06/17 01:12:16 INFO] Epoch 1 ; Iteration 20739/20739 ; Optim SGD LR 0.1000 ; Source tokens/s 3265 ; Perplexity 2.55
[08/06/17 01:12:16 INFO] Evaluating on the validation dataset...

For the validation issue, my validation set is about 4 hours. And yes, this set worked when evaluating perplexity.

Also, as you can see, the model seems to have learned a lot in the first epoch itself, but when I test it, the PPL is not good. Could this be a case of overfitting?

The logs report source tokens per second, so this value should decrease as the training slows down; it does not show the sequence lengths themselves.
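
If you want to verify it, a quick sketch like this (based on the log format you pasted; the log file name is just a placeholder) extracts the throughput at each reported iteration:

```python
# Pull "Source tokens/s" values out of the training log to see how
# throughput evolves over the epoch.
import re

PATTERN = re.compile(r"Iteration (\d+)/\d+ .* Source tokens/s (\d+)")

def throughput(log_path):
    points = []
    with open(log_path) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                points.append((int(match.group(1)), int(match.group(2))))
    return points  # list of (iteration, source tokens per second)

print(throughput("train.log")[-5:])  # "train.log" is a placeholder path
```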

Depends on many things. What is your task again?

My task is speech recognition.

So this is related to the issue you reported:

How is the test accuracy (or whatever metric you are using)?

Hi Guillaume,

Yes, when I reported the low PPL a few days ago, I think the problem might have been the mismatched utterances. I hadn’t taken a closer look at that time.

My error rate as of now is 87%, but training is very slow, so the system hasn’t converged yet. Each epoch takes ~18 hours with batch size 4 (I cannot use a larger batch size because of memory issues). Any pointers on how I can speed this up?

Hi @Shruti, an error rate (WER/CER?) of 87% seems far too high. For reference, on an equivalent task training on the WSJ dataset, a pyramidal BRNN decreases memory usage and scheduled sampling increases convergence speed.
Reducing the maximal sequence length and the network size will also reduce your memory usage, so you can increase the batch size. Keep us updated.
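
To give an idea of why the pyramidal encoder saves memory, here is a small sketch of the principle only (not the OpenNMT implementation): each layer concatenates pairs of adjacent frames, halving the number of time steps.

```python
# Sketch of the pyramidal reduction: each "layer" halves the time
# dimension by concatenating adjacent frames, so per-layer memory and
# compute shrink accordingly.
import numpy as np

def pyramid_reduce(frames):
    """frames: (T, D) feature matrix -> (T // 2, 2 * D)."""
    if frames.shape[0] % 2:
        frames = frames[:-1]          # drop the odd last frame
    return frames.reshape(-1, 2 * frames.shape[1])

feats = np.random.randn(1000, 40)     # e.g. 1000 frames of 40-dim features
for _ in range(3):
    feats = pyramid_reduce(feats)
print(feats.shape)                    # (125, 320): 8x fewer time steps
```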

Hi Jean,

This is CER. Yes, I am waiting to try out the scheduled sampling feature.
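
(For clarity, by CER I mean the character-level edit distance divided by the number of reference characters; a minimal sketch of how I compute it:)

```python
# Character error rate: Levenshtein distance over reference length.
def edit_distance(ref, hyp):
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("hello world", "helo wrld"))  # 2 edits / 11 chars ~= 0.18
```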

The pyramidal encoder uses about 4.3 GB of memory for me (BS=4), while my GPU has ~12 GB. But if I increase the batch size, it goes out of memory.

What is the average or max source length in the Kaldi ark feats? Is it the number of frames in each utterance?
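
If it is, I suppose something like this would give me the stats (assuming the third-party kaldi_io Python package; "feats.ark" is just a placeholder path):

```python
# Count frames per utterance in a Kaldi ark of features.
import kaldi_io  # https://github.com/vesis84/kaldi-io-for-python

lengths = [mat.shape[0] for _, mat in kaldi_io.read_mat_ark("feats.ark")]
print("utterances:", len(lengths))
print("avg frames:", sum(lengths) / len(lengths))
print("max frames:", max(lengths))
```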