Hi, training slows down as it progresses through the mini-batches within an epoch. For example, 100 mini-batches took <1 minute towards the beginning of training, while towards the end 20 mini-batches take >7 minutes. Is this behavior common? (My dataset is large: ~90 hours of speech, 20800 mini-batches with batch size 8.)
I was using curriculum learning for the above experiment. Is the “source tokens” value in the log where I should see the counts increasing over the epoch?
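To be clear about what I expected, here is a rough sketch of the ordering I had in mind (plain Python, purely illustrative; `utterances` and the batching loop are my own simplification, not the toolkit's code):

```python
# Illustrative only: the kind of ordering I expected -curriculum to give.
# `utterances` is a placeholder list of (source_frames, target_tokens) pairs.

def curriculum_batches(utterances, batch_size=8):
    # Sort from shortest to longest source, so early batches hold short
    # utterances and later batches hold long ones.
    ordered = sorted(utterances, key=lambda u: len(u[0]))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```

If that is what -curriculum does, the “source tokens” count should grow over the epoch, and the longer batches at the end would also explain the slower steps.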
Also, I am trying to evaluate this experiment on TER, but my log still reports perplexity for training. Is that expected?
Finally, in this experiment, after the first epoch finished training, the evaluation on the dev set entered what looks like an infinite loop, using 100% CPU and 100% memory but no GPU. There are no error logs, so I’m not sure what caused this.
I realize I’ve packed a lot of questions into this post, but I’d appreciate answers to all of them.
Yes, what you described for -curriculum is what I expected, but it is not what I see in the logs: there is no increasing order in the “source tokens”. So I don’t understand why training takes longer towards the end, unless “source tokens” doesn’t reflect the effect of -curriculum. Here is a sample log:
For the validation issue, my validation set is about 4 hours. And yes, this set worked when evaluating perplexity.
Also, as you might see, the model seems to have learnt a lot in the first epoch itself, but when I test it, the PPL is not good. Could this be a case of overfitting?
Yes, when I reported the low PPL a few days ago, I think the problem might have been the mismatched utterances. I hadn’t taken a closer look that time.
My error rate as of now is 87%, though training is very slow for me, so the system hasn’t converged yet. Each epoch takes ~18 hours with batch size 4 (I cannot use a higher one because of memory issues). Any pointers on how I can speed this up?
Hi @Shruti, an error rate (WER/CER?) of 87% seems far too high. For reference, on an equivalent task trained on the WSJ dataset, a pyramidal BRNN decreases memory usage and scheduled sampling speeds up convergence.
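For intuition, the pyramidal trick halves the time axis before each recurrent layer by concatenating adjacent frames, which is where the memory saving comes from. A minimal PyTorch sketch, not the toolkit's actual implementation (class and names are mine):

```python
import torch.nn as nn

class PyramidalBLSTMLayer(nn.Module):
    """One pyramidal BLSTM layer: halves the time axis by concatenating
    adjacent frames before the recurrence (a sketch, not the toolkit's code)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Each step now sees a pair of concatenated frames, hence 2 * input_size.
        self.blstm = nn.LSTM(2 * input_size, hidden_size,
                             batch_first=True, bidirectional=True)

    def forward(self, x):               # x: (batch, time, input_size)
        b, t, d = x.size()
        if t % 2 == 1:                  # drop the last frame if time is odd
            x = x[:, :-1, :]
            t -= 1
        x = x.contiguous().view(b, t // 2, 2 * d)   # merge adjacent frame pairs
        out, _ = self.blstm(x)          # out: (batch, time // 2, 2 * hidden_size)
        return out
```

Stacking three such layers turns a 1000-frame input into roughly 125 encoder steps, which is what shrinks the memory the attention has to cover.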
Reducing the maximum sequence length and the network size will also reduce your memory usage, so you can increase the batch size. Keep us updated.
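For the sequence-length part, something along these lines before building batches is usually enough; the limits below are made up, so tune them to your data (a hypothetical helper, not a toolkit option):

```python
# Drop utterances longer than a cap before batching: the longest utterance
# in a batch dictates how much padded memory the whole batch uses.

def filter_by_length(utterances, max_src_frames=1500, max_tgt_tokens=200):
    kept = [(feats, tokens) for feats, tokens in utterances
            if len(feats) <= max_src_frames and len(tokens) <= max_tgt_tokens]
    print(f"kept {len(kept)} of {len(utterances)} utterances")
    return kept
```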