Choosing the number of epochs for a stacked encoder-decoder model

performance

(Jeffrey Michael) #1

I’m training an encoder-decoder model on an English-to-Japanese dataset. The results show that the baseline model (1-layer encoder, 1-layer decoder, 12 epochs) achieves better training accuracy than the model with a 2-layer encoder and 3-layer decoder at epoch 12. Aren’t more layers supposed to overfit and increase training accuracy? Or is there an empirical rule for choosing the number of epochs for a specific set of encoder-decoder layers?
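
For concreteness, here is a rough PyTorch sketch of the two configurations being compared (assuming LSTM layers; embeddings, attention, and the output projection are omitted, and the sizes are placeholders, not the actual assignment settings):

```python
import torch.nn as nn

EMB, HID = 256, 512  # placeholder embedding / hidden sizes

# Baseline: 1-layer encoder, 1-layer decoder
enc_base = nn.LSTM(EMB, HID, num_layers=1, batch_first=True)
dec_base = nn.LSTM(EMB, HID, num_layers=1, batch_first=True)

# Stacked variant: 2-layer encoder, 3-layer decoder
enc_deep = nn.LSTM(EMB, HID, num_layers=2, batch_first=True)
dec_deep = nn.LSTM(EMB, HID, num_layers=3, batch_first=True)
```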


(Guillaume Klein) #2

Hello,

How much training data did you use?


(Jeffrey Michael) #3

The training set is 10,000 sentences and the validation set is 500 sentences.


(Guillaume Klein) #4

A bigger model means more parameters, which require more iterations to be properly optimized. This implies more epochs or more data.
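
To give a rough sense of the gap, here is a quick parameter count for the two configurations (a sketch with the same placeholder sizes as above; exact numbers depend on your embedding/hidden dimensions and whether you use attention):

```python
import torch.nn as nn

EMB, HID = 256, 512  # placeholder embedding / hidden sizes

def n_params(m: nn.Module) -> int:
    # Total number of trainable parameters in a module
    return sum(p.numel() for p in m.parameters())

baseline = n_params(nn.LSTM(EMB, HID, num_layers=1)) \
         + n_params(nn.LSTM(EMB, HID, num_layers=1))
stacked  = n_params(nn.LSTM(EMB, HID, num_layers=2)) \
         + n_params(nn.LSTM(EMB, HID, num_layers=3))

print(f"baseline (1+1 layers): {baseline:,}")  # ~3.2M parameters
print(f"stacked  (2+3 layers): {stacked:,}")   # ~9.5M parameters
```

Roughly three times the parameters to fit from the same 10,000 sentences, so it is not surprising the stacked model lags behind at the same epoch count.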

Note that 10,000 examples are not enough to train an NMT system (at least a decent one). People usually work with 1M+ sentences.


(Jeffrey Michael) #5

Thanks for the reply. This is an academic assignment, which is why we are dealing with such a small corpus. We are only studying the effect of multiple layers on perplexity and BLEU on a training set.

Due to GPU constraints, I was looking for a range of epoch values instead of tuning the value. Thanks again.
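
For reference, one common way to sidestep picking an epoch count at all is early stopping on validation perplexity. This is a generic sketch; `train_one_epoch` and `validate` are hypothetical callables standing in for the real training pass and evaluation loop:

```python
import math

def train_with_early_stopping(train_one_epoch, validate, patience=3, max_epochs=50):
    """Train until validation perplexity stops improving.

    `train_one_epoch` and `validate` are hypothetical stand-ins for the
    actual training and validation-perplexity routines.
    """
    best_ppl = math.inf
    stale = 0  # epochs since the last improvement
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        ppl = validate()
        print(f"epoch {epoch}: validation perplexity {ppl:.2f}")
        if ppl < best_ppl:
            best_ppl, stale = ppl, 0   # new best: reset the patience counter
        else:
            stale += 1
            if stale >= patience:      # no improvement for `patience` epochs
                break
    return best_ppl
```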