I’m training an encoder-decoder model on an English-to-Japanese dataset. The results show that the baseline model (1-layer encoder, 1-layer decoder, 12 epochs) achieves higher training accuracy than a model with a 2-layer encoder and 3-layer decoder at epoch 12. Aren’t more layers supposed to overfit and increase training accuracy? Or is there an empirical rule for choosing the number of epochs for a given encoder-decoder depth?
How much training data did you use?
The training set has 10,000 sentences and the validation set has 500.
A bigger model means more parameters, which require more iterations to be properly optimized. This implies more epochs or more data.
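To make the parameter-count point concrete, here is a rough back-of-the-envelope sketch (pure Python; it assumes an LSTM-based seq2seq with hypothetical embedding/hidden sizes of 512, a vocabulary of 8,000, and one bias vector per gate for simplicity — your actual architecture and sizes may differ):

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Rough LSTM parameter count: 4 gates per layer, each with input
    weights, recurrent weights, and a bias vector."""
    total = 0
    for layer in range(num_layers):
        in_dim = input_size if layer == 0 else hidden_size
        total += 4 * (hidden_size * in_dim + hidden_size * hidden_size + hidden_size)
    return total

def seq2seq_params(enc_layers, dec_layers, emb=512, hidden=512, vocab=8000):
    """Encoder + decoder LSTM stacks, source/target embeddings, output projection."""
    return (lstm_params(emb, hidden, enc_layers)
            + lstm_params(emb, hidden, dec_layers)
            + 2 * vocab * emb               # source + target embedding tables
            + hidden * vocab + vocab)       # output projection + bias

baseline = seq2seq_params(1, 1)  # 1-layer encoder, 1-layer decoder
deeper = seq2seq_params(2, 3)    # 2-layer encoder, 3-layer decoder
print(f"baseline: {baseline:,} params, deeper: {deeper:,} params "
      f"({deeper / baseline:.2f}x)")
```

Under these assumed sizes the deeper model has noticeably more parameters to fit in the same number of updates, which is consistent with it lagging the baseline at epoch 12 rather than overfitting.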
Note that 10,000 examples are not enough to train a decent NMT system. People usually work with 1M+ sentences.
Thanks for the reply. This is an academic assignment, which is why we are dealing with such a small corpus. Here we are only studying the effect of multiple layers on perplexity and BLEU on the training set.
Due to GPU constraints, I was looking for a range of epoch values instead of tuning the value. Thanks again.