I have been reading a lot of literature saying that with more layers in the encoder, the vanilla Transformer model would fail to train due to vanishing gradients. However, when I increase the number of encoder layers in the OpenNMT codebase, for example from 6 to 30 or 60, I do not observe any failure or divergent results compared to the vanilla Transformer. Could anyone advise on this? Thanks very much!
Hello @szhang42,
This is normal, because the Transformer implementation in OpenNMT-py is NOT the vanilla version described in the original paper, which is also referred to as the PostNorm Transformer. Instead, the OpenNMT-py codebase implements a PreNorm variant. You can find quite a few papers discussing these two variants, but in general the PreNorm one is easier and more stable to train.
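To make the difference concrete, here is a minimal PyTorch sketch (not the actual OpenNMT-py code) of the two residual/LayerNorm orderings. The `sublayer` argument is a placeholder for either the self-attention or feed-forward module; in PreNorm the residual path stays an identity connection, which is why gradients remain well-behaved even with very deep encoders.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vanilla ("Attention Is All You Need") ordering: LayerNorm is applied
    AFTER the residual addition, so the gradient must pass through a norm
    at every layer, which can hurt training for very deep stacks."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


class PreNormBlock(nn.Module):
    """PreNorm ordering: LayerNorm is applied to the sublayer input,
    leaving the residual path as a clean identity connection."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```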