Deep transformer with more layers


I have read much of the literature saying that with more encoder layers, the vanilla Transformer model would fail to train due to vanishing gradients. However, when I increase the number of encoder layers in the OpenNMT codebase, e.g. from 6 to 30 or 60, I do not observe any failure or divergence compared to the vanilla Transformer. Could anyone advise on this? Thanks very much!

Hello @szhang42,
This is normal, because the Transformer implementation in OpenNMT-py is NOT the vanilla version described in the original paper, which is also referred to as the PostNorm Transformer. Instead, the OpenNMT-py codebase uses a PreNorm variant. You can find quite a few papers discussing these two variants, but in general the PreNorm one is easier and more stable to train; see the sketch after the reference list below for where the two variants place LayerNorm.


  1. Learning Deep Transformer Models for Machine Translation
  2. Understanding the Difficulty of Training Transformers
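For intuition, here is a minimal PyTorch sketch, not the actual OpenNMT-py code: `PostNormBlock`, `PreNormBlock`, and the generic `sublayer` argument (standing in for self-attention or the feed-forward network) are illustrative names I made up, but the placement of LayerNorm matches the two variants discussed above.

```python
import torch.nn as nn


class PostNormBlock(nn.Module):
    """Vanilla (PostNorm) sublayer: LayerNorm is applied after the residual add."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual add first, then normalize: every layer's gradient must
        # pass through a LayerNorm, which makes very deep stacks harder to train.
        return self.norm(x + self.sublayer(x))


class PreNormBlock(nn.Module):
    """PreNorm sublayer: LayerNorm is applied to the sublayer input."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual path stays an identity mapping, so gradients flow
        # straight through the stack regardless of its depth.
        return x + self.sublayer(self.norm(x))
```

In the PreNorm case the residual path is a plain identity, which is why stacking 30 or 60 such layers can still train without diverging.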

Hello @Zenglinxiao

Got it! Thanks very much for the reply. I really appreciate it.