Deep transformer with more layers

Hello,

I have been reading a lot of literature saying that with more encoder layers, the vanilla Transformer model would fail to train due to vanishing gradients. However, when I increase the number of encoder layers in the OpenNMT codebase, for example from 6 to 30 or 60, I do not observe any training failure or divergence compared to the vanilla Transformer. Could anyone advise on this? Thanks very much!

Hello @szhang42,
This is normal. The Transformer implementation in OpenNMT-py is NOT the vanilla version described in the original paper, which is also referred to as the Post-Norm Transformer. Instead, the OpenNMT-py codebase uses a Pre-Norm variant. You can find quite a few papers discussing these two variants, but in general, the Pre-Norm one is easier and more stable to train, especially with deep stacks.
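
For intuition, here is a minimal sketch of the two residual wirings (this is not the actual OpenNMT-py code; `sublayer` stands for self-attention or feed-forward, and the class names are mine):

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vanilla ("Post-Norm") Transformer sublayer: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The residual sum is normalized, so gradients must pass through
        # every LayerNorm on the way back, which hurts very deep stacks.
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-Norm sublayer: x + Sublayer(LayerNorm(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The identity path is left untouched, so gradients flow directly
        # through the residual connections even with 30-60 layers.
        return x + self.sublayer(self.norm(x))
```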

bibs:

  1. Learning Deep Transformer Models for Machine Translation
  2. Understanding the Difficulty of Training Transformers

Hello @Zenglinxiao

Got it! Thanks very much for the reply. I really appreciate it.