I have been reading a lot of literature saying that with more layers in the encoder, the vanilla Transformer model would fail to train due to vanishing gradients. However, when I increase the number of encoder layers in the OpenNMT codebase, for example from 6 to 30 or 60, I do not observe any failure or divergent results compared to the vanilla Transformer. Could anyone advise on this? Thanks very much!
Hello @szhang42,
This is normal, because the Transformer implementation in OpenNMT-py is NOT the vanilla version described in the original paper, which is also referred to as the PostNorm Transformer. Instead, the OpenNMT-py codebase implements a PreNorm variant. You can find quite a few papers discussing these two variants, but in general the PreNorm one is easier and more stable to train.
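To make the difference concrete, here is a minimal PyTorch sketch (not the actual OpenNMT-py code) of the two residual/LayerNorm orderings. The `sublayer` argument is a placeholder for either the self-attention or feed-forward module; in PreNorm the residual path stays an identity connection, which is why gradients remain well-behaved even with very deep encoders.

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vanilla ("Attention Is All You Need") ordering: LayerNorm is applied
    AFTER the residual addition, so the gradient must pass through a norm
    at every layer, which can hurt training for very deep stacks."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


class PreNormBlock(nn.Module):
    """PreNorm ordering: LayerNorm is applied to the sublayer input,
    leaving the residual path as a clean identity connection."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```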