I have a question about SubLayerConnection in the transformer: is the code wrong?


(Neo) #1

In the git code and the tutorial, the residual connection is applied as x + dropout( sublayer( layer_norm( x ) ) ), while the formula in the paper and the tutorial is layer_norm( x + dropout( sublayer( x ) ) ).
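To make the difference concrete, here is a minimal dependency-free sketch of the two orderings. It is only an illustration, not the repository's implementation: the helper names (`layer_norm`, `pre_norm`, `post_norm`, `toy_sublayer`, `identity_dropout`) are hypothetical, the layer norm has no learned gain or bias, and dropout is replaced by a no-op as it would be at inference time.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def identity_dropout(x):
    """Assumed stand-in for dropout at inference time, where it does nothing."""
    return x


def pre_norm(x, sublayer, dropout=identity_dropout):
    """Ordering found in the code: x + dropout(sublayer(layer_norm(x)))."""
    out = dropout(sublayer(layer_norm(x)))
    return [a + b for a, b in zip(x, out)]


def post_norm(x, sublayer, dropout=identity_dropout):
    """Ordering given in the paper: layer_norm(x + dropout(sublayer(x)))."""
    out = dropout(sublayer(x))
    return layer_norm([a + b for a, b in zip(x, out)])


def toy_sublayer(x):
    """Hypothetical sublayer used only for this demo: doubles each element."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm(x, toy_sublayer)
post = post_norm(x, toy_sublayer)
```

In the first ordering the input x reaches the output through the residual path without being normalized, so the result is not zero-mean in general. In the second ordering the normalization is the last step, so the result always is. These two arrangements are commonly called "pre-norm" and "post-norm".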

Is this a mistake in your code, or is it a deliberate choice?


(Guillaume Klein) #2

See here:

(Neo) #3

aha, thank you