I have a question about SubLayerConnection in the transformer: is the code wrong?

In both the git code and the tutorial, the residual connection is applied as x + dropout( sublayer( layer_norm( x ) ) ), while the formula in the paper and in the tutorial text is layer_norm( x + dropout( sublayer( x ) ) ).
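To make the difference concrete, here is a minimal sketch of the two orderings in plain Python. It is not the repository's code: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward block, `layer_norm` has no learned gain or bias, and `dropout` is treated as the identity (as in evaluation mode).

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (roughly) unit variance.
    Simplified: no learned gain/bias, unlike a real LayerNorm module."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical sublayer used only for illustration: doubles each element."""
    return [2.0 * v for v in x]

def dropout(x):
    """Assumption: evaluation mode, so dropout is the identity."""
    return list(x)

def pre_norm(x, sublayer):
    """Ordering used in the code: x + dropout(sublayer(layer_norm(x)))."""
    branch = dropout(sublayer(layer_norm(x)))
    return [a + b for a, b in zip(x, branch)]

def post_norm(x, sublayer):
    """Ordering in the paper's formula: layer_norm(x + dropout(sublayer(x)))."""
    summed = [a + b for a, b in zip(x, dropout(sublayer(x)))]
    return layer_norm(summed)

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm(x, toy_sublayer)
post = post_norm(x, toy_sublayer)
print("pre-norm: ", pre)
print("post-norm:", post)
```

In the first ordering the input x passes through the residual path untouched, so the output is not normalized; in the second, the output of every sublayer is normalized, so the two variants generally produce different values.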

Is this a mistake in your code, or is it a deliberate choice?


See here:

aha, thank you