The residual connection in the code is applied as x + dropout( sublayer( layer_norm( x ) ) ),
while the formula in the paper and the tutorial is layer_norm( x + dropout( sublayer( x ) ) ).
Is this a mistake in your code, or is it a deliberate choice?
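
To make the difference concrete, here is a minimal pure-Python sketch of the two orderings as I understand them. It is not taken from the repository: `sublayer` is a hypothetical stand-in that just doubles its input, `dropout` is replaced by the identity (as in evaluation mode), and `layer_norm` has no learned gain or bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dropout(x):
    """Identity stand-in for dropout (assumption: evaluation mode)."""
    return list(x)

def sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: doubles each element."""
    return [2.0 * v for v in x]

def pre_norm(x):
    # Ordering used in the code: x + dropout(sublayer(layer_norm(x)))
    return [a + b for a, b in zip(x, dropout(sublayer(layer_norm(x))))]

def post_norm(x):
    # Ordering in the paper and tutorial: layer_norm(x + dropout(sublayer(x)))
    return layer_norm([a + b for a, b in zip(x, dropout(sublayer(x)))])

x = [1.0, 2.0, 4.0]
pre = pre_norm(x)
post = post_norm(x)
print("pre-norm :", pre)
print("post-norm:", post)
```

With this toy input the post-norm output has zero mean, while the pre-norm output keeps the mean of x, so the two orderings are not equivalent.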
Thanks