This question comes from here, but is unanswered. Can someone answer it here instead, please?
It seems kind of crazy to ADD the positional encoding to the token embeddings that feed the attention layers and thus, seemingly irreversibly, confound the two. Do we just expect NNs to be that good at disentangling things? And why give the network that unnecessary extra work, which would presumably slow its learning? Or am I missing something? Does "add" here actually mean concatenate?
Concatenation would seem to make more sense.
So why does the standard Transformer implementation work?
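
To pin down what I mean by "add" vs. "concatenate", here is a minimal NumPy sketch of the two options; the shapes and variable names are just my own illustration, not taken from any particular implementation:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 10, 512
tok_emb = np.random.randn(seq_len, d_model)   # stand-in for token embeddings
pe = sinusoidal_encoding(seq_len, d_model)

# Option 1: what the paper does -- element-wise ADDITION.
# Shape stays (seq_len, d_model); position and content share the same dimensions.
x_added = tok_emb + pe

# Option 2: what I'd naively expect -- CONCATENATION.
# Shape becomes (seq_len, 2 * d_model); position and content stay separate,
# but every downstream weight matrix has to grow to match.
x_concat = np.concatenate([tok_emb, pe], axis=-1)

print(x_added.shape)   # (10, 512)
print(x_concat.shape)  # (10, 1024)
```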