In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?

This question comes from here, but is unanswered.

Can someone answer it here, please?

It seems kind of crazy to ADD the positional encoding to the word embedding and thus, seemingly irreversibly, confound the two. Do we just expect NNs to be so good at disentangling things? Why give the network unnecessary extra work, which would slow down learning? Or am I missing something? Does "add" here actually mean concatenate?

Concatenation would make more sense.
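For intuition, here is a minimal numpy sketch (toy sizes, sinusoidal encoding as in the original Transformer paper) contrasting the two options: summing keeps the model width at `d_model`, while concatenation doubles it, so every downstream projection matrix would have to grow as well.

```python
import numpy as np

d_model = 8   # embedding size (toy value)
seq_len = 4

rng = np.random.default_rng(0)
tok = rng.normal(size=(seq_len, d_model))   # token embeddings

# Sinusoidal positional encoding: sin on even dims, cos on odd dims
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model // 2)[None, :]
angles = pos / (10000 ** (2 * i / d_model))
pe = np.zeros((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

summed = tok + pe                      # what the paper does: same width
concat = np.concatenate([tok, pe], 1)  # the alternative: doubles the width

print(summed.shape)   # (4, 8)  -- downstream weights stay d_model wide
print(concat.shape)   # (4, 16) -- every projection matrix must grow
```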

So why does the standard Transformer implementation work this way?


Anyone . . ?

Adding positional encodings is no different from adding a bias in fully connected layers. Maybe that helps answer your question?
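One way to see why concatenation buys little: a linear layer applied to a concatenation is algebraically the same as summing two separately projected terms. A small numpy sketch (hypothetical toy sizes) of that identity:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=d)      # "word" part
p = rng.normal(size=d)      # "position" part

# A linear layer applied to the concatenation [x; p]...
W = rng.normal(size=(d, 2 * d))
out_concat = W @ np.concatenate([x, p])

# ...is exactly the sum of two separately projected terms:
out_sum = W[:, :d] @ x + W[:, d:] @ p

assert np.allclose(out_concat, out_sum)
```

So after the very first projection the network has mixed the two parts anyway; the model can always learn weights that achieve whatever separation (or mixing) of the concatenated parts it needs.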

Even with concatenation, there is nothing that tells the next layer which dimensions come from the word embedding and which come from the positional encoding, so the NN would still need to figure this out.