OpenNMT Forum

In a Transformer model, why does one sum positional encoding to the embedding rather than concatenate it?

This question comes from here, but is unanswered.

Can someone here answer it, please?

It seems kind of crazy to ADD the positional encoding to the embedding and thus, seemingly irreversibly, confound the two signals. Do we just expect NNs to be so good at disentangling things? Why give the network unnecessary work that would hurt learning? Or am I missing something? Does "add" here actually mean concatenate?

Concatenation would make more sense.

So why does the standard Transformer implementation work with addition?
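To make the question concrete: in the standard Transformer, "add" really is an elementwise sum, not concatenation. A minimal NumPy sketch of the sinusoidal encoding from the original paper (function name and shapes are my own, for illustration):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Standard sinusoidal encoding: even dims get sin, odd dims get cos,
    # with geometrically increasing wavelengths across dimensions.
    pos = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # shape (1, d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 4, 8
embeddings = np.random.randn(seq_len, d_model)  # stand-in token embeddings

# Elementwise sum: output keeps the same dimensionality as the embedding.
x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(x.shape)  # (4, 8); concatenation would instead give (4, 16)
```

Note the practical consequence: summing keeps `d_model` fixed, so every downstream weight matrix stays the same size, whereas concatenation would double the input width of the first projection layers.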
