Following your advice, I won't take the risk of generalizing, but here is what I got:
(1) I trained a classic transformer on a WMT corpus with trainable embeddings.
(2) Then I trained the exact same transformer on the same corpus, but this time using pretrained embeddings (obtained via an external tool), kept fixed during training.
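For concreteness, the fixed-embedding setup of (2) boils down to something like the following minimal PyTorch sketch (I'm not claiming this is my exact code; the file name, vocabulary alignment and dimensions are placeholders):

```python
import numpy as np
import torch
import torch.nn as nn

# Externally pretrained vectors (e.g. word2vec/fastText output), assumed to be
# already aligned to the model's vocabulary and saved as a numpy array.
vectors = torch.from_numpy(np.load("external_embeddings.npy")).float()

# freeze=True keeps the embedding matrix fixed during training, as in (2);
# experiment (1) instead uses a randomly initialized, trainable nn.Embedding.
emb = nn.Embedding.from_pretrained(vectors, freeze=True)
```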
The difference in results (measured mainly with BLEU on newstest2017/2018) is huge.
(3) Next, I trained a third transformer (still with the same configuration), again with pretrained and fixed embeddings. However, instead of getting them from an external tool, I simply extracted them from the first transformer.
There is no difference in results between the first and the third experiments, implying that using the embeddings extracted from the first transformer is equivalent to letting the model train its own embeddings; convergence even seems faster in this third experiment.
(4) I ran another transformer, similar to the third: with the embeddings extracted from the first model, but this time trainable. Results are very similar to the third experiment; there is not much gain and convergence is not really faster, meaning the model cannot improve much by fine-tuning its embeddings.
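Experiments (3) and (4) only differ in whether the extracted matrix is frozen; in PyTorch terms it amounts to something like this sketch (the checkpoint path and parameter name are placeholders for whatever your framework saves):

```python
import torch
import torch.nn as nn

# Load the checkpoint of the first (fully trained) transformer and pull out
# its learned embedding matrix; the key name depends on the implementation.
state = torch.load("transformer1_checkpoint.pt", map_location="cpu")
learned = state["src_embedding.weight"].clone()  # assumed parameter name

# freeze=True  -> experiment (3): extracted embeddings, kept fixed
# freeze=False -> experiment (4): extracted embeddings, fine-tuned further
emb = nn.Embedding.from_pretrained(learned, freeze=True)
```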
I reproduced all these experiments on a slightly different corpus from WMT; the results are very similar.
It would take too long to write out the exact conditions of my experiments here, so I can provide more details if needed.