Fixed vs trainable embeddings

After some experiments, I’ve found that trainable embeddings improve model performance a lot.

My question is:

Imagine I train a first model with trainable embeddings and, once training is done, I extract those embeddings.

I then train a second model with an identical structure, but this time I use the fixed embeddings extracted from the first one.
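For concreteness, here is a minimal PyTorch sketch of that two-step setup. It is only an illustration under assumptions: `TinyModel`, `VOCAB_SIZE` and `EMB_DIM` are hypothetical stand-ins, not the actual architecture or sizes, and the training loop is left out.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM = 100, 16  # hypothetical sizes, for illustration only


class TinyModel(nn.Module):
    """Hypothetical stand-in for the real model; only the embedding layer matters here."""

    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding
        self.proj = nn.Linear(EMB_DIM, VOCAB_SIZE)

    def forward(self, token_ids):
        return self.proj(self.embedding(token_ids))


# First model: embeddings are ordinary trainable parameters.
first_model = TinyModel(nn.Embedding(VOCAB_SIZE, EMB_DIM))
# ... train first_model here ...

# Extract a detached copy of the learned embedding matrix.
extracted = first_model.embedding.weight.detach().clone()

# Second model: identical structure, embeddings loaded and frozen.
second_model = TinyModel(nn.Embedding.from_pretrained(extracted, freeze=True))

# Hand only the parameters that still require gradients to the optimizer.
trainable_params = [p for p in second_model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params)
```

With `freeze=True`, the embedding weights have `requires_grad=False`, so they stay exactly as extracted while the rest of the second model trains.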

Can I hope to get similar performance from both models? Or will the second one still suffer from having fixed embeddings?

I plan to answer this question by trying it, but I don’t have the resources for now.

I don’t think we can make a general statement about this.
Model tuning is very specific to each task, so you will have to experiment.

I have some results to share.

Following your advice, I won’t risk generalizing, but here is what I got.

(1) I trained a classic transformer on the WMT corpus with trainable embeddings. (2) Then I trained the exact same transformer on the same corpus, but this time with pretrained embeddings (obtained via an external tool) that were kept fixed during training.

The difference in results (mainly using BLEU on newstest2017/2018) is huge.

(3) Finally, I trained a third transformer (still with the same configuration) whose embeddings were also pretrained and fixed. However, instead of getting them from an external tool, I simply extracted them from the first transformer.

There is no difference in results between the first and third experiments, which implies that using embeddings extracted from the first transformer is equivalent to letting the model train its own embeddings. Convergence even seems faster in this third experiment.

To complete my results I’ll add two points:

  • (4) I ran another transformer similar to the third one, with embeddings extracted from the first model, but this time trainable. Results are very similar to the third: there is not much gain and convergence is not really faster, meaning that the model cannot improve much by fine-tuning its embeddings.

  • I reproduced all these experiments on a corpus slightly different from WMT. Results are very similar.
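The four embedding setups compared above can be sketched in PyTorch as follows. This is an illustrative sketch, not the actual experimental code: `build_embedding` and `one_sgd_step` are hypothetical helpers, the sizes are arbitrary, and random tensors stand in for both the external-tool vectors and the vectors extracted from model (1).

```python
import torch
import torch.nn as nn


def build_embedding(vocab_size, dim, pretrained=None, trainable=True):
    """Hypothetical helper: random or pretrained initialization, frozen or trainable."""
    if pretrained is None:
        emb = nn.Embedding(vocab_size, dim)
        emb.weight.requires_grad_(trainable)
        return emb
    return nn.Embedding.from_pretrained(pretrained.clone(), freeze=not trainable)


torch.manual_seed(0)
vocab_size, dim = 50, 8                   # arbitrary illustrative sizes
external = torch.randn(vocab_size, dim)   # stand-in for vectors from an external tool
extracted = torch.randn(vocab_size, dim)  # stand-in for vectors extracted from model (1)

setups = {
    "1_trainable_from_scratch": build_embedding(vocab_size, dim),
    "2_external_fixed": build_embedding(vocab_size, dim, external, trainable=False),
    "3_extracted_fixed": build_embedding(vocab_size, dim, extracted, trainable=False),
    "4_extracted_trainable": build_embedding(vocab_size, dim, extracted, trainable=True),
}


def one_sgd_step(emb):
    """Run one SGD step on a toy loss; return True if the embedding weights changed."""
    head = nn.Linear(dim, 1)  # toy layer standing in for the rest of the model
    params = [p for p in list(emb.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.SGD(params, lr=0.1)
    before = emb.weight.detach().clone()
    tokens = torch.randint(0, vocab_size, (4, 5))
    loss = head(emb(tokens)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return not torch.equal(before, emb.weight.detach())


changed = {name: one_sgd_step(emb) for name, emb in setups.items()}
```

In this sketch the frozen setups (2) and (3) keep their weights untouched by the update, while (1) and (4) receive gradients like any other parameter.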

It would take too long to write out the exact conditions of my experiments, so I can provide more details if needed.

The difference in results (mainly using BLEU on newstest2017/2018) is huge.

Which one is best?

(1), with trainable embeddings, is best.