Embedding Training on Transformer


I am currently training a monolingual text generation model with a Transformer. I set up two conditions: one uses the fastText pretrained word embeddings (from the 157-language release) on both the source and the target side, and the other uses no pretrained word embeddings. Here are the parameters of my pretrained embedding:
path: /content/cc.id.300.vec
with_header: True
case_insensitive: False
trainable: True
Surprisingly, the model without any pretrained word embeddings scores higher. That leads me to several questions:

  1. If a model doesn’t use pretrained word embeddings, how are the embeddings initialized? Randomly, or with a uniform value?
  2. Is a model that uses pretrained word embeddings trained the same way as one that doesn’t?
  3. How does the embedding layer treat words that exist in the vocabulary but are not found in the pretrained embeddings? Are they updated during training, or left at a default value of 0?

Any help would be appreciated. Thank you.

All weights are initialized with the Glorot uniform initializer.
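
For reference, here is a minimal sketch of what Glorot uniform initialization does, written in plain Python rather than taken from the toolkit's code. The function name `glorot_uniform` and the matrix sizes are hypothetical and chosen only for illustration; the formula itself (sampling from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))) is the standard definition.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out matrix sampled from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)) is the standard Glorot/Xavier bound.
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


# Hypothetical sizes: a 1000-word vocabulary with 300-dimensional embeddings.
weights = glorot_uniform(1000, 300, seed=0)
limit = math.sqrt(6.0 / (1000 + 300))
```

So the values are random, not a single constant, and they stay within a small range around zero that shrinks as the matrix gets larger.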


They are initialized randomly and trained the same way as other embeddings.
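
To illustrate the idea, here is a rough sketch of how an embedding matrix could be filled from a `.vec` file with a header line. This is not the toolkit's actual loader: the function name `build_embedding_matrix`, the tiny in-memory file, and the choice of a standard normal draw (mean 0, standard deviation 1) for missing words are all assumptions made for the example.

```python
import io
import random


def build_embedding_matrix(vec_file, vocab, seed=None):
    """Fill a len(vocab) x dim matrix from a .vec file with a header line.

    Every row starts as a random normal draw (assumed mean 0, std 1), and
    rows for words found in the file are overwritten with the pretrained
    vector. Words missing from the file keep their random row.
    """
    rng = random.Random(seed)
    header = vec_file.readline().split()  # "<num_words> <dim>" (with_header: True)
    dim = int(header[1])
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]
    found = set()
    for line in vec_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Exact-match lookup only (case_insensitive: False); skip malformed lines.
        if word in index and len(values) == dim:
            matrix[index[word]] = [float(v) for v in values]
            found.add(word)
    return matrix, found


# Hypothetical three-word vocabulary; "nasi" is absent from the pretrained file.
demo_file = io.StringIO("2 3\nsaya 0.1 0.2 0.3\nmakan 0.4 0.5 0.6\n")
vocab = ["saya", "makan", "nasi"]
matrix, found = build_embedding_matrix(demo_file, vocab, seed=0)
```

With `trainable: True`, the whole matrix is then updated by the optimizer, so the randomly initialized rows and the pretrained rows are trained in the same way.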

Thank you for the answer.

One additional question: are they also initialized with the Glorot uniform initializer, or with a different initializer function?

They are initialized with a random normal distribution: