Currently, I am training a monolingual model for text generation using a Transformer. I set up two conditions: one uses the FastText pretrained word embeddings (from the 157-language release) for both the source and target language, and the other uses no pretrained word embeddings at all. Here are the parameters of my pretrained embedding:
Surprisingly, the condition without pretrained word embeddings scores higher. That leads me to several questions:
- If a model doesn't use pretrained word embeddings, how is the embedding layer initialized? Randomly, or with uniform values?
- Is a model that uses pretrained word embeddings trained the same way as one that doesn't?
- How does the embedding layer treat words that exist in the vocabulary but are not found in the pretrained embeddings? Are they updated during training, or left at a default value of 0?
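To make the questions concrete, here is a minimal sketch of what I mean by the two conditions (illustrative only, not my exact code; I'm assuming a PyTorch-style setup, with `fasttext_vectors` and `word_to_idx` standing in for the real loaded FastText table and vocabulary):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 300  # toy sizes for illustration

# Condition A: embedding layer with no pretrained vectors,
# left at whatever default initialization nn.Embedding uses.
emb_a = nn.Embedding(vocab_size, emb_dim)

# Condition B: same layer, but rows of words that exist in the
# pretrained (FastText-style) table are overwritten with those vectors.
fasttext_vectors = {"hello": torch.randn(emb_dim)}  # stand-in for real FastText vectors
word_to_idx = {"hello": 1}  # stand-in vocabulary mapping

emb_b = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():
    for word, idx in word_to_idx.items():
        if word in fasttext_vectors:
            emb_b.weight[idx] = fasttext_vectors[word]
```

In condition B, vocabulary words missing from the FastText table are simply never overwritten, which is exactly what my third question is about.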
Any help would be appreciated, thank you.