In OpenNMT-tf, I added a class to use pretrained embeddings with the Transformer, like below:
import opennmt as onmt
import tensorflow as tf

class Transformer(onmt.models.Transformer):
    def __init__(self, dtype=tf.float32):
        super(Transformer, self).__init__(
            source_inputter=onmt.inputters.WordEmbedder(
                vocabulary_file_key="source_words_vocabulary",
                embedding_file_key="src_embedding",
                embedding_size=512,
                dtype=dtype),
            target_inputter=onmt.inputters.WordEmbedder(
                vocabulary_file_key="target_words_vocabulary",
                embedding_file_key="tgt_embedding",
                embedding_size=512,
                dtype=dtype),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048,
            dropout=0.1,
            attention_dropout=0.1,
            relu_dropout=0.1)
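(Side note: to sanity-check the dimension issue I ask about in question 1 below, I read the dimension straight from the embedding file with a small snippet like this; the file name is just an example.)

def pretrained_embedding_dim(path, with_header=False):
    """Read the vector dimension from a text-format embedding file."""
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
    if with_header:
        # word2vec text format: first line is "<vocab_size> <dim>"
        return int(first[1])
    # GloVe format: each line is "<token> v1 v2 ... vd"
    return len(first) - 1

print(pretrained_embedding_dim("glove.6B.300d.txt"))  # prints 300 for this file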
Here are my questions:
1. I assume embedding_file_key points to a GloVe or word2vec embedding file, but those files generally have a dimension of 300 or 200. Can I still set embedding_size=512 in that case?
2. As when using pretrained embeddings in OpenNMT-py (see "How to use GloVe pre-trained embeddings in OpenNMT-py"), there are many missing embeddings. Here are the OpenNMT-py results:
- enc: 20925 match, 8793 missing, (70.41%)
- dec: 20923 match, 13342 missing, (61.06%)
Filtered embeddings:
- enc: torch.Size([29718, 300])
- dec: torch.Size([34265, 200])
How does the system handle the missing embeddings? Are the embeddings of the missing tokens randomly initialized at the beginning of training, and are both the missing and the matched embeddings then updated during training? (See the sketch below for what I imagine happens.)
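To make question 2 concrete, here is a minimal sketch of what I imagine happens; this is not the actual OpenNMT code, just my understanding, and the name build_embedding_matrix is made up:

import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, scale=0.1):
    """vocab: list of tokens; pretrained: dict mapping token -> vector of shape (dim,)."""
    # every row starts from a random initialization
    matrix = np.random.uniform(-scale, scale, (len(vocab), dim)).astype(np.float32)
    matched = 0
    for i, token in enumerate(vocab):
        vector = pretrained.get(token)
        if vector is not None:
            matrix[i] = vector  # matched token: overwrite with the pretrained vector
            matched += 1
    print("%d match, %d missing" % (matched, len(vocab) - matched))
    # the whole matrix would then become one trainable variable,
    # so both matched and missing rows get updated during training
    return matrix

Is that roughly what both OpenNMT-tf and OpenNMT-py do with the missing tokens?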