In OpenNMT-tf, I added a class to use pretrained embeddings with the Transformer, like below:
import opennmt as onmt
import tensorflow as tf

class Transformer(onmt.models.Transformer):
    def __init__(self, dtype=tf.float32):
        super(Transformer, self).__init__(
            source_inputter=onmt.inputters.WordEmbedder(
                vocabulary_file_key="source_words_vocabulary",
                embedding_file_key="src_embedding",
                embedding_size=512,
                dtype=dtype),
            target_inputter=onmt.inputters.WordEmbedder(
                vocabulary_file_key="target_words_vocabulary",
                embedding_file_key="tgt_embedding",
                embedding_size=512,
                dtype=dtype),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048,
            dropout=0.1,
            attention_dropout=0.1,
            relu_dropout=0.1)
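(Side note: to sanity-check the dimension issue I ask about in question 1 below, I read the dimension straight from the embedding file with a small snippet like this; the file name is just an example.)

def pretrained_embedding_dim(path, with_header=False):
    """Read the vector dimension from a text-format embedding file."""
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
    if with_header:
        # word2vec text format: first line is "<vocab_size> <dim>"
        return int(first[1])
    # GloVe format: each line is "<token> v1 v2 ... vd"
    return len(first) - 1

print(pretrained_embedding_dim("glove.6B.300d.txt"))  # prints 300 for this file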
Here are my questions:
1. I assume embedding_file_key points to a GloVe or word2vec embedding file, but those files generally have a dimension of 300 or 200. Can I still set embedding_size=512 in that case?
2. As when using pretrained embeddings in OpenNMT-py (see "How to use GloVe pre-trained embeddings in OpenNMT-py"), there are many missing embeddings. Here are the OpenNMT-py results:
- enc: 20925 match, 8793 missing, (70.41%)
- dec: 20923 match, 13342 missing, (61.06%)
Filtered embeddings:
- enc: torch.Size([29718, 300])
- dec: torch.Size([34265, 200])
How does the system handle the missing embeddings? Are the embeddings of the missing tokens randomly initialized at the beginning of training, and are both the missing and the matched embeddings then updated during training? (See the sketch below for what I imagine happens.)
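To make question 2 concrete, here is a minimal sketch of what I imagine happens; this is not the actual OpenNMT code, just my understanding, and the name build_embedding_matrix is made up:

import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, scale=0.1):
    """vocab: list of tokens; pretrained: dict mapping token -> vector of shape (dim,)."""
    # every row starts from a random initialization
    matrix = np.random.uniform(-scale, scale, (len(vocab), dim)).astype(np.float32)
    matched = 0
    for i, token in enumerate(vocab):
        vector = pretrained.get(token)
        if vector is not None:
            matrix[i] = vector  # matched token: overwrite with the pretrained vector
            matched += 1
    print("%d match, %d missing" % (matched, len(vocab) - matched))
    # the whole matrix would then become one trainable variable,
    # so both matched and missing rows get updated during training
    return matrix

Is that roughly what both OpenNMT-tf and OpenNMT-py do with the missing tokens?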