OpenNMT Forum

Question about using pretrained embeddings

opennmt-tf

#1

Hi @guillaumekln,

In opennmt-tf, I added a class to use pretrained embeddings with the Transformer, like below:

import tensorflow as tf
import opennmt as onmt

class Transformer(onmt.models.Transformer):

  def __init__(self, dtype=tf.float32):
    super(Transformer, self).__init__(
        source_inputter=onmt.inputters.WordEmbedder(
            vocabulary_file_key="source_words_vocabulary",
            embedding_file_key="src_embedding",
            embedding_size=512,
            dtype=dtype),
        target_inputter=onmt.inputters.WordEmbedder(
            vocabulary_file_key="target_words_vocabulary",
            embedding_file_key="tgt_embedding",
            embedding_size=512,
            dtype=dtype),
        num_layers=6,
        num_units=512,
        num_heads=8,
        ffn_inner_dim=2048,
        dropout=0.1,
        attention_dropout=0.1,
        relu_dropout=0.1)
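
For reference, my understanding is that each embedding_file_key names an entry in the data block of the YAML run configuration that points at the pretrained embedding file. A rough sketch of that block, written here as the parsed Python dict with placeholder paths (the exact schema should be checked against the OpenNMT-tf documentation):

# Rough sketch (placeholder paths): the "data" block of the run configuration
# after YAML parsing. "src_embedding" / "tgt_embedding" match the
# embedding_file_key values used in the model definition above.
data = {
    "source_words_vocabulary": "data/src-vocab.txt",    # vocabulary files
    "target_words_vocabulary": "data/tgt-vocab.txt",
    "src_embedding": "data/glove.6B.300d.txt",           # pretrained source embeddings
    "tgt_embedding": "data/target-embeddings.vec",       # pretrained target embeddings
}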

Here are my questions:

1. I guess embedding_file_key refers to a GloVe or word2vec embedding file, but those embeddings generally have 200 or 300 dimensions, so can I still set embedding_size=512?

2. As with pretrained embeddings in OpenNMT-py (see "How to use GloVe pre-trained embeddings in OpenNMT-py"), there are many missing embeddings (OpenNMT-py output below):

• enc: 20925 match, 8793 missing (70.41%)
• dec: 20923 match, 13342 missing (61.06%)

Filtered embeddings:
• enc: torch.Size([29718, 300])
• dec: torch.Size([34265, 200])

How does the system handle the missing embeddings? Are the embeddings of the missing tokens randomly initialized at the beginning of training? And are both the missing and the matched embeddings then updated during training?


(Guillaume Klein) #2

You should not set embedding_size; it will be inferred from the embedding file. The Transformer model supports embedding sizes that are different from num_units.
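
For example, with a pretrained file the source inputter can simply be declared without embedding_size (and similarly for the target side):

# Sketch: drop embedding_size when embedding_file_key is set; the embedding
# dimension is then read from the pretrained file (e.g. 300 for a GloVe file)
# and does not have to match num_units=512. This goes inside the model
# definition above, where dtype is available.
source_inputter = onmt.inputters.WordEmbedder(
    vocabulary_file_key="source_words_vocabulary",
    embedding_file_key="src_embedding",
    dtype=dtype)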

Yes, the embeddings of the missing tokens are randomly initialized at the beginning of training.

Yes, both the missing and the matched embeddings are then updated during training.
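
In other words, the general approach is roughly the following (an illustrative sketch, not the exact OpenNMT code): the embedding matrix is first randomly initialized, the rows of tokens found in the pretrained file are overwritten with the pretrained vectors, and the whole matrix stays trainable.

import numpy as np

# Rough sketch of the usual strategy, not the exact OpenNMT implementation.
# `vocab` is the training vocabulary (a list of tokens) and `pretrained` maps
# tokens to vectors loaded from the GloVe/word2vec file.
def build_embedding_matrix(vocab, pretrained, dim, scale=0.1):
    # 1. Random initialization for every token in the vocabulary ("missing" rows).
    matrix = np.random.uniform(-scale, scale, size=(len(vocab), dim)).astype(np.float32)
    # 2. Overwrite the rows of tokens that have a pretrained vector ("match" rows).
    for index, token in enumerate(vocab):
        vector = pretrained.get(token)
        if vector is not None:
            matrix[index] = vector
    # 3. The matrix then becomes a trainable variable, so matched and missing
    #    rows are both updated during training.
    return matrix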


#3

Thanks very much for your help, @guillaumekln!
Is the setting above enough?


(Guillaume Klein) #4

none should be None. The rest looks good to me.