Missing words in embedding

sadanyh · August 13, 2020, 9:22am

Hi,

I am training an Arabic-English model and have created my own pre-trained word embeddings for both the source and the target datasets. When I use embeddings_to_torch.py to prepare the embeddings for OpenNMT, I always get a high percentage of missing words on either the encoder or the decoder side. How can this be, since every word in my data is included in the VSM I created? Also, what happens to the missing vectors during training: how are they initialized, and how much does this affect my accuracy?
I also tried to use embeddings_to_torch.py with only my encoder-side embeddings, but it won’t let me. Do I have to supply both encoder and decoder pre-trained embeddings? Can I give only my encoder embeddings in the emb-all file?
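For reference, this is roughly how the coverage could be checked independently of the script. It is only a minimal sketch, not OpenNMT's own code: it assumes the embeddings are in a GloVe-style text format (one word per line followed by its vector, separated by spaces), and the function names `load_embedding_words` and `coverage` are hypothetical. Listing the actual missing tokens may show whether the mismatch comes from casing, tokenization differences, or special tokens that are in the vocabulary but not in the embedding file.

```python
import io


def load_embedding_words(lines, skip_header=False):
    """Collect the word (first space-separated field) from each embedding line."""
    words = set()
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue  # word2vec text files start with a "<count> <dim>" line
        word = line.rstrip().split(" ", 1)[0]
        if word:
            words.add(word)
    return words


def coverage(vocab, emb_words):
    """Return (percent of vocab found in emb_words, list of missing tokens)."""
    missing = [w for w in vocab if w not in emb_words]
    matched = len(vocab) - len(missing)
    pct = 100.0 * matched / len(vocab) if vocab else 0.0
    return pct, missing


# Toy data standing in for a real embedding file and a real vocabulary.
emb_file = io.StringIO("the 0.1 0.2\nكتاب 0.3 0.4\n")
vocab = ["the", "كتاب", "The", "<unk>"]

emb_words = load_embedding_words(emb_file)
pct, missing = coverage(vocab, emb_words)
print(f"matched: {pct:.1f}%")
print("missing:", missing)
```

With a real file, `emb_file` would be replaced by `open(path, encoding="utf-8")` and `vocab` by the token list taken from the preprocessed data.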
Thank you
Hadeel