How can I use embeddings?


On the embeddings page there is a guide for mapping pretrained word2vec vectors.

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict -save_data data/demo-src-emb

I think that -embed_file data/GoogleNews-vectors-negative300.bin refers to the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file I generated during preprocessing?

I don’t quite understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.

Thank you.

(Guillaume Klein) #2


The script will iterate over the embedding file and assign the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
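To illustrate, here is a minimal sketch (in Python, not the actual Lua tool) of that mapping logic: each vocabulary word gets its pretrained vector if one exists, and a random vector otherwise. The `pretrained` and `vocab` names here are hypothetical stand-ins for the word2vec file and the *.dict vocabulary.

```python
import random

def build_embedding_table(vocab, pretrained, dim):
    """Return one vector per vocabulary word."""
    table = {}
    for word in vocab:
        if word in pretrained:
            # Word covered by the pretrained file: reuse its vector.
            table[word] = pretrained[word]
        else:
            # Out-of-coverage word: random initialization.
            table[word] = [random.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

pretrained = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
vocab = ["cat", "dog", "zebra"]  # "zebra" has no pretrained vector
emb = build_embedding_table(vocab, pretrained, dim=3)
print(emb["cat"])  # → [0.1, 0.2, 0.3]
```

The resulting table is saved as a Torch tensor so training can load it directly.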


Thanks for your reply!

After this process, do I have to pass these options when I train my files?

-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.
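For reference, a hypothetical training invocation would look something like the following (the file names are placeholders; the actual paths of the generated tensors are printed in tools/embeddings.lua's logs):

```
th train.lua -data data/demo-train.t7 \
  -pre_word_vecs_enc data/src-embeddings.t7 \
  -save_model demo-model
```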

When I use my own embedding files, is performance always improved?
I am not so sure, because of what is mentioned here:

When training with small amounts of data, performance can be improved by starting with pretrained embeddings.

Will pretrained embeddings also be updated during training?
(Guillaume Klein) #4

Yes, just pass the generated embedding files as detailed in tools/embeddings.lua's logs.

During the first epochs, usually yes, but if you have lots of data the gain from pretrained embeddings will be less clear.