How can I use embeddings?


In embeddings page there is a guide to map pretrained word2vec vectors.

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict -save_data data/demo-src-emb

I think that -embed_file data/GoogleNews-vectors-negative300.bin refers to the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file generated during preprocessing?

I don’t quite understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.

Thank you.

(Guillaume Klein) #2


The script will iterate over the embedding file and assign the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
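The idea can be sketched as follows (a minimal Python/NumPy illustration of the mapping, not the actual Lua implementation; the toy vocabulary and vectors are made up):

```python
import numpy as np

# Toy vocabulary (word -> index), standing in for a *.dict file
# produced by preprocessing.
vocab = {"the": 0, "cat": 1, "zxqv": 2}

# Toy pretrained embeddings (word -> vector), standing in for word2vec.
dim = 4
pretrained = {"the": np.ones(dim), "cat": np.full(dim, 2.0)}

# Build the embedding matrix: use the pretrained vector when one exists,
# otherwise fall back to a random vector.
emb = np.empty((len(vocab), dim))
for word, idx in vocab.items():
    emb[idx] = pretrained.get(word, np.random.uniform(-1.0, 1.0, dim))

# emb now has one row per vocabulary word; row 2 ("zxqv") is random.
```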


Thanks for your reply!

After this process, do I have to set these options when I train my files?

-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.
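For example, the training command might look like this (a sketch; the embedding file names here are hypothetical placeholders — use the actual paths reported in the logs of tools/embeddings.lua):

```shell
th train.lua -data data/demo-train.t7 -save_model demo-model \
    -pre_word_vecs_enc data/demo-src-emb.t7 \
    -pre_word_vecs_dec data/demo-tgt-emb.t7
```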

When I use my own embedding files, is performance always improved?
I am not so sure because of what is mentioned here:

When training with small amounts of data, performance can be improved by starting with pretrained embeddings.

(Guillaume Klein) #4

Yes, just pass the generated embedding files as detailed in tools/embeddings.lua's logs.

During the first epochs, usually yes, but if you have lots of data the gain from pretrained embeddings will be less clear.