How can I use embeddings?


#1

On the embeddings page there is a guide for mapping pretrained word2vec vectors:
http://opennmt.net/OpenNMT/training/embeddings/

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict -save_data data/demo-src-emb

I think -embed_file data/GoogleNews-vectors-negative300.bin refers to the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file I generate during preprocessing?

I don't fully understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.

Thank you.


(Guillaume Klein) #2

Yes.
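
For reference, the *.dict files come from the preprocessing step. Assuming the quickstart data layout (the file names below are just the quickstart defaults), a run like this produces data/demo.src.dict and data/demo.tgt.dict:

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo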

The script will iterate over the embedding file and assign the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
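
Conceptually, the matching works roughly like this sketch (a simplified illustration in Python, not the actual tools/embeddings.lua code; vocab, pretrained and dim are placeholder names):

import numpy as np

dim = 300                                  # dimension of the pretrained vectors
vocab = ["the", "car", "xyzzy"]            # tokens read from the *.dict file (placeholder)
pretrained = {"the": np.zeros(dim),        # word -> vector read from the embedding file (placeholder)
              "car": np.ones(dim)}

embeddings = np.empty((len(vocab), dim))
matched = 0
for i, word in enumerate(vocab):
    if word in pretrained:
        embeddings[i] = pretrained[word]            # use the pretrained vector
        matched += 1
    else:
        embeddings[i] = np.random.normal(size=dim)  # unmatched words get a random vector
print("%d/%d embeddings matched with dictionary tokens" % (matched, len(vocab)))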


#3

Thanks for your reply!

After this process, do I have to pass these options when I train my model?

-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.

When I use my own embedding files, is performance always improved?
I am not so sure, because of what is mentioned here:
http://opennmt.net/OpenNMT/training/embeddings/

When training with small amounts of data, performance can be improved by starting with pretrained embeddings.

Will pretrained embeddings also be updated during training?


(Guillaume Klein) #4

Yes, just pass the generated embedding files as detailed in tools/embeddings.lua's logs.
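
For example (a sketch, not an exact command: the .t7 embedding file names below are placeholders, the real names are printed in the tool's logs, and -word_vec_size should match the dimension of your pretrained vectors):

th train.lua -data data/demo-train.t7 -save_model demo-model -word_vec_size 300 -pre_word_vecs_enc data/demo-src-emb-embeddings-300.t7 -pre_word_vecs_dec data/demo-tgt-emb-embeddings-300.t7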

During the first epochs, usually yes; but if you have lots of data, the gain from pretrained embeddings will be less clear.
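
If you want to prevent them from being updated, I believe train.lua also has -fix_word_vecs_enc and -fix_word_vecs_dec options to keep the embeddings fixed during training (check th train.lua -h for the exact accepted values).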


(Lpzreq) #5

Hello.

I tried to use word embeddings from GoogleNews-vectors-negative300.bin with -embed_type word2vec-bin.

And got this result for my English vocabulary:

[08/21/18 15:41:34 INFO] … 3000000 embeddings processed (2157/9004 matched with the dictionary)
[08/21/18 15:41:34 INFO] * 2157/9004 embeddings matched with dictionary tokens
[08/21/18 15:41:36 INFO] * 0 approximate lookup
[08/21/18 15:41:36 INFO] * 6847/9004 vocabs randomly assigned with a normal distribution

And this result for my Russian vocabulary:

[08/21/18 15:44:08 INFO] … 1800000 embeddings processed (8243/9004 matched with the dictionary)
[08/21/18 15:44:16 INFO] * 8243/9004 embeddings matched with dictionary tokens
[08/21/18 15:44:16 INFO] * 0 approximate lookup
[08/21/18 15:44:16 INFO] * 761/9004 vocabs randomly assigned with a normal distribution

After that, I trained my en-ru model with the pretrained word embeddings, and the translation results were bad.
For example: car. > Я .
But then I trained another model on the same data without pretrained word embeddings, and the result was fine.
For example: car. > машина.

Any help with this?