I think that -embed_file data/GoogleNews-vectors-negative300.bin refers to the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file I generated during preprocessing?
I don't fully understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.
The script will iterate over the embedding file and assign the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
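To make the relation between preprocessing and the embeddings script concrete, here is a rough sketch of the usual workflow for the Lua/Torch version. The -save_data output names below are only illustrative assumptions; check th preprocess.lua -h and th tools/embeddings.lua -h for the exact options and output names in your version.

```
# 1) Preprocessing builds the vocabularies (*.dict) and the training data package
th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt   -valid_tgt data/tgt-val.txt \
                  -save_data data/demo
# -> typically produces data/demo.src.dict, data/demo.tgt.dict, data/demo-train.t7

# 2) The embeddings script matches each entry of that dictionary against the
#    pretrained vectors and serializes the result as a Torch tensor
th tools/embeddings.lua -embed_type word2vec-bin \
                        -embed_file data/GoogleNews-vectors-negative300.bin \
                        -dict_file data/demo.src.dict \
                        -save_data data/demo-src-emb
```

So -dict_file is indeed the *.dict file produced by preprocessing: the embeddings script only keeps the pretrained vectors for words that appear in that dictionary.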
After this process, do I have to set these options when I train my model?
-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.
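For reference, a typical training invocation with the serialized embeddings might look like the sketch below. The -data / -save_model arguments and the embedding filenames are assumptions following on from the step above; adjust them to whatever your embeddings run actually produced, and make sure the word vector size matches the pretrained dimension (300 for GoogleNews).

```
th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc data/demo-src-emb-embeddings-300.t7 \
             -pre_word_vecs_dec data/demo-tgt-emb-embeddings-300.t7 \
             -word_vec_size 300
```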
I tried to use word embeddings from GoogleNews-vectors-negative300.bin with -embed_type word2vec-bin and got this result for my English vocabulary:
[08/21/18 15:41:34 INFO] … 3000000 embeddings processed (2157/9004 matched with the dictionary)
[08/21/18 15:41:34 INFO] * 2157/9004 embeddings matched with dictionary tokens
[08/21/18 15:41:36 INFO] * 0 approximate lookup
[08/21/18 15:41:36 INFO] * 6847/9004 vocabs randomly assigned with a normal distribution
And this result for my Russian vocabulary:
[08/21/18 15:44:08 INFO] … 1800000 embeddings processed (8243/9004 matched with the dictionary)
[08/21/18 15:44:16 INFO] * 8243/9004 embeddings matched with dictionary tokens
[08/21/18 15:44:16 INFO] * 0 approximate lookup
[08/21/18 15:44:16 INFO] * 761/9004 vocabs randomly assigned with a normal distribution
After that I trained my en-ru model with the pretrained word embeddings, and the translation results are bad.
For example: car. > Я .
But when I trained another model on the same source data without pretrained word embeddings, the result was fine.
For example: car. > машина.