OpenNMT Forum

How can I use embeddings?

The embeddings page has a guide for mapping pretrained word2vec vectors:
http://opennmt.net/OpenNMT/training/embeddings/

th tools/embeddings.lua -embed_type word2vec -embed_file data/GoogleNews-vectors-negative300.bin -dict_file data/demo.src.dict -save_data data/demo-src-emb

I think -embed_file data/GoogleNews-vectors-negative300.bin is the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file I generate during preprocessing?

I don’t quite understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.

Thank you.

Yes.

The script iterates over the embedding file and assigns the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
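
To make that lookup concrete, here is a minimal Python sketch of the idea. It is not the code of tools/embeddings.lua: the function name `assign_pretrained`, the toy vocabulary, and the unit-variance normal initialization are assumptions for illustration only.

```python
import random


def assign_pretrained(vocab, embedding_rows, dim, seed=0):
    """Build one vector per vocabulary word.

    vocab: list of words, in dictionary order (hypothetical toy input).
    embedding_rows: iterable of (word, vector) pairs, standing in for
        the rows of a pretrained embedding file.
    Returns (matrix, number_of_matched_words).
    """
    rng = random.Random(seed)
    index = {word: i for i, word in enumerate(vocab)}

    # Start with a random vector for every word (assumed: standard normal).
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

    # Iterate over the pretrained rows and overwrite the random vector
    # whenever the word is present in the vocabulary.
    matched = set()
    for word, vector in embedding_rows:
        i = index.get(word)
        if i is not None and i not in matched:
            matrix[i] = list(vector)
            matched.add(i)
    return matrix, len(matched)


vocab = ["<unk>", "car", "house"]
rows = [
    ("car", [0.1, 0.2, 0.3]),
    ("tree", [0.7, 0.8, 0.9]),   # not in the vocabulary, so ignored
    ("house", [0.4, 0.5, 0.6]),
]
matrix, matched = assign_pretrained(vocab, rows, dim=3)
print(f"{matched}/{len(vocab)} embeddings matched with dictionary tokens")
```

In this toy run "<unk>" keeps its random vector, while "car" and "house" get their pretrained ones.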

Thanks for your reply!

After this process, do I have to pass these options when I train?

-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.

When I use my own embedding files, is performance always improved?
I am not sure, because of what is mentioned here:
http://opennmt.net/OpenNMT/training/embeddings/

When training with small amounts of data, 
performance can be improved by starting with pretrained embeddings. 

Yes, just pass the generated embedding files as detailed in tools/embeddings.lua's logs.

During the first epochs, usually yes, but if you have lots of data the gain from pretrained embeddings will be less clear.

Hello.

I tried to use word embeddings from GoogleNews-vectors-negative300.bin with -embed_type word2vec-bin.

I get this result for my English vocabulary:

[08/21/18 15:41:34 INFO] … 3000000 embeddings processed (2157/9004 matched with the dictionary)
[08/21/18 15:41:34 INFO] * 2157/9004 embeddings matched with dictionary tokens
[08/21/18 15:41:36 INFO] * 0 approximate lookup
[08/21/18 15:41:36 INFO] * 6847/9004 vocabs randomly assigned with a normal distribution

And I get this result for my English vocabulary:

[08/21/18 15:44:08 INFO] … 1800000 embeddings processed (8243/9004 matched with the dictionary)
[08/21/18 15:44:16 INFO] * 8243/9004 embeddings matched with dictionary tokens
[08/21/18 15:44:16 INFO] * 0 approximate lookup
[08/21/18 15:44:16 INFO] * 761/9004 vocabs randomly assigned with a normal distribution

After that I trained my en-ru model with the predefined word embeddings, and I get bad translation results.
For example: car. > Я .
But when I train another model on the same source without predefined word embeddings, I get:
For example: car. > машина.

Any help with it?
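
One thing worth reading off the logs above is how much of each vocabulary actually received a pretrained vector. The following is a small Python sketch that computes that coverage from the "embeddings matched" lines; the regular expression and the `coverage` helper are my own illustration, not part of OpenNMT.

```python
import re

# The two "matched" lines copied from the logs above.
LOG_LINES = [
    "[08/21/18 15:41:34 INFO] * 2157/9004 embeddings matched with dictionary tokens",
    "[08/21/18 15:44:16 INFO] * 8243/9004 embeddings matched with dictionary tokens",
]

MATCH_RE = re.compile(r"\* (\d+)/(\d+) embeddings matched with dictionary tokens")


def coverage(line):
    """Return matched/total as a fraction, or None if the line does not match."""
    m = MATCH_RE.search(line)
    if m is None:
        return None
    matched, total = int(m.group(1)), int(m.group(2))
    return matched / total


for line in LOG_LINES:
    print(f"{coverage(line):.1%} of the vocabulary has a pretrained vector")
```

For the first log this gives roughly 24%, so about three quarters of that vocabulary was initialized randomly, versus roughly 92% coverage in the second log. A low match rate like the first one may be worth checking before training, for example whether the tokenization and casing of the dictionary match those of the embedding file, though that is only one possible cause.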

I want to use BERT pre-trained embeddings as input to the neural network models. How am I supposed to integrate them with OpenNMT?

Have a look at this thread: Bidirectional transformers in OpenNMT-tf
If you want pretrained embeddings in a shared multilingual space, you could already use the aligned fastText embeddings: https://fasttext.cc/docs/en/aligned-vectors.html