Using word2vec embeddings in onmt.py

david · March 3, 2018, 11:20pm

Hello, I was just wondering whether using word2vec embeddings is exactly the same process as with GloVe.

My word2vec text file has a first line that states the vocabulary size and vector size, followed by one vector per line (of size 100 in this case). Also, I have only created embeddings for the target language.
So far so good. However, do I need to do anything extra to this file, or can I just do this:

./tools/embeddings_to_torch.py -emb_file "myWord2vec_emb.txt" \
-dict_file "data/data.vocab.pt" \
-output_file "data/embeddings"

Then do I just add these options to the train.py command?

-word_vec_size 100 \
-pre_word_vecs_enc "data/embeddings.enc.pt" \
-pre_word_vecs_dec "data/embeddings.dec.pt"

Best wishes,
David

pltrdy · March 5, 2018, 10:59am

Hi,

From what I’ve seen, the only difference between textual word2vec and GloVe is that first line, which we can just ignore.
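
To illustrate the point about the header line, here is a minimal sketch of reading both text formats. It is not the code in embeddings_to_torch.py; the function name `read_text_embeddings` and the `skip_header` parameter are hypothetical, and the tiny 3-dimensional vectors are made up for the example.

```python
import io


def read_text_embeddings(lines, skip_header=False):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats.

    skip_header=True drops the first line, which in the word2vec text
    format holds '<vocab_size> <vector_size>' instead of a vector.
    (Hypothetical helper, for illustration only.)
    """
    vectors = {}
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue  # word2vec header line, not an embedding
        parts = line.split()
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Same two made-up vectors in each format; only the first line differs.
word2vec_text = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
glove_text = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

w2v = read_text_embeddings(io.StringIO(word2vec_text), skip_header=True)
glove = read_text_embeddings(io.StringIO(glove_text))
```

Without `skip_header=True`, the header `2 3` would be read as a word "2" with a one-element vector, which is why that line has to be dropped for word2vec files.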

I’ve submitted a pull request (https://github.com/OpenNMT/OpenNMT-py/pull/580) so that you can use:

./tools/embeddings_to_torch.py -emb_file myWord2vec_emb.txt -dict_file data/data.vocab.pt -output_file data/embeddings -type word2vec

You can pull this code by running

git remote add pltrdy https://github.com/pltrdy/OpenNMT-py.git
git pull pltrdy word2vec_to_torch

david · March 6, 2018, 12:04am

Nice one. Thank you very much. I will give it a whirl.

himanshudce · June 28, 2018, 6:03pm

Hi,
I am working on a machine translation task, and I want to use two different word2vec models on two different vocabularies. But OpenNMT-py generates only one vocab.pt. How can I convert the two vocabularies to vectors using two different word2vec models?