Using word2vec embeddings in onmt.py


#1

Hello, I was just wondering whether using word2vec embeddings is exactly the same process as with GloVe.

My word2vec text file has a first line that states the vocabulary size and the vector size. It is followed by lines of vectors (of size 100 in this case). Also, I have only created embeddings for the target language.
So far so good. However, do I need to do anything extra to this file, or can I just do this:

./tools/embeddings_to_torch.py -emb_file "myWord2vec_emb.txt" \
-dict_file "data/data.vocab.pt" \
-output_file "data/embeddings"

Then do I just add these lines to the train.py command:

-word_vec_size 100 \
-pre_word_vecs_enc "data/embeddings.enc.pt" \
-pre_word_vecs_dec "data/embeddings.dec.pt"

best wishes
David


(Pltrdy) #2

Hi,

From what I’ve seen, the only difference between a textual word2vec file and a GloVe file is that first line, which we can just ignore.
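To illustrate the point about the first line, here is a minimal sketch in plain Python. It is not the code from the pull request: `load_text_embeddings` and its `skip_header` parameter are hypothetical names made up for this example, and the tiny in-memory files are invented. It assumes space-separated `word v1 v2 ...` lines, with a `vocab_size vector_size` header only in the word2vec case.

```python
import io


def load_text_embeddings(handle, skip_header=False):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats.

    skip_header=True drops the first line, which in a textual word2vec
    file holds the vocabulary size and vector size rather than a vector.
    (Hypothetical helper for illustration only.)
    """
    vectors = {}
    for line_no, line in enumerate(handle):
        if skip_header and line_no == 0:
            continue
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Invented toy data: same vectors, with and without the word2vec header.
word2vec_text = "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"
glove_text = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n"

w2v = load_text_embeddings(io.StringIO(word2vec_text), skip_header=True)
glove = load_text_embeddings(io.StringIO(glove_text))
print(w2v == glove)
```

If the header were not skipped, the line `2 3` would be read as a word `2` with a one-dimensional vector, which is why the two formats need slightly different handling.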

I’ve submitted a pull request (https://github.com/OpenNMT/OpenNMT-py/pull/580) so that you can use:

./tools/embeddings_to_torch.py -emb_file myWord2vec_emb.txt -dict_file data/data.vocab.pt -output_file data/embeddings -type word2vec

You can pull this code by running:

git remote add pltrdy https://github.com/pltrdy/OpenNMT-py.git
git pull pltrdy word2vec_to_torch

#3

Nice one. Thank you very much. I will give it a whirl.


(himanshu) #4

Hi,
I am working on a machine translation task, so I want to use two different word2vec models on two different vocabularies. But OpenNMT-py generates only one vocab.pt. How can I convert two different vocabularies to vectors using two different word2vec models?
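
As a rough sketch of the general idea, independent of OpenNMT-py's internals: the commands earlier in this thread refer to separate embeddings.enc.pt and embeddings.dec.pt outputs, so the encoder and decoder sides each get their own matrix. The plain-Python example below builds one matrix per side from two separate vocabularies and two separate sets of vectors. Everything in it is an assumption for illustration: `build_matrix`, the toy vocabularies, and the toy vectors are hypothetical, and it does not read vocab.pt or write .pt files.

```python
import random


def build_matrix(vocab, vectors, dim, seed=0):
    """Return one row per vocabulary entry, in vocabulary order.

    Tokens found in `vectors` get their pretrained vector; all other
    tokens (e.g. <unk>) get a random vector of the same size.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        if token in vectors:
            matrix.append(list(vectors[token]))
        else:
            matrix.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return matrix


# Invented toy data: one vocabulary and one embedding set per language.
src_vocab = ["<unk>", "hund", "katze"]
tgt_vocab = ["<unk>", "dog", "cat"]
src_vectors = {"hund": [0.1, 0.2], "katze": [0.3, 0.4]}
tgt_vectors = {"dog": [0.5, 0.6], "cat": [0.7, 0.8]}

enc_matrix = build_matrix(src_vocab, src_vectors, dim=2)  # source side
dec_matrix = build_matrix(tgt_vocab, tgt_vectors, dim=2)  # target side
print(len(enc_matrix), len(dec_matrix))
```

The point of the sketch is only that each side is matched against its own embedding source; how the two vocabularies are stored on disk is a separate matter.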