Word embedding before training

OpenNMT is a really cool project.

I’d like to use pre-trained word2vec embeddings for this model, but I’ve run into a problem.

For a source or target sentence in the training data, some words are not contained in the pre-trained word2vec model, so they will be mapped to UNK. How can I initialize the vector for UNK?

Thanks:)

I did it here:

I used this procedure:

  1. Build my own dict files from the data, with my own criterion. Don’t forget to add the <s> </s> <unk> <blank> tags in the dict files. For this step, you can also use the ONMT tools.
  2. Pre-process the training files, adding <s> and </s> tags and replacing all words not in the dict files with <unk> (see the sketch after this list).
  3. Run w2v training on the preprocessed files, saving the result as GloVe files.
  4. Convert the GloVe files into t7 files using the ONMT tools.
  5. Use the t7 files as pre-trained embeddings.
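
Here is a rough Python sketch of steps 2 and 3, just to make them concrete. It assumes a recent gensim (4.x), a dict file src.dict from step 1 and a training file train.src (both names are made up), and it is not the exact code I used:

```python
# Steps 2-3: preprocess the corpus, then train word2vec on it (sketch only).
from gensim.models import Word2Vec

SPECIALS = {"<s>", "</s>", "<unk>", "<blank>"}

def load_vocab(dict_path):
    """Read the dict file built in step 1 (one word per line, or 'word freq')."""
    with open(dict_path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()} | SPECIALS

def preprocess(line, vocab):
    """Step 2: add sentence markers and map out-of-vocabulary words to <unk>."""
    tokens = line.strip().split()
    return ["<s>"] + [w if w in vocab else "<unk>" for w in tokens] + ["</s>"]

vocab = load_vocab("src.dict")          # hypothetical dict file from step 1
with open("train.src", encoding="utf-8") as f:
    corpus = [preprocess(line, vocab) for line in f]

# Step 3: train word2vec on the preprocessed corpus; <s>, </s> and <unk> are
# now ordinary tokens, so they get their own trained vectors.
model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, workers=4)

# Save in GloVe-style text format (word then its values, no header line),
# ready for the ONMT conversion tool in step 4.
with open("embeddings.glove.txt", "w", encoding="utf-8") as out:
    for word in model.wv.index_to_key:
        values = " ".join(f"{x:.6f}" for x in model.wv[word])
        out.write(f"{word} {values}\n")
```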

:wink:


Great, that’s really helpful.

If the pre-trained word2vec model doesn’t contain the <unk> tag, how can I initialize the vector for unk?

As far as I can tell, you can’t: it will just be assigned a random value. That’s the reason why I added it to the training files before w2v training… to get it, exactly as a word, in the w2v output embeddings.

Yes, but before training, it’s very difficult to choose which words should become UNK.

If I initialize the UNK vector randomly (or as the average of the rare words’ vectors), is that good or not?
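
To make it concrete, I mean something like this rough sketch (assuming a gensim Word2Vec model named model trained as in the procedure above, and an arbitrary 10% cutoff for “rare”):

```python
# Two possible ways to build an <unk> vector when it is not in the embeddings.
import numpy as np

dim = model.vector_size

# Option 1: random initialization
unk_random = np.random.uniform(-0.1, 0.1, size=dim)

# Option 2: average of the rarest words (here, the last 10% of the vocabulary,
# which gensim keeps sorted by descending frequency)
words = model.wv.index_to_key
rare = words[int(0.9 * len(words)):]
unk_average = np.mean([model.wv[w] for w in rare], axis=0)
```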

If you want, for example, a 50000-word vocabulary, you need to choose in advance which words go into the dicts, or the ONMT tools will make that choice for you. Then, all words not in those 50000-word dicts become <unk>.
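
Roughly, the cutoff amounts to something like this (a simplified sketch with a hypothetical file name, not the actual ONMT preprocessing code):

```python
# Keep the 50000 most frequent words; everything else will later map to <unk>.
from collections import Counter

counts = Counter()
with open("train.src", encoding="utf-8") as f:
    for line in f:
        counts.update(line.strip().split())

vocab = {w for w, _ in counts.most_common(50000)}
vocab |= {"<s>", "</s>", "<unk>", "<blank>"}
```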