Word embedding before training

OpenNMT is a really cool project.

I’d like to use pre-trained word2vec embeddings for this model, but I’ve run into a problem.

For a source or target sentence in the training data, some words are not contained in the pre-trained word2vec model, so they will be mapped to UNK. How can I initialize the vector for UNK?

Thanks:)

I did it here:

I used this procedure:

  1. Build my own dict files from the data, with my own criterion. Don’t forget to add the <s> </s> <unk> <blank> tags in the dict files. For this step, you can also use the ONMT tools.
  2. Pre-process the training files, adding <s> and </s> tags and replacing all words not in the dict files with <unk> (see the sketch after this list).
  3. Run w2v training on the preprocessed files, saving the result as GloVe files.
  4. Convert the GloVe files into t7 files using the ONMT tools.
  5. Use the t7 files as pre-trained embeddings.
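
Here is a rough Python sketch of steps 2 and 3, just to make them concrete. It assumes a recent gensim (4.x), a dict file src.dict from step 1 and a training file train.src (both names are made up), and it is not the exact code I used:

```python
# Steps 2-3: preprocess the corpus, then train word2vec on it (sketch only).
from gensim.models import Word2Vec

SPECIALS = {"<s>", "</s>", "<unk>", "<blank>"}

def load_vocab(dict_path):
    """Read the dict file built in step 1 (one word per line, or 'word freq')."""
    with open(dict_path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()} | SPECIALS

def preprocess(line, vocab):
    """Step 2: add sentence markers and map out-of-vocabulary words to <unk>."""
    tokens = line.strip().split()
    return ["<s>"] + [w if w in vocab else "<unk>" for w in tokens] + ["</s>"]

vocab = load_vocab("src.dict")          # hypothetical dict file from step 1
with open("train.src", encoding="utf-8") as f:
    corpus = [preprocess(line, vocab) for line in f]

# Step 3: train word2vec on the preprocessed corpus; <s>, </s> and <unk> are
# now ordinary tokens, so they get their own trained vectors.
model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, workers=4)

# Save in GloVe-style text format (word then its values, no header line),
# ready for the ONMT conversion tool in step 4.
with open("embeddings.glove.txt", "w", encoding="utf-8") as out:
    for word in model.wv.index_to_key:
        values = " ".join(f"{x:.6f}" for x in model.wv[word])
        out.write(f"{word} {values}\n")
```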

:wink:


Great, that’s really helpful.

If the pre-trained word2vec model doesn’t contain the <unk> tag, how can I initialize the vector for unk?

As far as I can tell, you can’t: it will just be assigned a random value. That’s the reason why I added it to the training files before w2v training… to get it, exactly as a word, in the w2v output embeddings.

Yes, but before training, it’s very difficult to choose which words should become UNK.

If I initialize the UNK vector randomly (or as the average of the rare words’ vectors), is that good or not?
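
To make it concrete, I mean something like this rough sketch (assuming a gensim Word2Vec model named model trained as in the procedure above, and an arbitrary 10% cutoff for “rare”):

```python
# Two possible ways to build an <unk> vector when it is not in the embeddings.
import numpy as np

dim = model.vector_size

# Option 1: random initialization
unk_random = np.random.uniform(-0.1, 0.1, size=dim)

# Option 2: average of the rarest words (here, the last 10% of the vocabulary,
# which gensim keeps sorted by descending frequency)
words = model.wv.index_to_key
rare = words[int(0.9 * len(words)):]
unk_average = np.mean([model.wv[w] for w in rare], axis=0)
```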

If you want, for example, a 50000-word vocabulary, you need to choose in advance which words go into the dicts, or the ONMT tools will make that choice for you. Then, all words not in those 50000-word dicts become <unk>.
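
Roughly, the cutoff amounts to something like this (a simplified sketch with a hypothetical file name, not the actual ONMT preprocessing code):

```python
# Keep the 50000 most frequent words; everything else will later map to <unk>.
from collections import Counter

counts = Counter()
with open("train.src", encoding="utf-8") as f:
    for line in f:
        counts.update(line.strip().split())

vocab = {w for w, _ in counts.most_common(50000)}
vocab |= {"<s>", "</s>", "<unk>", "<blank>"}
```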