OpenNMT is a really cool project.
I’d like to use pre-trained word2vec embeddings for this model, but I’ve run into a problem.
In a source or target sentence used for training, some words are not contained in the pre-trained word2vec model, so they are mapped to UNK. How can I initialize the vector for UNK?
I did it here:
Since the beginning of this year I’ve been discovering NMT through hands-on practice with OpenNMT. I was quickly puzzled by the strange behaviour of the training. The scenario is this one:
- in the first 2 or 3 epochs, I get a very interesting model, already producing quite good draft translations. It makes me hope for a wonderful evolution in the next epochs... but,
- for several epochs after that, the model improves only very slowly. It makes me think it will take a lot of time to handle sentences in detail and produce good trans…
I used this procedure:
1. build my own dict files from the data, using my own criterion. Don’t forget to add the <s> </s> <unk> <blank> tags in the dict files. For this step, you can also use the ONMT tools.
2. pre-process the training files, adding the <s> </s> tags, and replacing all words not in the dict files by <unk>.
3. run w2v training on the preprocessed files, saving the result as GloVe files.
4. convert the GloVe files to t7 files using the ONMT tools.
5. use the t7 files as pre-trained embeddings.
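The pre-processing in step 2 (adding boundary tags and mapping out-of-dict words to <unk> so that w2v later trains a real vector for the tag) can be sketched in a few lines. The function name and toy data here are mine, purely for illustration:

```python
def preprocess(tokens, vocab):
    """Add sentence boundary tags and replace every word that is not
    in the dict files by <unk>, so word2vec sees <unk> as an ordinary
    token and learns an embedding for it."""
    return ["<s>"] + [t if t in vocab else "<unk>" for t in tokens] + ["</s>"]

vocab = {"the", "cat", "sat"}                      # words kept in the dict files
print(preprocess(["the", "dog", "sat"], vocab))
# → ['<s>', 'the', '<unk>', 'sat', '</s>']
```

After this pass, w2v training on the preprocessed corpus treats `<s>`, `</s>` and `<unk>` exactly like words, which is the whole point of the trick.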
Great, that’s really helpful.
If the pre-trained word2vec doesn’t contain the <unk> tag, how can I initialize the vector for unk?
For me, you can’t: it will be assigned a random value. That’s the reason why I added it to the training files before the w2v training — so that it appears, exactly like a word, in the w2v output embeddings.
Yes, but before training, it’s very difficult to choose which words will become UNK.
If I initialize the UNK vector randomly (or as the average of the rare words’ vectors), is that a good approach or not?
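The "average of the rare words" idea can be sketched as follows. The embedding table, corpus counts, and rarity cutoff below are all made up for illustration:

```python
import numpy as np

# Toy embedding table and corpus frequency counts (hypothetical values).
emb = {
    "the": np.array([0.5, 0.1]),
    "cat": np.array([0.2, 0.9]),
    "zyx": np.array([0.8, 0.3]),
}
counts = {"the": 1000, "cat": 40, "zyx": 2}

RARE_CUTOFF = 5  # "rare" is whatever frequency threshold you choose

# Average the vectors of rare words to get an UNK initialization;
# fall back to small random values if no word qualifies.
rare_vecs = [emb[w] for w, c in counts.items() if c <= RARE_CUTOFF]
if rare_vecs:
    unk_vec = np.mean(rare_vecs, axis=0)
else:
    unk_vec = np.random.default_rng(0).normal(scale=0.1, size=2)
```

The intuition behind the averaging variant is that, at inference time, UNK stands in for words that were too rare to keep, so an average of rare-word vectors is a more representative starting point than pure noise.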
If you want, for example, 50000 words, you need to make a prior choice of the words you put in the dicts, or the ONMT tools will make this choice for you. Then, all words not in these 50000-word dicts become <unk>.
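Picking the top-N words by frequency, which is what the vocabulary-building step amounts to, can be sketched like this (the function name and toy data are mine; in practice N would be something like 50000):

```python
from collections import Counter

def build_vocab(corpus_tokens, size):
    """Keep the `size` most frequent words; everything else will later
    be replaced by <unk> during pre-processing."""
    counts = Counter(corpus_tokens)
    return {w for w, _ in counts.most_common(size)}

tokens = ["a", "a", "b", "b", "b", "c"]
print(sorted(build_vocab(tokens, 2)))
# → ['a', 'b']
```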