I’d like to use pre-trained word2vec embeddings for this model, but I’ve run into a problem.
Some words in the source or target training sentences are not contained in the pre-trained word2vec model, so they will be mapped to UNK. How can I initialize the vector for UNK?
1. Build my own dict files from the data, with my own criterion. Don’t forget to add the <s>, </s>, <unk> and <blank> tags to the dict files. For this step, you can also use the ONMT tools.
2. Pre-process the training files, adding <s> and </s> tags and replacing all words not in the dict files with <unk>.
3. Train word2vec on the preprocessed files, saving the result as GloVe-format files (a sketch of steps 2 and 3 follows below).
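Here is a minimal sketch of steps 2 and 3, assuming a plain-text corpus (`train.txt`), a one-token-per-line dict file (`dict.txt`), and gensim for the word2vec training; the file names and hyper-parameters are illustrative, not the ones ONMT expects.

```python
# Sketch of steps 2-3: preprocess a corpus against a dict file, train word2vec,
# and dump the vectors in GloVe text format (word followed by its floats, no header).
from gensim.models import Word2Vec

# Load the dict file (one token per line); it already contains <s>, </s>, <unk>, <blank>.
with open("dict.txt", encoding="utf-8") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

# Step 2: wrap each sentence with <s> ... </s> and map out-of-dict words to <unk>.
sentences = []
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        tokens = ["<s>"] + [w if w in vocab else "<unk>" for w in line.split()] + ["</s>"]
        sentences.append(tokens)

# Step 3: train word2vec on the preprocessed sentences.
# (gensim >= 4 uses `vector_size`; older versions call the parameter `size`.)
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=4)

# Save in GloVe text format: one line per word, "word v1 v2 ... vN", no header line.
with open("embeddings.glove.txt", "w", encoding="utf-8") as out:
    for word in model.wv.index_to_key:
        vec = " ".join(f"{x:.6f}" for x in model.wv[word])
        out.write(f"{word} {vec}\n")
```

This way <unk> occurs in the training text like any other token, so word2vec learns a real vector for it.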
For me, you can’t: it will be assigned a random value. That’s the reason why I added <unk> to the training files before the word2vec training, so that it appears, exactly like a regular word, in the word2vec output embeddings.
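To illustrate why this matters, here is a hedged sketch of what typically happens when an embedding matrix is built from pretrained vectors: any token missing from the embedding file (including <unk>, if it was never seen during word2vec training) simply gets a random vector. The `load_glove` helper, the toy vocabulary, and the init range are assumptions for illustration, not ONMT’s actual loading code.

```python
import numpy as np

def load_glove(path):
    """Read a GloVe-format text file into a dict of word -> numpy vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

pretrained = load_glove("embeddings.glove.txt")      # illustrative path
dim = len(next(iter(pretrained.values())))

model_vocab = ["<s>", "</s>", "<unk>", "<blank>", "the", "cat"]  # toy vocab
embedding_matrix = np.empty((len(model_vocab), dim), dtype=np.float32)
for i, word in enumerate(model_vocab):
    if word in pretrained:
        # Learned vector, including <unk> if it was present in the w2v training text.
        embedding_matrix[i] = pretrained[word]
    else:
        # Otherwise the token falls back to a random initialization.
        embedding_matrix[i] = np.random.uniform(-0.1, 0.1, dim)
```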
If you want, for example, 50,000 words, you need to make a prior choice about which words you put in the dicts, or the ONMT tools will make this choice for you. All words not in this 50,000-word dict then become <unk>.
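A hedged sketch of that prior choice: keep the 50,000 most frequent words of the training data as the dict, so everything else falls back to <unk>. The file names and the exact cut-off are placeholders.

```python
from collections import Counter

VOCAB_SIZE = 50000  # illustrative cut-off

# Count word frequencies over the raw training corpus.
counts = Counter()
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# Keep the special tags plus the top-VOCAB_SIZE words; everything else becomes <unk>.
dict_words = ["<s>", "</s>", "<unk>", "<blank>"]
dict_words += [w for w, _ in counts.most_common(VOCAB_SIZE)]

with open("dict.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(dict_words) + "\n")
```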