I think that -embed_file data/GoogleNews-vectors-negative300.bin refers to the file generated by word2vec.
But what does -dict_file data/demo.src.dict mean?
Is that the *.dict file I generated during preprocessing?
I don't fully understand the relation between preprocessing and embeddings.
Any advice about embeddings would be appreciated.
The script will iterate over the embedding file and assign the pretrained vector to each word in the vocabulary. If a word in the vocabulary does not have a corresponding pretrained vector, it is assigned a random vector.
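To make the relation between preprocessing and the embeddings script concrete, here is a rough sketch of the usual workflow for the Lua/Torch version. The -save_data output names below are only illustrative assumptions; check th preprocess.lua -h and th tools/embeddings.lua -h for the exact options and output names in your version.

```
# 1) Preprocessing builds the vocabularies (*.dict) and the training data package
th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
                  -valid_src data/src-val.txt   -valid_tgt data/tgt-val.txt \
                  -save_data data/demo
# -> typically produces data/demo.src.dict, data/demo.tgt.dict, data/demo-train.t7

# 2) The embeddings script matches each entry of that dictionary against the
#    pretrained vectors and serializes the result as a Torch tensor
th tools/embeddings.lua -embed_type word2vec-bin \
                        -embed_file data/GoogleNews-vectors-negative300.bin \
                        -dict_file data/demo.src.dict \
                        -save_data data/demo-src-emb
```

So -dict_file is indeed the *.dict file produced by preprocessing: the embeddings script only keeps the pretrained vectors for words that appear in that dictionary.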
After this process, do I have to set these options when I train my model?
-pre_word_vecs_enc <string> (default: '')
Path to pretrained word embeddings on the encoder side serialized as a Torch tensor.
-pre_word_vecs_dec <string> (default: '')
Path to pretrained word embeddings on the decoder side serialized as a Torch tensor.
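For reference, a typical training invocation with the serialized embeddings might look like the sketch below. The -data / -save_model arguments and the embedding filenames are assumptions following on from the step above; adjust them to whatever your embeddings run actually produced, and make sure the word vector size matches the pretrained dimension (300 for GoogleNews).

```
th train.lua -data data/demo-train.t7 -save_model demo-model \
             -pre_word_vecs_enc data/demo-src-emb-embeddings-300.t7 \
             -pre_word_vecs_dec data/demo-tgt-emb-embeddings-300.t7 \
             -word_vec_size 300
```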
I tried to use word embeddings from GoogleNews-vectors-negative300.bin with -embed_type word2vec-bin and got this result for my English vocabulary:
[08/21/18 15:41:34 INFO] … 3000000 embeddings processed (2157/9004 matched with the dictionary)
[08/21/18 15:41:34 INFO] * 2157/9004 embeddings matched with dictionary tokens
[08/21/18 15:41:36 INFO] * 0 approximate lookup
[08/21/18 15:41:36 INFO] * 6847/9004 vocabs randomly assigned with a normal distribution
And this result for my Russian vocabulary:
[08/21/18 15:44:08 INFO] … 1800000 embeddings processed (8243/9004 matched with the dictionary)
[08/21/18 15:44:16 INFO] * 8243/9004 embeddings matched with dictionary tokens
[08/21/18 15:44:16 INFO] * 0 approximate lookup
[08/21/18 15:44:16 INFO] * 761/9004 vocabs randomly assigned with a normal distribution
After that I trained my en-ru model with the pretrained word embeddings, and the translation results are bad.
For example: car. > Я .
But when I trained another model on the same source data without pretrained word embeddings, the result was fine.
For example: car. > машина.