Error in embeddings_to_torch.py? (PyTorch)

I am trying to use pretrained embeddings (fastText embeddings for French), following the steps suggested in the FAQ page. However, the training step crashes because of an inconsistent number of features. When I investigated the problem, I found that the encoder/decoder vocabulary dimensions are swapped (original encoder dimension: 50004, original decoder dimension: 50002, embeddings encoder dimension: 50002). So I checked the file tools/embeddings_to_torch.py, line 24:

enc_vocab, dec_vocab = vocabs[0][1], vocabs[-1][1]

When I changed this line to enc_vocab, dec_vocab = vocabs[-1][1], vocabs[0][1], everything works fine and the dimensions are assigned correctly. The matching percentage of the encoder's vocabulary also became higher, which was expected since I am using the pre-trained embeddings of the source (encoder) language.
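
For reference, this is how I checked the order of the loaded vocabularies (a minimal check; the path to the vocab file is just an example from my setup):

import torch

vocabs = torch.load("data/demo.vocab.pt")
for name, vocab in vocabs:
    print(name, len(vocab))

On my data this prints the sizes in the opposite order to what line 24 assumes.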

Do you think that’s a bug, or am I missing something here?

Thank you !

I guess the order in which vocabularies are stored is somewhat arbitrary, so we shouldn’t rely on the index.
I just sent a PR (https://github.com/OpenNMT/OpenNMT-py/pull/576) that matches the vocabulary by its name instead.
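
Something along these lines (a sketch only, the actual code in the PR may differ; it assumes the entries in the vocab file are named 'src' and 'tgt', as produced by preprocess.py):

import torch

def get_vocabs(dict_file):
    vocabs = torch.load(dict_file)
    # pick each vocabulary by its name instead of relying on the list order
    enc_vocab = next(vocab for name, vocab in vocabs if name == 'src')
    dec_vocab = next(vocab for name, vocab in vocabs if name == 'tgt')
    return enc_vocab, dec_vocab

enc_vocab, dec_vocab = get_vocabs("data/demo.vocab.pt")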
