Torch serialized pre-trained word embeddings

The feature page for pre-trained word embeddings (http://opennmt.net/Advanced/#pre-trained-embeddings) specifies that, to pass pre-trained word embeddings to the -pre_word_vecs_enc option, they should be manually constructed Torch-serialized matrices corresponding to the src and tgt dictionary files.

Any hints on how to do this for someone who is not that familiar with Torch?

Thanks

Hi, you might want to check the pull request https://github.com/OpenNMT/OpenNMT/pull/54/, which implements a script to convert word2vec or GloVe embeddings to the Torch t7 format.
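For reference, here is a rough sketch of what such a conversion does: build a |vocab| × dim matrix ordered by the OpenNMT dictionary, copy in the pre-trained vectors where available, and serialize it with torch.save. The file names and the 300-dimension choice are placeholders; the script in the PR is more complete.

```lua
-- Minimal sketch: convert text-format embeddings (e.g. GloVe) to a t7
-- tensor ordered by an OpenNMT dictionary. File names are hypothetical.
require('torch')

local dim = 300

-- OpenNMT .dict files have one "word id" pair per line.
local vocab, size = {}, 0
for line in io.lines('src.dict') do
  local word = line:match('^(%S+)')
  size = size + 1
  vocab[word] = size
end

-- Random init so dictionary words missing from the pre-trained file
-- still get a vector.
local weights = torch.FloatTensor(size, dim):uniform(-0.1, 0.1)

for line in io.lines('glove.6B.300d.txt') do
  local i, row = 0, nil
  for tok in line:gmatch('%S+') do
    i = i + 1
    if i == 1 then
      row = vocab[tok]
      if not row then break end   -- word not in the dictionary, skip line
    else
      weights[row][i - 1] = tonumber(tok)
    end
  end
end

torch.save('src-embeddings-300.t7', weights)
```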


Thank you. The embedding_convert.lua script did the job when given the dictionary extracted from the training set. However, I got an 'inconsistent tensor size' error at training time when the dictionary also contains words that do not appear in the training set.

@senisioi is that PR ready to merge? Have you run into any issues with it?

@mahapy are you sure the errors appear when words are not found? I would expect those to appear when the size of the pre-trained embeddings is inconsistent with the size parameters set in the model.
@srush I made some changes on top of it, but I didn't use it as is. It should be alright except for some minor adjustments.
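For anyone debugging the 'inconsistent tensor size' error, a quick sanity check (hypothetical file names, minimal sketch) is to compare the serialized tensor's dimensions against the dictionary size and the -word_vec_size used for the model:

```lua
-- Quick sanity check: the tensor's first dimension must equal the
-- dictionary size, and the second must match the model's word vector size.
require('torch')

local emb = torch.load('src-embeddings-300.t7')

local dictSize = 0
for _ in io.lines('src.dict') do
  dictSize = dictSize + 1
end

print(string.format('embeddings: %d x %d, dictionary: %d entries',
                    emb:size(1), emb:size(2), dictSize))
assert(emb:size(1) == dictSize, 'vocabulary / tensor size mismatch')
```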