Torch serialized pre-trained word embeddings

mahapy · February 12, 2017, 12:24pm

The feature page for pre-trained word embeddings, http://opennmt.net/Advanced/#pre-trained-embeddings, says that a pre-trained word embedding passed to the -pre_word_vecs_enc option must be a manually constructed, Torch-serialized matrix corresponding to the src and tgt dictionary files.

Any hints on how to do this for someone who is not that familiar with Torch?

Thanks

senisioi · February 12, 2017, 2:30pm

Hi, you might want to check this pull request, https://github.com/OpenNMT/OpenNMT/pull/54/, which implements a script to convert word2vec or GloVe embeddings to the Torch t7 format.
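For anyone unfamiliar with what such a conversion involves, here is a rough Python sketch of the general idea only, not the script from the pull request. It builds a matrix with one row per dictionary entry, copies in the pre-trained vector when the word has one, and leaves a small random vector otherwise. The function name, the sample dictionary, and the init range are all made up for illustration, and writing the result out as a Torch t7 file is not shown.

```python
import random

def build_embedding_matrix(dict_words, pretrained, dim, seed=0):
    """Return (matrix, found): one row per dictionary word, in dictionary order.

    dict_words -- words in the same order as the dictionary file (hypothetical input)
    pretrained -- mapping word -> list of `dim` floats (e.g. parsed from a GloVe text file)
    """
    rng = random.Random(seed)
    matrix = []
    found = 0
    for word in dict_words:
        vec = pretrained.get(word)
        if vec is not None and len(vec) == dim:
            # Word has a pre-trained vector: copy it into this row.
            matrix.append(list(vec))
            found += 1
        else:
            # No pre-trained vector: small random init (range is an assumption).
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, found

# Toy data, purely illustrative.
dict_words = ["<blank>", "<unk>", "<s>", "</s>", "the", "cat"]
pretrained = {"the": [1.0, 1.0, 1.0, 1.0], "dog": [0.0, 0.0, 0.0, 0.0]}
matrix, found = build_embedding_matrix(dict_words, pretrained, dim=4)
print(len(matrix), len(matrix[0]), found)
```

The point to notice is that the row count always equals the dictionary size, whether or not a word has a pre-trained vector, and that "dog" is ignored because it is not in the dictionary.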

mahapy · February 12, 2017, 9:13pm

Thank you. The embedding_convert.lua script did the job when given the dictionary extracted from the training set. However, I got an ‘inconsistent tensor size’ error at training time when the dictionary also contained words that do not appear in the training set.

srush · February 13, 2017, 3:24pm

@senisioi is that PR ready to merge? Have you run into any issues with it?

senisioi · February 13, 2017, 8:19pm

@mahapy are you sure the errors appear when words are not found? I would expect them to appear when the size of the pre-trained embedding is inconsistent with the size parameters set in the model.
@srush I made some changes on top of it and didn’t use it as is. It should be alright apart from some minor adjustments.
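
To make the size check concrete, a small sketch (in Python rather than Lua, just to illustrate the logic) that compares the shape of an embedding matrix against what the model expects. The names vocab_size and word_vec_size are placeholders for the dictionary size and the embedding dimension configured for the model, and the numbers in the example are invented.

```python
def check_embedding_shape(n_rows, n_cols, vocab_size, word_vec_size):
    """Return a list of human-readable mismatches (empty if the shapes agree)."""
    problems = []
    if n_rows != vocab_size:
        # Row count must match the dictionary used for training.
        problems.append(
            "rows: embedding has %d, dictionary has %d" % (n_rows, vocab_size)
        )
    if n_cols != word_vec_size:
        # Column count must match the embedding dimension set for the model.
        problems.append(
            "cols: embedding has %d, model expects %d" % (n_cols, word_vec_size)
        )
    return problems

# Invented numbers: the dictionary matches, the vector dimension does not.
issues = check_embedding_shape(n_rows=50004, n_cols=300,
                               vocab_size=50004, word_vec_size=500)
for line in issues:
    print(line)
```

Running a check like this before training would show whether the mismatch is in the number of rows (dictionary) or the number of columns (vector dimension).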