Use GloVe with concatenated word features (OpenNMT-py)

davidstap · December 5, 2018, 6:43am

I want to use GloVe embeddings concatenated with word features. Each token in my .src file has the form w|f1|f2|f3|f4. I want the word w to be represented by its pretrained GloVe vector, with the embeddings of the four features concatenated to it. How do I do this?

When I run tools/embeddings_to_torch.py on a vocabulary built from the w|f1|f2|f3|f4 .src file, the keys look like w|f1|f2|f3|f4 (e.g. winning|VERB|amod|NONE|O), but I would like the keys to be plain words (e.g. winning). Otherwise, using GloVe embeddings makes no sense, since these w|f1|f2|f3|f4 keys will never match the keys of the GloVe word vectors.
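To make the mismatch concrete, here is a minimal sketch in plain Python. It is not OpenNMT-py code: the `glove` dict is a toy stand-in for a real GloVe table, and `split_token` is a hypothetical helper I made up for illustration.

```python
# Toy stand-in for a GloVe table (assumption: real vectors would be
# loaded from a GloVe text file and have far more dimensions).
glove = {"winning": [0.1, 0.2, 0.3]}


def split_token(token, sep="|"):
    """Split 'w|f1|f2|...' into the word and the list of its features."""
    word, *features = token.split(sep)
    return word, features


token = "winning|VERB|amod|NONE|O"
word, features = split_token(token)

print(token in glove)  # False: the full token is not a GloVe key
print(word in glove)   # True: only the bare word matches
print(features)        # ['VERB', 'amod', 'NONE', 'O']
```

So the lookup only succeeds if the vocabulary keys are the bare words, which is what I am after.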

My current approach is as follows: I created a vocabulary from a copy of the .src file that contains only the words w (features stripped, otherwise identical), and generated the GloVe embeddings from that vocabulary. For training, I use the w|f1|f2|f3|f4 .src files together with these word-only GloVe embeddings, and I pass -feat_merge concat and -feat_vec_exponent 0.7. However, I am not sure this has the desired effect (probably not).
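For reference, this is roughly how the word-only copy of the .src file can be produced. It is a sketch under assumptions: `strip_features` and `write_word_only` are hypothetical helper names, tokens are assumed to be whitespace-separated, and the separator is a parameter in case the file uses a different delimiter than `|`.

```python
def strip_features(line, sep="|"):
    """Keep only the word part of every 'w|f1|f2|...' token in a line."""
    return " ".join(tok.split(sep, 1)[0] for tok in line.split())


def write_word_only(src_path, out_path, sep="|"):
    """Write a copy of src_path in which each token is reduced to its word."""
    with open(src_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(strip_features(line, sep) + "\n")


example = "the|DET|det|NONE|O winning|VERB|amod|NONE|O team|NOUN|nsubj|NONE|O"
print(strip_features(example))  # the winning team
```

The resulting file has the same number of lines and tokens as the original, so the word vocabulary built from it should cover exactly the words that appear in the feature-annotated file.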