Sentence Embeddings for English


I am trying to use OpenNMT-py to create sentence embeddings for the image captions of the MSCOCO 2015 dataset. To do so, I am following the tutorial on using pretrained GloVe word embeddings, with the idea of “translating” sentences back to themselves.

First of all, does this make sense or is there a better approach to create sentence embeddings like this?
Are there any suggestions on which models and hyper-parameters to use?


Hi @CtrlAltV!

What do you mean by using GloVe word embeddings to “translate” sentences back to themselves?

As I see it, you can get sentence embeddings by:

  1. summing/averaging word embeddings from the words in the sentence
  2. using an encoder: for each sentence, the encoder produces a hidden vector h that summarizes the sentence, as explained here: Sentence embeddings and n-best lists
  3. training a sentence embedding model, like the one proposed by Mikolov here, and using that sentence representation.
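Option 1 above can be sketched in a few lines. The tiny 3-dimensional vectors below are only stand-ins for real GloVe vectors, which you would normally load from one of the published `glove.*.txt` files:

```python
# Sketch of option 1: a sentence embedding as the average of its word
# embeddings. The toy 3-d vectors stand in for real GloVe vectors.
toy_glove = {
    "a":    [0.1, 0.0, 0.2],
    "dog":  [0.5, 0.3, 0.1],
    "runs": [0.2, 0.4, 0.6],
}

def sentence_embedding(sentence, embeddings):
    """Average the embeddings of the in-vocabulary words of a sentence."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return None  # no known word: no representation
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

emb = sentence_embedding("a dog runs", toy_glove)
```

This ignores word order entirely, which is exactly the weakness options 2 and 3 address.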

If I wanted a good representation of the image captions, I would train a strong language model on a large, representative English data set and then extract the sentence representations from that model.
Remember that a language model is “just” an encoder trained on monolingual data to predict the next word of a sentence.
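To make the “predict the following word” objective concrete, here is a minimal count-based bigram predictor. It is only an illustration of the training signal: a neural language model optimizes the same objective, and it is that model's hidden state after reading a sentence that would serve as the sentence representation.

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if not counts[word]:
        return None
    return counts[word].most_common(1)[0][0]

# Toy monolingual "training data"
lm = train_bigram_lm(["a dog runs fast", "a dog barks", "a cat runs fast"])
```

A recurrent or Transformer LM replaces the counts with learned parameters, but the supervision is the same: the next word of the sentence.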

Hope that can help.


Hello @emartinezVic,

Thanks for your reply! By “translating back to themselves,” I meant using the Encoder/Decoder architectures provided by OpenNMT-py in an autoencoder fashion, to obtain a compressed latent vector as the representation of each sentence.
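In practice, the autoencoder setup only changes the data preparation: the target side of the training corpus is a copy of the source side, so the model learns to reconstruct each caption from its latent vector. A minimal sketch of that step (file names are illustrative, not prescribed by OpenNMT-py):

```python
# Build "parallel" training files for an autoencoder: the target is
# simply a copy of the source. Captions and file names are illustrative.
captions = ["a dog runs on the beach", "two people ride bicycles"]

with open("train.src", "w") as src, open("train.tgt", "w") as tgt:
    for caption in captions:
        src.write(caption + "\n")
        tgt.write(caption + "\n")
```

The usual OpenNMT-py preprocessing and training commands can then be pointed at these files like any other translation corpus.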

Ultimately, I want to pre-train two models and then reuse them in a different architecture: a Sentence-to-Vector encoder and a Vector-to-Text decoder.

I will read the resources you provided.

Thank you very much,
