Sentence embeddings and n-best lists

Hi all,

First off, this is a great package and I have been able to build lots of interesting projects using it. Many thanks for that!

Secondly, I recently modified parts of the code in my local copy so that translate.lua can produce two additional outputs:

  1. Embeddings for input sentences. This differs from tools/extract_embeddings.lua (which outputs word embeddings): for a given model and input text, we output the final hidden state of the encoder, which the decoder is subsequently conditioned on. This allows us to cluster and analyze sentences/paragraphs using the produced vector representations.

  2. I added a flag -print_n_best that prints to STDOUT the n-best list and not just the 1-best translation. The n_best flag seems to make the decoder consider the n-best list during beam search and prints out the n-best outputs to STDERR, but not to STDOUT.

If there are easy fixes to the above two issues then that’s great. Otherwise, I can submit a pull request to add these features? One thing to note is that this is my first time writing Lua, and my guess is that the request will have to go through a few iterations before being accepted :)



Thank you for your interest in OpenNMT.

For 1., could it be a separate script like tools/extract_embeddings.lua? If so please do send a PR.

For 2., do you mean the n_best are not written to the file? I agree they are somewhat useless when using translate.lua. They should be written to the file in my opinion. You can submit a PR for that too.

Hi Guillaume,

  1. Sounds good - I’m wondering if we could just modify the extract_embeddings.lua script such that, when the -src flag is defined, we compute sentence embeddings? If so, I’ll just do that. I’ll need to take a look because I made my changes before the major refactoring of OpenNMT. Otherwise, I’ll try and set this up as a separate script (right now, it’s based on the old version of translate.lua).

  2. Yes, when the n_best flag is on, it seems that only the 1-best output is written to the file, not the n-best outputs. I’ll make a change to translate.lua for this too.

Thanks for the guidance!

Mmh, maybe we should just rely on the existing translate.lua process and add a flag like -dump_encoder_state <file>. It should just be a few lines of code, right?

Also, do you serialize in a specific format?

Yup that’s exactly what I did - I added a flag called print_input_embeddings. I serialize in plain text.
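For anyone who wants to work with the dumped vectors outside of Lua, here is a minimal Python sketch of reading such a plain-text dump and comparing sentences. It assumes one sentence vector per line with whitespace-separated floats; that layout is my assumption for illustration and may not match what the PR actually writes. The function names (parse_embeddings, cosine_similarity, nearest_neighbor) are hypothetical helpers, not part of OpenNMT.

```python
import math


def parse_embeddings(text):
    """Parse one vector per non-empty line of whitespace-separated floats.

    The line-per-sentence layout is an assumption, not the confirmed dump format.
    """
    vectors = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            vectors.append([float(x) for x in line.split()])
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor(vectors, index):
    """Return the index of the vector most similar to vectors[index], excluding itself."""
    best, best_sim = None, -2.0
    for j, v in enumerate(vectors):
        if j == index:
            continue
        sim = cosine_similarity(vectors[index], v)
        if sim > best_sim:
            best, best_sim = j, sim
    return best
```

Reading the dump file into a string and passing it to parse_embeddings gives a list of vectors that can then be clustered or handed to whatever analysis tool you prefer.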

Here’s the PR:

I actually have a question on the two things mentioned here. :)

  1. -dump_input_encoding
    Is it the case that the script will not run unless the -tgt flag (a variable in the configs) is empty? I got the following error when I had tgt defined and “-dump_input_encoding true”:

…luajit: translate.lua:146: attempt to perform arithmetic on field 'goldScore' (a nil value)
stack traceback:
translate.lua:146: in function 'main'
translate.lua:239: in main chunk
[C]: in function 'dofile'
…/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

This was no longer a problem when -tgt was empty.

As a follow-up question: can the dumped embeddings, in theory, be used with t-SNE?

  2. -n_best
    I just came across the paper “Diverse beam search: Decoding diverse solutions from neural sequence models” (Vijayakumar et al. 2016), and I was wondering if the current beam search function works similarly. I plan to read the Wiseman et al. 2016 paper (on which I think the current OpenNMT beam search is based) to compare. From the looks of the output when using “-n_best 5”, some of the options I get do have different beginnings.
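To make the comparison concrete, here is a minimal Python sketch of plain beam search that returns an n-best list. It is not OpenNMT's implementation; next_log_probs is a hypothetical model interface (prefix in, token log-probabilities out) and toy_model is a made-up distribution used only for illustration. As I understand the paper, diverse beam search additionally splits the beam into groups and penalizes similarity between them; that part is not shown here.

```python
import math


def beam_search(next_log_probs, beam_size, n_best, max_len, eos="</s>"):
    """Plain beam search returning up to n_best finished (score, tokens) pairs, best first.

    next_log_probs(prefix) -> {token: log_prob} is a hypothetical model interface.
    """
    beam = [(0.0, ())]  # (cumulative log-probability, token tuple)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, prefix in beam:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + (token,)))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beam.append((score, tokens))
        if not beam:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:n_best]


def toy_model(prefix):
    """Made-up distribution: forces end-of-sentence after two tokens."""
    if len(prefix) >= 2:
        return {"</s>": 0.0}
    return {"a": math.log(0.6), "b": math.log(0.3), "</s>": math.log(0.1)}
```

With a beam size of 1 this reduces to greedy decoding and can only ever finish one hypothesis, so a useful n-best list needs a beam at least as wide as n.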

Special thanks to @asaluja for the features!