First off, this is a great package and I have been able to build lots of interesting projects using it. Many thanks for that!
Secondly, I recently modified parts of the code in my local copy so that `translate.lua` can produce two additional outputs:
1. Embeddings for input sentences. This differs from `tools/extract_embeddings.lua` (which outputs word embeddings): for a given model and input text, we output the final hidden state from the encoder, which the decoder subsequently conditions on. This lets us cluster and analyze sentences/paragraphs using the resulting vector representations.
2. A `-print_n_best` flag that prints the n-best list to STDOUT, not just the 1-best translation. The existing `-n_best` flag makes the decoder keep the n-best list during beam search and prints the n-best outputs to STDERR, but not to STDOUT.
If there are easy fixes for the above two issues, that’s great. Otherwise, I can submit a pull request to add these features. One thing to note: this is my first time writing Lua, so my guess is the request will need a few iterations before being accepted.
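To illustrate how such dumped sentence embeddings might be consumed downstream, here is a minimal Python sketch (an illustration, not part of OpenNMT; it assumes a hypothetical plain-text dump with one whitespace-separated vector per line) that loads the vectors and compares two sentences by cosine similarity:

```python
import math

def load_embeddings(path):
    # One embedding per line, values separated by whitespace (assumed dump format).
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Sentences whose encoder states point in similar directions score near 1.0, which is the basis for the clustering use case above.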
Thank you for your interest in OpenNMT.
For 1., could it be a separate script like `tools/extract_embeddings.lua`? If so, please do send a PR.
For 2., do you mean the n-best outputs are not written to the file? I agree they are somewhat useless when using `translate.lua`; in my opinion they should be written to the file. You can submit a PR for that too.
Sounds good. I’m wondering if we could just modify the `extract_embeddings.lua` script so that, when the `-src` flag is defined, we compute sentence embeddings? If so, I’ll just do that. I’ll need to take a look because I made my changes before the major refactoring of OpenNMT. Otherwise, I’ll try to set this up as a separate script (right now, it’s based on the old version of the code).
Yes, when the n_best flag is on, the n-best outputs do not seem to be written to the file, only the 1-best. I’ll make a change to `translate.lua` for this too.
Thanks for the guidance!
Mmh, maybe we should just rely on the existing `translate.lua` process and add a flag like `-dump_encoder_state <file>`. It should just be a few lines of code, right?
Also, do you serialize in a specific format?
Yup, that’s exactly what I did: I added a flag called `print_input_embeddings`. I serialize in plain text.
Here’s the PR: https://github.com/OpenNMT/OpenNMT/pull/176
I actually have a question about the two things mentioned here.
Is it the case that the script will not run unless the -tgt flag (variable in the configs) is empty? I got the following error when I had -tgt defined together with `-dump_input_encoding true`:
```
…luajit: translate.lua:146: attempt to perform arithmetic on field 'goldScore' (a nil value)
translate.lua:146: in function 'main'
translate.lua:239: in main chunk
[C]: in function 'dofile'
…/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
```
This was no longer a problem when -tgt was empty.
As a follow-up question: can the dumped embeddings, in theory, be used with t-SNE?
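In principle, yes: t-SNE only needs fixed-width vectors, which the dumped encoder states are. A hedged sketch using scikit-learn’s implementation (an assumption; any t-SNE implementation would do, and random vectors stand in here for the dumped embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for dumped sentence embeddings: 20 sentences, 500-dimensional states.
rng = np.random.RandomState(0)
embeddings = rng.randn(20, 500)

# Project to 2-D for plotting; perplexity must stay below the number of points.
coords = TSNE(n_components=2, perplexity=5.0, init="random",
              random_state=0).fit_transform(embeddings)
print(coords.shape)
```

The resulting 2-D coordinates can then be scatter-plotted to eyeball sentence clusters.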
I just came across the paper “Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models” (Vijayakumar et al., 2016), and I was wondering whether the current beam search function works similarly. I plan to read the Wiseman et al. (2016) paper (on which, I think, the current OpenNMT beam search is based) to compare. From the looks of the output when using -n_best 5, some of the options I get do have different beginnings.
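For reference, the core idea in Vijayakumar et al. (2016) is to split the beams into groups and penalize later groups for reusing tokens that earlier groups already picked at the same step (Hamming diversity). A toy single-step sketch in Python (hypothetical scores and a hypothetical helper, not the OpenNMT implementation):

```python
def diverse_step(group_scores, lam):
    # group_scores: one list of log-probabilities over the vocabulary per group.
    # Each group greedily picks a token after subtracting lam times the number
    # of earlier groups that already picked that token at this step.
    counts = [0] * len(group_scores[0])
    picks = []
    for scores in group_scores:
        penalized = [s - lam * counts[tok] for tok, s in enumerate(scores)]
        best = max(range(len(penalized)), key=lambda t: penalized[t])
        picks.append(best)
        counts[best] += 1
    return picks

# With diversity (lam=0.5) the second group is pushed off the shared best token:
print(diverse_step([[0.0, -0.1, -2.0], [0.0, -0.1, -2.0]], 0.5))  # [0, 1]
# With lam=0 this reduces to ordinary per-group greedy choice:
print(diverse_step([[0.0, -0.1, -2.0], [0.0, -0.1, -2.0]], 0.0))  # [0, 0]
```

Standard beam search (as in Wiseman et al., 2016) has no such cross-group penalty, which is why its n-best lists often share long prefixes.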
Special thanks to @asaluja for the features!