Attention on a specific word in the context

Hi Wabbit - thanks. Give me one day and I will come back to you!
What I would like is to make sure we change as little as possible in the flow. (Interestingly, when we worked on modularizing OpenNMT, the internal example we took for testing the code organization was about changing the attention module - so this is a good time to practice on that!)

@jean.senellart - I'm not sure what the priority for this is in the team's plan, so I was thinking I could implement it (maybe by breaking the modularity in my own branch) and get it code-reviewed by someone more familiar with the setup. I need to move fast on it for one of my projects.

If that sounds OK, please let me know whether the changes in my earlier post are fine. Also, I tried using mobdebug to understand the flow, but breakpoints in GlobalAttention.lua are never triggered. What debugging setup do you folks use?

Hi @Wabbit, I committed a draft implementation here: https://github.com/jsenellart-systran/OpenNMT/tree/HardAttention - so that you can move ahead.

What I did is the standard way: we add an additional tensor to the inputs of the decoder, which then reaches the attention module. For the moment this tensor is initialized to 1 at position t and 0 otherwise (note that its dimensions are batch_size x source_length, and the timestep t iterates over the target sentence, so it might go beyond the source sentence length).
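For illustration, here is a rough sketch of how such a one-hot tensor could be built for a single decoder timestep (the function name is made up for this example; it is not the exact code in the branch):

  -- Illustrative sketch: fixed attention weights for decoder timestep t.
  -- Each row is a one-hot vector over source positions; since t can exceed
  -- the source length, we clamp it here (the actual handling may differ).
  local function buildFixedAttention(batchSize, sourceLength, t)
    local attn = torch.zeros(batchSize, sourceLength)
    local pos = math.min(t, sourceLength)
    attn[{{}, pos}] = 1
    return attn
  end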

It is not cleaned up - I just modified the class GlobalAttention, whereas you should create a new class HardAttention. Also, we need to condition the selection of GlobalAttention or HardAttention when building the decoder, as well as the passing of this additional input.
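Something along these lines, for example (the option name is hypothetical and the constructor arguments are only indicative):

  -- Hypothetical sketch of selecting the attention module when building the decoder.
  local attentionLayer
  if opt.fixed_attention then
    attentionLayer = onmt.HardAttention(opt.rnn_size)
  else
    attentionLayer = onmt.GlobalAttention(opt.rnn_size)
  end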

We would like to find a less intrusive implementation - but that is on our side! On yours, you can go ahead with this for the moment; it should do the work.

Let me know if you have any questions!

(Let's not call this hard attention in the final version. Maybe IdentityAttention or InOrderAttention.)

I agree - I was thinking about FixedAttention, since it would allow injecting soft alignments from an external alignment tool.


@jean.senellart Thanks for the FixedAttention branch!

@jean.senellart: I'm having problems generating translations on the master branch after I merged in https://github.com/jsenellart-systran/OpenNMT/tree/HardAttention

I also tried running translate using the code in https://github.com/jsenellart-systran/OpenNMT/tree/HardAttention and even that has the same problem.

In your code you have removed softmaxAttn, since HardAttention is just a vector with a single 1 and 0s elsewhere. So I had to modify DecoderAdvancer.lua:80 as

  local softmaxOut = _

Also, the HardAttention code needs the position t, so I had to pass it in DecoderAdvancer.lua:78:

decOut, decStates = self.decoder:forwardOne(inputs, decStates, context, decOut, t)

I'm still trying to understand the entire flow, but could you please confirm whether these changes make sense? Let me know if you want to see the files on GitHub so that you can diff.

For adding the t parameter you are right; for the softmaxOut we need to do something a little bit different: we need to get the fixedAttn instead so that the beam search goes fine. I am on a plane for the next 11 hours but will try to commit a patch when I land.

fixedAttention has dimensions batchL x sourceL.
What if we pass fixedAttention through a softmax like:



  local softmaxAttn = nn.SoftMax()
  softmaxAttn.name = 'softmaxAttn'
  attn = softmaxAttn(fixedAttention) -- batchL x sourceL
  attn = nn.Replicate(1,2)(attn)     -- batchL x 1 x sourceL

Then we can use this softmaxAttn in DecoderAdvancer.lua:80 as

local softmaxOut = self.decoder.softmaxAttn.output

The only issue is that exp(0)=1 and exp(1)=2.71 are not very different, so ideally we might want a very large negative value (the most negative float Torch allows) instead of 0, so that after the softmax the attention goes to almost 0 at the right places.
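To illustrate the concern, compare the softmax of a raw 0/1 vector with one that uses a large negative value for the non-attended positions (a quick check in the Torch REPL):

  require 'nn'

  -- softmax of a raw 0/1 vector spreads the mass almost evenly:
  print(nn.SoftMax():forward(torch.Tensor{1, 0, 0}))
  -- ~ 0.58  0.21  0.21

  -- with a very negative value instead of 0, the mass concentrates where we want:
  print(nn.SoftMax():forward(torch.Tensor{0, -1e9, -1e9}))
  -- ~ 1.00  0.00  0.00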

Hi @Wabbit - I put a patch on the branch; I am just getting the fixed attention vector. It seems very hard to test, though, because for a regular translation task we cannot really expect any good result. Do you have a specific use case for that?
Note also that the target sentence has a <BOS> token - so if you count on aligning source and target manually, you need to take this into account.

My task is (potentially) simpler. It's about tagging each word in the source to say whether it represents a particular kind of entity - essentially Named Entity Recognition.

An example source-target pair would be:
Source: This shirt is blue
Target: 0,0,0,1

This assumes that we are tagging for color. The motivation for fixed attention here is that there's a 1:1 alignment between source and target (unlike the case of MT). Fixed attention is a way of pushing more information into the decoder (beyond what's captured by the final state of the encoder RNN).

Thanks for the explanation. I think you should explore something simpler than seq2seq here - even with fixed attention, you won't be able to control the target sentence length, and the encoder-decoder-attention approach is overkill.
Basically what you need is a sequence tagger - so you only need an encoding layer + softmax, just like the language model implementation, except that the target is your NER tags. I am copying @josep.crego, who is starting to work on exactly that on our side.
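For what it's worth, a minimal sketch of that "encoding layer + softmax" idea (this is not OpenNMT code; it assumes the Element-Research rnn package, and all sizes are made up):

  require 'nn'
  require 'rnn'

  local vocabSize, embSize, hidSize, numTags = 10000, 500, 500, 2

  -- a plain sequence tagger: embed, encode with an LSTM, predict one tag per token
  local tagger = nn.Sequential()
    :add(nn.LookupTable(vocabSize, embSize)) -- seqLen x batch -> seqLen x batch x embSize
    :add(nn.SeqLSTM(embSize, hidSize))       -- recurrent encoding of the source
    :add(nn.View(-1, hidSize))               -- flatten to (seqLen*batch) x hidSize
    :add(nn.Linear(hidSize, numTags))        -- per-token tag scores
    :add(nn.LogSoftMax())

  local criterion = nn.ClassNLLCriterion()   -- targets: one tag index per token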

I agree that it might be overkill, but I need to capture cross-tag dependencies better than a linear-chain CRF can.
Using OpenNMT I'm actually getting better tagging results than Neural Architectures for Named Entity Recognition.

Does the standard implementation of the language model in OpenNMT specify the length of the sequence to be generated a priori (as an external input)? If not, it will have the same issues as the MT model.
One idea is to specify the length of the sentence for each data point and, during beam search, prune beams which don't end with EOS at the sentence length. The flip side is that some of these extra-long sentences might actually be more attractive to the cross-entropy loss function and we might lose them.
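A rough sketch of what that pruning could look like inside the beam-search step (purely illustrative - scores, eosIdx and targetLen are assumed names, not actual OpenNMT variables):

  -- scores: beamSize x vocabSize log-probabilities at target timestep t
  if t < targetLen then
    scores[{{}, eosIdx}] = -math.huge  -- forbid EOS before the desired length
  elseif t == targetLen then
    scores:fill(-math.huge)
    scores[{{}, eosIdx}] = 0           -- force EOS exactly at the desired length
  end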

By the way, since this is connected to beam search, do you have any views on http://forum.opennmt.net/t/padding-or-other-changes-from-older-commit/221/3?u=wabbit

Do you simply use raw 0/1 codes as the target? Or are you using an output feature, like the one used to predict the case?
Source: This shirt is blue
Target: This|0 shirt|0 is|0 blue|1

Having a real sentence to produce could possibly help ONMT work, and could help you confirm afterwards the matches between source and target.

@Etienne38 - I simply use the raw 0/1 codes as the target.

I do not quite understand how predicting a real sentence might help. If I ask it to predict both the word and the tag, like Blue|1 instead of 1, does OpenNMT handle it as a multi-task problem with separate RNNs for each task, or does it create labels which are combinations of word and tag, like Blue*1, Blue*0, etc.?

At first glance it seems that the task of generating a sequence of 0/1 should be much easier, since there are only 2 options to choose from, compared to |vocab| in the MT case.

Nevertheless I’ll try as you say and see what I get.

I'm really interested in knowing your results. I will certainly run my own experiments on such a subject in the near future.

My idea is: an RNN doesn't like repetitions, and, on the decoder side, with only 0/1 codes, you give it very poor information in the RNN loop to decide on a given value. By adding the words to the output, you provide it with a richer context, without repetitions: the 0/1 repetitions then become simple component values in a richer space.

Even if the words are not well predicted, it could be sufficient to produce good 0/1 values. If the words are well predicted, even with a wrong number of tokens, you will be able to build a pertinent post-analysis to decide which input words to tag with the 0/1 output.

It depends on the OpenNMT implementation - if the predicted word (e.g. "This") is fed back into the RNN for predicting the next 0/1, then yes, we are providing a good amount of information through the language model.

As far as I know, there is only one RNN, with only one output vector coding both the words and the optional features.

Currently, target features act like annotations. In practice the target feature sequences are shifted relative to the word sequences, so at timestep t:

  • the input is: words[t], features[t - 1]
  • the output is: words[t + 1], features[t]

The prediction of the features directly depends on the word they annotate.
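Illustrating the shift with the target above, "This|0 shirt|0 is|0 blue|1" (special tokens and padding left out, purely to show the offset):

  t   input: words[t], features[t-1]   output: words[t+1], features[t]
  1   This,  (start feature)           shirt, 0  (feature of "This")
  2   shirt, 0                         is,    0  (feature of "shirt")
  3   is,    0                         blue,  0  (feature of "is")
  4   blue,  0                         (end), 1  (feature of "blue")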


A post was split to a new topic: Remove target features shifting