Attention on a specific word in the context

I agree that it might be overkill, but I need to capture across-tag dependencies better than a linear chain CRF can.
Using OpenNMT, I'm actually getting better results for tagging than the approach in Neural Architectures for Named Entity Recognition.

Does the standard implementation of the language model in OpenNMT specify the length of the sequence to be generated a priori (as an external input)? If not, it will have the same issues as the MT model.
One idea is to specify the sentence length for each data point and, during beam search, prune beams that don't end with EOS at that length. The flip side is that some of these extra-long sentences might actually be more attractive to the cross-entropy loss function, and we might lose them.
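Something like this is what I mean by pruning (a rough Python sketch with assumed structures, not OpenNMT's actual beam search code; `beams`, `EOS`, and `target_len` are all my own names):

```python
# Rough sketch of the length-constrained pruning idea, not OpenNMT's API.
# Assumed structures: `beams` is a list of (tokens, score) hypotheses,
# EOS is the end-of-sentence token id, `target_len` is the known length.

EOS = 2  # assumed token id

def prune_beams(beams, target_len):
    """Keep hypotheses that can still end with EOS at exactly target_len."""
    kept = []
    for tokens, score in beams:
        if len(tokens) < target_len:
            # Still growing: keep it only if it has not emitted EOS too early.
            if EOS not in tokens:
                kept.append((tokens, score))
        elif len(tokens) == target_len and tokens[-1] == EOS:
            # Finished with EOS at exactly the required length: keep.
            kept.append((tokens, score))
        # Everything else (too long, or finished at the wrong length) is pruned.
    return kept
```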

Btw, since this is connected to beam search, do you have any views on http://forum.opennmt.net/t/padding-or-other-changes-from-older-commit/221/3?u=wabbit

Do you simply use raw 0/1 codes as the target? Or are you using an output feature, like the one used to predict case?
Source: This shirt is blue
Target: This|0 shirt|0 is|0 blue|1
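For concreteness, a tiny helper (hypothetical, not part of OpenNMT's tooling) could produce target lines in that word|feature format:

```python
# Hypothetical helper, not part of OpenNMT: writes a target line in the
# word|feature format from tokenized words and their 0/1 tags.

def to_feature_target(words, tags):
    assert len(words) == len(tags)
    return " ".join(f"{w}|{t}" for w, t in zip(words, tags))

print(to_feature_target(["This", "shirt", "is", "blue"], [0, 0, 0, 1]))
# This|0 shirt|0 is|0 blue|1
```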

Possibly, having a real sentence to produce could help ONMT work better, and it could also help you confirm the matches between source and target afterwards.

@Etienne38 - I simply use the raw 0/1 codes as the target.

I do not quite understand how predicting a real sentence might help. If I ask it to predict both the word and the tag, like Blue|1 instead of just 1, does OpenNMT handle it as a multi-task problem with separate RNNs for each task, or does it create labels that are combinations of word and tag, like Blue*1, Blue*0, etc.?

At first glance, it seems that producing a sequence of 0/1 should be much easier, since there are only two options to choose from, compared to |vocab| in the MT case.

Nevertheless, I'll try it as you suggest and see what I get.

I'm really interested in knowing your results. I will certainly run my own experiments on this subject in the near future.

My idea is: RNNs don't like repetitions, and on the decoder side, with only 0/1 codes, you give the RNN loop very poor information with which to decide each value. By adding the words to the output, you provide it with a richer context, without repetitions: the 0/1 repetitions then become simple component values in a richer space.

Even if the words are not well predicted, it could be sufficient to produce good 0/1 values. And if the words are well predicted, even with a wrong number of tokens, you will be able to build a pertinent post-analysis to decide which input words to tag with the 0/1 output.
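For example, that post-analysis could be a simple alignment between the source words and the predicted words, transferring each 0/1 tag across matched tokens. A minimal sketch of what I mean (an assumed approach, not an OpenNMT feature), using Python's standard `difflib`:

```python
# Assumed post-analysis, not an OpenNMT feature: align the predicted words
# back to the source words and transfer the 0/1 tags across matched tokens.

from difflib import SequenceMatcher

def transfer_tags(source_words, predicted_words, predicted_tags):
    tags = [0] * len(source_words)  # default: untagged
    matcher = SequenceMatcher(a=source_words, b=predicted_words)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            tags[block.a + offset] = predicted_tags[block.b + offset]
    return tags

src = ["This", "shirt", "is", "blue"]
pred = ["This", "shirt", "blue"]          # decoder dropped a token
pred_tags = [0, 0, 1]
print(transfer_tags(src, pred, pred_tags))  # [0, 0, 0, 1]
```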

It depends on the OpenNMT implementation: if the predicted word (e.g. This) is fed back into the RNN for predicting the next 0/1, then yes, we are providing a good amount of information through the language model.

As far as I know, there is only one RNN, with only one output vector coding both the words and the optional features.

Currently, target features act like annotations. In practice, the target feature sequences are shifted relative to the word sequences, so at timestep t:

  • the input is: words[t], features[t - 1]
  • the output is: words[t + 1], features[t]

The prediction of the features directly depends on the word they annotate.
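Here is a toy walk-through of that shifting (my own illustration of the description above, not OpenNMT source code):

```python
# Toy illustration of the shifting described above (my reading of it,
# not OpenNMT source code). words[t] is the decoder input word at step t
# and features[t] is the feature annotating words[t].

words    = ["<s>", "This", "shirt", "is", "blue"]
features = ["<pad>", "0", "0", "0", "1"]

for t in range(1, len(words)):
    inp = (words[t], features[t - 1])           # input: words[t], features[t-1]
    out_word = words[t + 1] if t + 1 < len(words) else "</s>"
    out = (out_word, features[t])               # output: words[t+1], features[t]
    print(f"t={t}: input {inp} -> output {out}")
```

The feature emitted at step t annotates the word fed in at that same step, which is why its prediction can depend directly on the word it annotates.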

