Teacher forcing

Wabbit · January 14, 2017, 4:24am

I wanted to implement teacher forcing as in Pascanu et al. (JMLR), so that we feed in the predicted previous label $\hat{y}_{t}$ instead of the ground-truth label $y_{t}$. Are there any plans to add this feature? I can work on the code if you provide pointers.

Note: I don’t know Lua, but I am willing to try if I get some pointers.

jean.senellart · January 14, 2017, 7:26am

Hello! Lua is very easy to pick up, so don’t worry about that part! The idea seems close to distillation as in Sequence-Level Knowledge Distillation (Kim and Rush, EMNLP 2016), doesn’t it? Today we do distillation as a two-step process: first we train a model on the ground truth (the teacher), and then we train a second model (the student) on the teacher’s output. See Neural Machine Translation from Simplified Translations.

Do you have something different in mind? I can help with the implementation.

Wabbit · January 15, 2017, 12:45pm

If I understand Sequence-Level Knowledge Distillation correctly, the idea there is to 1) generate simplified translations using a “teacher” network and then 2) train another (student) model with the simplified translations as the target sentences.

My use case is different. I want to tag each word in a sentence as color, person, or location. During training I’ll have labels like:
Raw sentence: John painted a blue sky
Labels: John: person, painted: none, a: none, blue: color, sky: none

During prediction I have partially labelled sentences (say, the label for color is available but not the one for person, etc.):
Input sentence during prediction: Jane's dress is brown
Labels available: Jane's: not available, dress: not available, is: not available, brown: color

During beam search I want to force the label for brown to be color since I know the label for that word.

Note: this kind of partially labelled data arises frequently in practice, at least for tagging tasks. It might be useful for MT too, if you know part of the translation from human annotators with high confidence and are trying to improve upon it.

Is the use case clear enough from my description?
Is this kind of feature planned?
If not, can you give me pointers to the files that would need to be modified?

guillaumekln · January 15, 2017, 7:28pm

OpenNMT’s decoder is already trained by teacher forcing.

So, if I understand your request correctly, you want to be able, at test time, to replace the predicted label with the one you know is true. Is that right?

Wabbit · January 19, 2017, 3:54pm

To explain more:

In my application (see my earlier post) I know the length of my target sentence in advance.
In beam search at test time, when I explore the options for position t, the following pseudocode is what I want:

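func beam_search(source, partial_targets):
    beam = [<s>]
    for i in range(len(source)):
        if partial_targets(i) == True:
            # Target at position i is known: append it to every hypothesis.
            # No cross product is needed here.
            beam = concat(beam, partial_targets(i))
        else:
            # Target unknown: expand as usual and keep the k best.
            beam = top_k(cross_prod(beam, curr_options))
    return beam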

Here beam is the maintained beam, and partial_targets is a bit vector of size len(source), with partial_targets(i)=True for the positions whose targets are known.
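The pseudocode above can be sketched as runnable Python. This is only a minimal illustration, not OpenNMT code: `constrained_beam_search`, `known_labels`, `score_fn`, and `toy_score_fn` are hypothetical names, and the toy scorer stands in for a real decoder step. It assumes `known_labels[i]` holds the given label, or `None` when position i is unlabelled.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A hypothesis is (labels so far, cumulative log-probability).
Hypothesis = Tuple[List[str], float]


def constrained_beam_search(
    source: Sequence[str],
    known_labels: Sequence[Optional[str]],
    score_fn: Callable[[Sequence[str], List[str], int], Dict[str, float]],
    beam_size: int = 3,
) -> List[Hypothesis]:
    """Beam search that forces the label wherever it is already known."""
    beam: List[Hypothesis] = [([], 0.0)]
    for i in range(len(source)):
        candidates: List[Hypothesis] = []
        for labels, score in beam:
            log_probs = score_fn(source, labels, i)
            forced = known_labels[i]
            if forced is not None:
                # Known position: extend with the given label only (no cross product).
                candidates.append((labels + [forced], score + log_probs[forced]))
            else:
                # Unknown position: extend with every label the scorer proposes.
                for label, log_prob in log_probs.items():
                    candidates.append((labels + [label], score + log_prob))
        # Keep the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = candidates[:beam_size]
    return beam


def toy_score_fn(source: Sequence[str], labels: List[str], i: int) -> Dict[str, float]:
    """Hypothetical stand-in for one decoder step: capitalised words look like persons."""
    probs = {"none": 0.8, "person": 0.1, "color": 0.1}
    if source[i][0].isupper():
        probs = {"none": 0.3, "person": 0.6, "color": 0.1}
    return {label: math.log(p) for label, p in probs.items()}


source = ["Jane's", "dress", "is", "brown"]
known_labels = [None, None, None, "color"]
results = constrained_beam_search(source, known_labels, toy_score_fn, beam_size=3)
for labels, score in results:
    print(labels, round(score, 3))
```

Every returned hypothesis ends in `color`, even though the toy scorer on its own would prefer `none` for "brown". The forced label still contributes its log-probability, so hypotheses remain ranked by the model's score for the whole sequence.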

guillaumekln · January 19, 2017, 5:22pm

The beam search implementation will shortly support filters over the hypotheses so that you can prune inconsistent labels. Here is an example that discards sentences with too many unknowns:

https://github.com/da03/OpenNMT/blob/patch-beamsearch/onmt/translate/Translator.lua#L186-L201

      attn[j] = attn[j]:narrow(1, batch.sourceLength - size + 1, size)
    end


    table.insert(hypBatch, tokens)
    if #feats > 0 then
      table.insert(featsBatch, feats)
    end
    table.insert(attnBatch, attn)
    table.insert(scoresBatch, score)
  end


  table.insert(allHyp, hypBatch)
  table.insert(allFeats, featsBatch)
  table.insert(allAttn, attnBatch)
  table.insert(allScores, scoresBatch)
end

So this is roughly the inverse of your problem (discarding hypotheses instead of forcing labels), but you could make use of it. However, we currently have no plans to implement the feature as described in your pseudocode.
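To illustrate how a filter could approximate forcing, here is a small Python sketch that prunes hypotheses disagreeing with the known labels. It is not the OpenNMT filter interface: `is_consistent` and `prune_hypotheses` are hypothetical helpers, and it assumes hypotheses are plain lists of labels aligned with the source words.

```python
from typing import List, Optional, Sequence


def is_consistent(hypothesis: Sequence[str], known_labels: Sequence[Optional[str]]) -> bool:
    """True if the (possibly partial) hypothesis agrees with every known label it covers."""
    return all(
        known is None or predicted == known
        for predicted, known in zip(hypothesis, known_labels)
    )


def prune_hypotheses(
    hypotheses: Sequence[Sequence[str]], known_labels: Sequence[Optional[str]]
) -> List[Sequence[str]]:
    """Discard every hypothesis that contradicts a known label."""
    return [hyp for hyp in hypotheses if is_consistent(hyp, known_labels)]


known = [None, None, None, "color"]
hypotheses = [
    ["person", "none", "none", "color"],
    ["person", "none", "none", "none"],  # contradicts the known label for "brown"
    ["none", "none"],                    # partial hypothesis, nothing to contradict yet
]
kept = prune_hypotheses(hypotheses, known)
print(kept)
```

One caveat with pure filtering: if no hypothesis on the beam carries the known label at that position, the beam ends up empty, which is why forcing the label directly is the safer option for this use case.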

May I ask why you chose to use a sequence-to-sequence model to do sequence tagging?

Wabbit · January 22, 2017, 5:54pm

In my use case I have to tag paragraphs that may be around 200 words long. I used a sequence-to-sequence model because it lets me relax the Markov assumption (the probability of generating $y_t$ can depend on all of $y_1$, $y_2$, etc., not only on the previous label), and hence I get structured prediction.
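To make the contrast concrete, the two factorizations can be written as follows. This is a sketch with $x$ the input words and $T$ the number of labels; the first-order chain is just one common example of a Markovian tagger and is an assumption here, not something specified above.

```latex
% Sequence-to-sequence decoder: each label conditions on the full label history.
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)

% First-order Markov tagger: each label conditions only on the previous label.
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}, x)
```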

blackyang · February 24, 2017, 9:05pm

Hi Jean,

Thanks for your reply. Could you point me to the code corresponding to the teacher-student model? I haven’t been able to find it myself yet.

Best,
Xiao

Shivali · July 20, 2017, 7:15am

How exactly are you using teacher forcing? It is not clearly described in the documentation.

guillaumekln · July 20, 2017, 8:15am

At training time, the input of the decoder is simply the reference and not the predicted target. What details would you like to know?
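As a rough illustration of what that means, here is a minimal Python sketch of how decoder inputs and targets line up under teacher forcing. The function name `teacher_forcing_pairs` and the `<s>` / `</s>` markers are assumptions for the example, not OpenNMT internals.

```python
from typing import List, Sequence, Tuple


def teacher_forcing_pairs(
    reference: Sequence[str], bos: str = "<s>", eos: str = "</s>"
) -> List[Tuple[str, str]]:
    """Pair each decoder input with the token the decoder should predict.

    Under teacher forcing the input at step t is the reference token from
    step t-1 (or the start marker), never the model's own prediction.
    """
    decoder_inputs = [bos] + list(reference)
    decoder_targets = list(reference) + [eos]
    return list(zip(decoder_inputs, decoder_targets))


reference = ["person", "none", "none", "color", "none"]
pairs = teacher_forcing_pairs(reference)
for decoder_input, target in pairs:
    print(f"input={decoder_input!r:10} target={target!r}")
```

At test time no reference is available, so the input at each step is the model's previous prediction instead.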

jean.senellart · July 20, 2017, 2:31pm

By the way, I have implemented the opposite, called “scheduled sampling”, following Bengio et al., 2015. It is available here for testing:

Different modes are available: sampling at the token level or the sentence level, with a linear or inverse sigmoid decay, as detailed in the paper.

Scheduled sampling does not seem to help for the translation task, though, which is understandable. More tests on other tasks are in progress.
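For reference, the two decay schedules from the paper can be sketched in Python as below. The function names, parameter names, and default values are illustrative assumptions and do not correspond to the options of this implementation; the returned value is the probability of feeding the reference token rather than the model's prediction.

```python
import math
import random


def linear_decay(step: int, k: float = 1.0, c: float = 1e-4, floor: float = 0.0) -> float:
    """Linear schedule: max(floor, k - c * step)."""
    return max(floor, k - c * step)


def inverse_sigmoid_decay(step: int, k: float = 1000.0) -> float:
    """Inverse sigmoid schedule: k / (k + exp(step / k)), with k >= 1."""
    # Clamp the exponent so very large step counts do not overflow math.exp.
    return k / (k + math.exp(min(step / k, 700.0)))


def choose_decoder_input(
    reference_token: str, predicted_token: str, teacher_prob: float, rng: random.Random
) -> str:
    """Token-level sampling: keep the reference with probability teacher_prob."""
    return reference_token if rng.random() < teacher_prob else predicted_token


rng = random.Random(0)
for step in (0, 2000, 5000, 10000):
    prob = inverse_sigmoid_decay(step)
    token = choose_decoder_input("color", "none", prob, rng)
    print(step, round(prob, 4), token)
```

Early in training the probability is close to 1, which is essentially teacher forcing; as it decays, the decoder sees more of its own predictions.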

Shivali · July 25, 2017, 7:31pm

Yes, that’s what I wanted to know: whether you feed in both the predicted and the reference targets during training. I read somewhere that teacher forcing helps give the sentence a correct structure but often blocks the context of the sentence.