Teacher forcing

I wanted to implement teacher forcing as in Pascanu et al. (JMLR), so that we feed in the predicted previous label $\hat{y}_t$ instead of the ground-truth label $y_t$. Any plans for adding this feature? I can work on the code if you provide pointers.

Note: I don’t know Lua, but I am willing to try if I get some pointers.

Hello! Lua is very easy to jump into, don’t worry about that part! The idea seems close to distillation as in Sequence-Level Knowledge Distillation (Kim and Rush, EMNLP 2016), doesn’t it? Today we do distillation as a two-step process: first we train a model using the ground truth (the teacher), and then we train a second model (the student) using the teacher’s output; see Neural Machine Translation from Simplified Translations.

Do you have something different in mind? I can help implementing.

If I understand Sequence-Level Knowledge Distillation correctly, the idea there is to 1) generate simplified translations using a “teacher” network and then 2) train another (student) model with the simplified translations as the target sentences.
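
Roughly, that two-step recipe looks like this (a sketch only; train and decode here are hypothetical placeholders, not OpenNMT functions):

    # Hypothetical helpers: train(inputs, targets) -> model, decode(model, src) -> hypothesis
    teacher = train(sources, references)                    # step 1: teacher on ground truth
    simplified = [decode(teacher, src) for src in sources]  # teacher's simplified translations
    student = train(sources, simplified)                    # step 2: student on teacher output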

My use case is different. I want to tag every word in a sentence as color, person, or location. During training I’ll have labels like:
Raw sentence: John painted a blue sky
Labels: John: person, painted: none, a: none, blue: color, sky: none

During prediction I have partially labelled sentences (say, the label for color is available but not for person, etc.).
Input sentence during prediction: Jane's dress is brown
Labels available: Jane's: not available, dress: not available, is: not available, brown: color

During beam search I want to force the label for brown to be color since I know the label for that word.

Note: this kind of partially labeled data arises frequently in practice, at least for tagging tasks. For MT too, if you know part of the translation from human annotators with high confidence and are trying to improve upon it, this feature might be useful.


  1. Is the use case clear enough from my description?
  2. Is this kind of feature planned?
  3. If not can you give me pointers to the right files that should be modified?

OpenNMT’s decoder is already trained by teacher forcing.

So if I understand your request correctly, you want to know whether, at test time, you can replace the predicted label with the one you know is true. Is that right?

To explain more:

  1. In my application (see my earlier post), I know in advance the length of my target sentence.
  2. In beam search at test time, when I want to explore options for position t, the following pseudocode is what I want:
    def beam_search(source, partial_targets):
        beam = [["<s>"]]  # each hypothesis starts with the start token
        for i in range(len(source)):
            if partial_targets[i] is not None:
                # Target known at this position: extend every hypothesis with it.
                # Note: I don't need the cross product here.
                beam = [hyp + [partial_targets[i]] for hyp in beam]
            else:
                # Target unknown: expand with all candidate labels, keep the best k.
                beam = top_k(cross_prod(beam, curr_options))
        return beam

Here, beam is the maintained beam, and partial_targets is a vector of size len(source) holding the known label at position i when it is available, and None otherwise.

The beam search implementation will shortly support filters over the hypotheses so that you can prune inconsistent labels. Here is an example that discards sentences with too many unknowns:
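
A minimal sketch of such a filter in Python (the hook name, signature, and threshold are hypothetical illustrations, not the actual OpenNMT filter API):

    MAX_UNKNOWNS = 2  # hypothetical threshold

    def keep_hypothesis(tokens):
        # Filter hook: discard a hypothesis once it contains too many <unk> tokens.
        return tokens.count("<unk>") <= MAX_UNKNOWNS

    # Applied during beam expansion:
    # beam = [hyp for hyp in beam if keep_hypothesis(hyp)]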

So it is a bit the negation of your problem (discarding instead of forcing), but you could make use of it. However, we currently have no plans to implement the feature as described in the pseudocode.

May I ask why you chose to use a sequence-to-sequence model to do sequence tagging?

In my use case I have to tag paragraphs which might be around 200 words long. I used the sequence-to-sequence model since I can relax the Markovian assumption (the probability of generating $y_t$ is not independent of $y_1$, $y_2$, etc.), and hence I get structured prediction.
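
To make that precise: a seq2seq decoder models the standard autoregressive factorization, so each label may depend on all previous labels and the entire source:

$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \ldots, y_{t-1}, x)$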

Hi Jean,

Thanks for your reply. Could you point me to the code corresponding to the teacher-student model? I haven’t found it myself yet.

Best,
Xiao


How exactly are you using teacher forcing? It’s not clearly mentioned in the documentation.

At training time, the input of the decoder is simply the reference and not the predicted target. What details would you like to know?
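
As a minimal, framework-agnostic sketch in Python (the encoder/decoder API here is a hypothetical stand-in, not OpenNMT’s actual code), teacher forcing means the decoder consumes the reference token at every step, regardless of what it just predicted:

    from math import log

    def train_step(encoder, decoder, source, reference):
        # Teacher forcing: the decoder input at step t is the reference token
        # from step t-1, never the model's own prediction.
        state = encoder(source)
        loss = 0.0
        prev_token = "<s>"
        for t in range(len(reference)):
            probs, state = decoder(prev_token, state)
            loss -= log(probs[reference[t]])  # cross-entropy at step t
            prev_token = reference[t]         # feed the ground truth, not the argmax
        return loss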


By the way, I have implemented the opposite, called “scheduled sampling”, following Bengio et al. (2015). It is available here for testing:

Different modes are available: token or sentence level, and the decay can be linear or inverse sigmoid, as detailed in the paper.
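
For reference, here is a small sketch of those decay schedules in Python (the constants k and c are hyperparameters, and the function names are mine, not the option names in the implementation). At step i, epsilon is the probability of feeding the ground-truth token; otherwise the model’s own prediction is fed back:

    import math
    import random

    def epsilon_linear(i, k=1.0, c=1e-4, floor=0.0):
        # Linear decay: epsilon_i = max(floor, k - c * i)
        return max(floor, k - c * i)

    def epsilon_inverse_sigmoid(i, k=1000.0):
        # Inverse sigmoid decay: epsilon_i = k / (k + exp(i / k)), with k >= 1
        return k / (k + math.exp(i / k))

    def next_decoder_input(reference_token, predicted_token, step):
        # Per-token coin flip: use the reference with probability epsilon_i,
        # otherwise use the model's previous prediction (scheduled sampling).
        eps = epsilon_inverse_sigmoid(step)
        return reference_token if random.random() < eps else predicted_token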

It seems that it does not help for the translation task though, which can be understood. More tests for other tasks are in progress.

Yes, that’s what I wanted to know: whether you’re feeding in the predicted output or the reference during training. I read somewhere that teacher forcing helps give the sentence the correct structure, but often blocks the context of the sentence.