In the PyTorch version of OpenNMT, the last target token is excluded from the inputs to the decoder. What is the reason for doing so?
At every iteration i during decoding, let the decoder cell dec[i] receive an input inp[i] and produce an output out[i]. I earlier suspected that tgt[-1] is not needed because we feed tgt[i-1] as inp[i] at every iteration i: dec[0] gets some initial input inp[0], dec[1] receives tgt[0] as inp[1], and so on.
Assuming the last iteration is t, the last decoder cell dec[t] would then receive tgt[t-1] as inp[t], so the last target tgt[t] would never be needed. But it turns out I was wrong.
This code suggests that inp[i] is derived from tgt[i], not tgt[i-1]. If that is the case, why do we drop the last target?
So it turns out that tgt[0] is the start-of-sequence tag <s>, which is passed to dec[0]. tgt[-1] is either the end-of-sequence tag </s> or the padding tag <blank>. In either case it does not need to be fed into the decoder again.
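A minimal sketch of that slicing, using hypothetical token sequences (the real decoder consumes padded index tensors, not string lists, but the shift is the same):

```python
# Full target sequence, framed with OpenNMT's <s> and </s> tags.
tgt = ["<s>", "the", "cat", "sat", "</s>"]

# The decoder input drops the last token: it never needs to consume </s>.
dec_input = tgt[:-1]
assert dec_input == ["<s>", "the", "cat", "sat"]

# A shorter sequence in the same batch is padded with <blank>, so the
# dropped token is padding instead; it is equally useless as an input.
tgt_padded = ["<s>", "hi", "</s>", "<blank>", "<blank>"]
dec_input_padded = tgt_padded[:-1]
assert dec_input_padded == ["<s>", "hi", "</s>", "<blank>"]
```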
Aside: while computing the loss, the outputs should be compared against tgt[1:]. This is indeed the case.
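The alignment between decoder inputs and loss targets can be sketched as follows (hypothetical tokens again; the key point is that out[i] is the prediction for tgt[i+1]):

```python
tgt = ["<s>", "the", "cat", "sat", "</s>"]

dec_input = tgt[:-1]  # what the decoder consumes at each step
gold = tgt[1:]        # what each output is scored against in the loss

# Each input token is paired with the next token as its expected output.
pairs = list(zip(dec_input, gold))
assert pairs == [
    ("<s>", "the"),
    ("the", "cat"),
    ("cat", "sat"),
    ("sat", "</s>"),
]
```

Both slices have the same length, which is why the decoder outputs and the gold targets line up one-to-one in the loss computation.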