Question about the implementation of decoding

In the decoding loop, only the last predicted token is fed into the decoder at each decoding step. This may work for an RNN-based decoder, but in the Transformer, as far as I know, the decoder needs to attend to all previous target-side tokens during decoding. Which part of the code handles this? Please tell me if I have misunderstood anything. Thanks in advance!

Previous timesteps are cached in the TransformerDecoder instance. There are 2 modes:

  • cache all previous inputs and concatenate them before running the self-attention layer
  • cache all previously projected keys and values and concatenate them before running the dot-product attention

The second mode was recently added and is of course faster. See:
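To illustrate the difference between the two modes, here is a minimal NumPy sketch (not the library's actual code; the function names `step_mode1`/`step_mode2` and the single-head setup are hypothetical). Mode 1 keeps the raw inputs and re-projects keys/values over the whole history at every step, while mode 2 projects only the new token and appends to cached keys/values:

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention of the current query over all cached positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def step_mode1(cache_inputs, x):
    # Mode 1: cache raw layer inputs; re-project keys/values every step.
    cache_inputs = x if cache_inputs is None else np.vstack([cache_inputs, x])
    k, v = cache_inputs @ Wk, cache_inputs @ Wv
    return cache_inputs, attend(x @ Wq, k, v)

def step_mode2(cache_kv, x):
    # Mode 2: cache projected keys/values; only project the new token.
    k_new, v_new = x @ Wk, x @ Wv
    if cache_kv is None:
        k, v = k_new, v_new
    else:
        k = np.vstack([cache_kv[0], k_new])
        v = np.vstack([cache_kv[1], v_new])
    return (k, v), attend(x @ Wq, k, v)

# Both modes produce the same attention output; mode 2 is faster because
# it never re-projects old timesteps.
c1 = c2 = None
for _ in range(4):
    x = rng.standard_normal((1, d))
    c1, out1 = step_mode1(c1, x)
    c2, out2 = step_mode2(c2, x)
assert np.allclose(out1, out2)
```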


Thanks for your reply, but I still have a question about the first mode. Could you please point me to the code that caches the previous inputs? I could not find it. As far as I know, calling TransformerDecoder() once corresponds to one decoding step, but I could not find where any value is passed to cache.

Read the cached inputs of each layer:

Write each layer's input into the cache:
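The read/write pattern above can be sketched in a few lines of Python (this is a hypothetical illustration, not the library's actual code; the `decode_step` function, dict-based cache layout, and toy layers are all assumptions). Before each layer runs, its cached inputs are read and the new timestep is appended, so self-attention sees the full history:

```python
def decode_step(layers, x, cache):
    """One decoding step; `cache` maps layer index -> list of past inputs."""
    for i, layer in enumerate(layers):
        past = cache.setdefault(i, [])  # read this layer's cached inputs
        past.append(x)                  # write the new input into the cache
        x = layer(past)                 # the layer attends over all timesteps
    return x

# Toy "layers": each returns the mean of every input seen so far.
layers = [lambda inputs: sum(inputs) / len(inputs)] * 2
cache = {}
out1 = decode_step(layers, 1.0, cache)  # first decoding step
out2 = decode_step(layers, 3.0, cache)  # second step reuses the cache
```

After two steps, `cache` holds two timesteps of inputs per layer, which is what lets each call to the decoder process only the newest token.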

1 Like