In translator.py, only the last predicted token is fed to the decoder at each decoding step. This may work for an RNN-based decoder, but in the Transformer, as far as I know, the decoder needs to attend to all the previous target-side tokens during decoding. Which part of the code handles this? Please tell me if I have misunderstood something. Thanks in advance!
Previous timesteps are cached in the TransformerDecoder instance. There are 2 modes:
- cache all the previous inputs and concatenate them before running the self-attention layer
- cache all the previous projected keys and values and concatenate them before running the dot-product attention
The second mode was recently added and is of course faster. See:
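For illustration, here is a simplified sketch of how the two caching modes differ inside a self-attention layer. This is not the actual OpenNMT-py code; the class, argument names, and shapes are made up for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CachedSelfAttention(nn.Module):
    """Toy single-head self-attention with the two caching strategies."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x_step, cache, mode="kv"):
        # x_step: (batch, 1, dim) -- only the newly predicted token's embedding.
        if mode == "inputs":
            # Mode 1: cache the raw layer inputs and concatenate them, so the
            # key/value projections are recomputed over the full prefix each step.
            cache["inputs"] = x_step if cache.get("inputs") is None \
                else torch.cat([cache["inputs"], x_step], dim=1)
            k = self.k_proj(cache["inputs"])
            v = self.v_proj(cache["inputs"])
        else:
            # Mode 2: cache the already projected keys/values and only project
            # the new step, avoiding recomputation of past projections (faster).
            k_new, v_new = self.k_proj(x_step), self.v_proj(x_step)
            cache["k"] = k_new if cache.get("k") is None \
                else torch.cat([cache["k"], k_new], dim=1)
            cache["v"] = v_new if cache.get("v") is None \
                else torch.cat([cache["v"], v_new], dim=1)
            k, v = cache["k"], cache["v"]

        q = self.q_proj(x_step)
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        # (batch, 1, dim): the new step attends to the whole cached prefix.
        return attn @ v
```

Either way, even though only the last token enters the decoder at each step, the self-attention still sees the full target prefix through the cache.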
Thanks for your reply. But I still have a question about the first mode. Could you please point me to the code that caches the previous inputs? I could not find it. As far as I know, one call to TransformerDecoder() equals one decoding step, but in translator.py I could not find any value being passed to cache.
Read the cached inputs of each layer:
Write each layer's input to the cache:
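For illustration, a simplified sketch of what that per-layer read/write looks like in one decoding step. The cache key names and the layer call signature are hypothetical, not the actual OpenNMT-py source:

```python
import torch

def decoder_step(layers, tgt_emb_step, cache):
    x = tgt_emb_step  # (batch, 1, dim): embedding of the last predicted token only
    for i, layer in enumerate(layers):
        layer_cache = cache.setdefault(f"layer_{i}", {"prev_inputs": None})
        # Read: concatenate this layer's cached inputs with the new step, so
        # self-attention can attend to the whole target prefix.
        if layer_cache["prev_inputs"] is None:
            all_inputs = x
        else:
            all_inputs = torch.cat([layer_cache["prev_inputs"], x], dim=1)
        # Write: store the concatenated inputs back for the next step.
        layer_cache["prev_inputs"] = all_inputs
        x = layer(query=x, inputs=all_inputs)  # hypothetical layer signature
    return x, cache
```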