Sorry for the late response, these days have been very busy.
Indeed, the decoder only uses the context vector from the encoder as information from the source sentence (besides the encoder hidden state at the beginning of decoding, which initializes the decoder state).
The encoder is called by the Translator.translate_batch function:
enc_states, memory_bank = self.model.encoder(src, src_lengths)
So, if I am not mistaken, the context vector is in the memory_bank variable.
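As a rough, self-contained sketch of what those two outputs look like (toy sizes, with a plain nn.LSTM standing in for the actual OpenNMT-py encoder):

```python
import torch
import torch.nn as nn

# Toy shapes, for illustration only.
src_len, batch, emb, hidden = 4, 1, 16, 8
embedded_src = torch.randn(src_len, batch, emb)

# A plain LSTM returns the two outputs in the opposite order
# from the OpenNMT-py encoder, but they play the same roles.
encoder = nn.LSTM(emb, hidden)
memory_bank, enc_states = encoder(embedded_src)

# memory_bank: one vector per source token -> the "memory bank for attention"
print(memory_bank.shape)    # torch.Size([4, 1, 8])
# enc_states: (h_n, c_n), used to initialize the decoder state
print(enc_states[0].shape)  # torch.Size([1, 1, 8])
```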
Afterwards, the beam search process is performed: the decoder is called beam_size * src_sentence_tokens times, and the generator computes a vector of word scores for each batch x beam entry. So, for each beam of the search, we have a vector with the probability of each word in the target vocabulary being the next translated word.
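For illustration, here is a minimal sketch of one scoring step (hypothetical sizes; the generator is essentially a linear layer followed by a log-softmax over the target vocabulary):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, hidden_size = 10000, 512
batch_x_beam = 2 * 5  # batch of 2 sentences, beam size 5

# The "generator" maps decoder hidden states to log-probabilities
# over the target vocabulary (one score per candidate next word).
generator = nn.Sequential(
    nn.Linear(hidden_size, vocab_size),
    nn.LogSoftmax(dim=-1),
)

# Pretend this is the decoder output for one step of beam search.
dec_out = torch.randn(batch_x_beam, hidden_size)

log_probs = generator(dec_out)                     # (batch*beam, vocab_size)
topk_scores, topk_ids = log_probs.topk(5, dim=-1)  # best 5 next words per beam
print(topk_ids.shape)                              # torch.Size([10, 5])
```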
Here is where “the magic” happens: when the beam search is done, the probabilities are combined to produce the most probable target sequence.
Recall that the decoder is producing a target sequence rather than translating the source sentence word by word.
So, for your example “John eats ice cream”, we will see something like the following when looking at the beam tokens and their probabilities (assume we are running a 5-beam search):
So, from the beam search graph you can see how the most probable sequence is "Juan come helado ." followed by "Juan comió helado ." and so on. The decoder has learned to produce “helado” as the most probable word at the 3rd step when processing “John eats ice cream” in the source sentence and after producing “Juan come” in the previous 2 steps.
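If you want to see how the per-step probabilities are combined, here is a toy, self-contained beam search over made-up log-probabilities (the numbers are invented just to mimic the example above):

```python
import math

# Toy next-word distributions, invented for illustration: given the
# prefix produced so far, log-probabilities of candidate next tokens.
step_log_probs = {
    (): {"Juan": math.log(0.8), "El": math.log(0.2)},
    ("Juan",): {"come": math.log(0.6), "comió": math.log(0.4)},
    ("Juan", "come"): {"helado": math.log(0.7), "hielo": math.log(0.3)},
    ("Juan", "comió"): {"helado": math.log(0.7), "hielo": math.log(0.3)},
    ("Juan", "come", "helado"): {".": math.log(0.9), "</s>": math.log(0.1)},
    ("Juan", "comió", "helado"): {".": math.log(0.9), "</s>": math.log(0.1)},
}

def beam_search(beam_size=5, max_len=4):
    # Each hypothesis is (tokens, cumulative log-prob).
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word, lp in step_log_probs.get(tokens, {}).items():
                candidates.append((tokens + (word,), score + lp))
        if not candidates:
            break
        # Keep the beam_size best partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

for tokens, score in beam_search():
    print(" ".join(tokens), f"(log-prob {score:.2f})")
# Prints "Juan come helado ." first, then "Juan comió helado .", and so on.
```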
- The Encoder is called by the Translator before calling the Decoder, and the “memory_bank” variables store the context vector.
- The context vector is created by the Encoder; it is actually the second output of Encoder.forward. It is called the “memory bank for attention” (the memory_bank variables) in the PyTorch code.
- The context vector is a summary of the source sentence, and the attention vectors give you the relationship between each produced word and the source words (see the sketch after this list).
- The Decoder generates a sequence of target words, taking into account the representation of the sequence of source words, inside a beam search process.
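To tie these points together, here is a minimal sketch of how the decoder can attend over the memory bank at each step (dot-product attention with toy dimensions; a hypothetical stand-in, not the exact OpenNMT-py attention code):

```python
import torch
import torch.nn.functional as F

src_len, batch, hidden = 4, 1, 8  # e.g. "John eats ice cream"

# memory_bank: one encoder state per source token (the "context").
memory_bank = torch.randn(src_len, batch, hidden)
# dec_state: the decoder hidden state at the current target step.
dec_state = torch.randn(batch, hidden)

# Dot-product attention: how relevant is each source token right now?
scores = torch.einsum("sbh,bh->sb", memory_bank, dec_state)  # (src_len, batch)
align = F.softmax(scores, dim=0)                             # attention weights
context = (align.unsqueeze(-1) * memory_bank).sum(dim=0)     # (batch, hidden)
print(align.squeeze())  # the per-source-word attention vector
```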
Hope that helps!