Multiple tokens in Source to single token in Target


(Duane Dougal) #1

I’m using OpenNMT-py and it is working very well on EN->ES. Thanks for making this code available!

I have a question: I’m trying to find and understand in the code where multiple tokens in the source (EN) map to either a single token or multiple tokens in the target (ES). For example, if I have the sentence, “John eats ice cream” and a translation in Spanish, “Juan come helado,” where in the code does it map “ice cream” to “helado” (a multiple token concept in the source to a single token concept in the target)?

I understand the NMT concepts about the encoder creating a semantic context vector and the decoder producing a translation from it. I can see, in the _build_target_tokens function, where tokens are selected from the target vocabulary and combined to create a translation. However, I want to find in the code where “ice cream” gets distilled into the semantic context vector as a single semantic entity (even though it is represented by two tokens in EN). I can also see, in the translate_batch function, where the encoder is run on the source. Where is the context vector generated, and how do I read it?

(Eva) #2

Hi Duane!

This is an interesting question :slight_smile:
This source-target mapping is summarized in the attention vectors.
I’m not very familiar with the layout of the Python code, but I think I can give you some hints to find the information you want.

The context vector is generated by the encoder, in fact, it is the output of the encoder after processing the source sentence. This context vector encloses the information of the entire source sentence, and it is given to the decoder for the decoding process.

The decoder uses this context vector from the source sentence, in combination with the previously generated word (and sometimes its previous state), to generate the next translated word.
In particular, this context vector is fed to the attention module (integrated into the decoder), which is in charge of computing the probability of the current translated word being a translation of each one of the source sentence words.
If you look at the attention weights after translating your sentence “John eats ice cream” you should see that the probability that relates helado and ice is very similar to the one for helado and cream.
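To make that concrete, here is a minimal, self-contained sketch of how softmax attention produces such weights. The raw scores are made-up numbers (not from a real model), chosen so that "ice" and "cream" dominate the distribution, as described above:

```python
import math

src_tokens = ["John", "eats", "ice", "cream"]

# Hypothetical raw attention scores for the decoder step that produces
# "helado" -- illustrative values only, not taken from a trained model.
scores = {"John": 0.2, "eats": 0.3, "ice": 2.1, "cream": 2.0}

# Softmax turns the raw scores into a probability distribution over the
# source tokens: these are the attention weights for this decoding step.
exps = {tok: math.exp(s) for tok, s in scores.items()}
total = sum(exps.values())
attn = {tok: e / total for tok, e in exps.items()}

for tok in src_tokens:
    print(f"{tok:6s} {attn[tok]:.3f}")
# "ice" and "cream" each receive a similar, dominant share of the weight,
# while "John" and "eats" get very little.
```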

Looking at the code, you can see in the translate function how the attention vectors (attn) are obtained from the decoder, and how they are then used for the ‘copy’ mechanism during translation (that is, when an unknown word is generated, copy the source word it comes from). You can also find a Debug attention section there with an example of how to access the attention information.

Hope this can help.
Good luck!

(Terence Lewis) #3

Excellent explanation for me too, Eva :slight_smile:

(Duane Dougal) #4

Thank you very much, Eva. This is helpful. :blush:

However, I have an additional question. When you say that the “probability that relates helado and ice is very similar to the one for helado and cream”, why wouldn’t that produce “Juan come helado helado”? NMT isn’t really comparing the human-readable words. There’s still the issue of a potential mismatch in the number of tokens between source and target that I’m trying to understand and locate in the code. I need to find where the source words “ice” and “cream” are combined into a single semantic object that equates to the target “helado.” I presume that happens in the encoder.

As you’ve described, NMT in general works like this:

src_sentence > encoder > context_vector > decoder > tgt_sentence

Although I can see in the code where the encoder is called, I am having trouble finding the resulting context vector (or which variable or object contains it; nothing in the code seems to be called “context vector”). Lots of data are produced by the encoder. Isn’t it true that the decoder looks at the context vector and not the source sentence? So by the time the context vector is produced, the number-of-tokens issue appears to have already been resolved, and the context vector contains semantic objects rather than individual (human) words. I need to be able to watch the creation of the context vector. Can you tell me where specifically in the code that happens?

(Eva) #5

Hi Duane,
sorry for the late response, but these days have been very busy.

Indeed, the decoder only uses the context_vector from the encoder as information from the source sentence (besides the encoder hidden state, used at the beginning of decoding to initialize the decoder state).

The encoder is called by the Translator.translate_batch function:

enc_states, memory_bank = self.model.encoder(src, src_lengths)

so, the context vector, if I am not mistaken, is in the memory_bank variable.

Afterwards, the beam search process is performed there. The decoder is called beam_size * src_sentence_length times, and the generator computes, for each beam, a vector with the probability of each word in the target vocabulary being the next translated word.

Here is where “the magic” happens. As the beam search proceeds, the probabilities are combined to produce the most probable target sequence.
Recall that the decoder is producing a target sequence rather than translating the source sentence word by word.
So, for your example “John eats ice cream”, we would see something like the following when looking at the beam tokens and their probabilities, assuming a beam search of size 5:

Juan (0.60)    John (0.50)     él (0.10)      el (0.05)      chico (0.10)
come (0.50)    comes (0.45)    comió (0.49)   comer (0.40)   comerá (0.25)
helado (0.53)  helados (0.50)  polo (0.48)    crema (0.40)   hielo (0.30)
. (0.8)        . (0.8)         . (0.8)        helada (0.10)  crema (0.10)
EOS (0.8)      EOS (0.8)       EOS (0.8)      . (0.8)        . (0.8)

So, from the beam search graph you can see that the most probable sequence is "Juan come helado ." followed by "Juan comió helado ." and so on. The decoder has learned to produce “helado” as the third word of the most probable sequence when processing “John eats ice cream” in the source sentence and after producing “Juan come” in the previous two steps.
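The beam search described above can be sketched in a few lines of pure Python. Note one big simplification for illustration: the per-step probabilities here are fixed tables loosely following the example, whereas a real decoder recomputes the distribution at every step conditioned on the words generated so far:

```python
import heapq
import math

# Toy per-step candidate probabilities, loosely following the table above.
# In a real decoder, each step's distribution depends on the previously
# generated words; here the steps are independent, purely for illustration.
steps = [
    {"Juan": 0.60, "John": 0.50, "él": 0.10, "el": 0.05, "chico": 0.10},
    {"come": 0.50, "comes": 0.45, "comió": 0.49, "comer": 0.40},
    {"helado": 0.53, "helados": 0.50, "polo": 0.48, "crema": 0.40},
    {".": 0.80, "helada": 0.10},
    {"EOS": 0.80},
]

def beam_search(steps, beam_size=5):
    """Keep the beam_size highest-scoring partial sequences at each step."""
    beams = [(0.0, [])]  # (cumulative log-probability, tokens so far)
    for step in steps:
        candidates = [
            (logp + math.log(p), toks + [tok])
            for logp, toks in beams
            for tok, p in step.items()
        ]
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
    return beams

best_logp, best_tokens = beam_search(steps)[0]
print(" ".join(best_tokens))  # Juan come helado . EOS
```

Working in log-probabilities, as real implementations do, avoids numerical underflow when multiplying many small probabilities together.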

Summing up:

  • The Encoder is called by the Translator before the Decoder, and the memory_bank variable stores the context vector.
  • The context vector is created by the Encoder; it is actually the second output of Encoder.forward. It is called “memory bank for attention” (the memory_bank variables) in the PyTorch code.
  • The context vector is a summary of the source sentence, and the attention vectors give you the relationship between each produced word and the source words.
  • The Decoder generates a sequence of target words, taking into account the representation of the sequence of source words, inside a beam search process.
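To make the summary concrete, here is a minimal PyTorch sketch, with a plain nn.LSTM standing in for OpenNMT-py's encoder (the names and dimensions are illustrative assumptions, not the project's real modules). It shows where the “memory bank” comes from and what shape it has:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
src_len, batch, emb_dim, hidden = 4, 1, 16, 32  # 4 tokens: "John eats ice cream"

# Stand-in embeddings for the source tokens (random, for illustration only;
# a real encoder would look these up in a trained embedding table).
src = torch.randn(src_len, batch, emb_dim)

encoder = nn.LSTM(emb_dim, hidden)
memory_bank, (h_n, c_n) = encoder(src)

# memory_bank: one context vector per source position -- the "memory bank
# for attention" mentioned in the summary above.
print(tuple(memory_bank.shape))  # (4, 1, 32)

# h_n / c_n: final encoder states, used to initialize the decoder.
print(tuple(h_n.shape))          # (1, 1, 32)
```

Note that the memory bank keeps one vector per source token ("ice" and "cream" each keep a position), so the many-to-one mapping to "helado" is not resolved in the encoder itself; it emerges from the attention weights and the decoder's sequence generation.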

Hope that can help! :slight_smile:

(Terence Lewis) #6

Another great explanation :slight_smile:

(Duane Dougal) #7

Yes, Eva, this is fantastic and helps a lot! Thank you very, very much! Muchísimas gracias! This needs to go into the documentation. I never would have guessed that memory_bank contains the context vector.

I would like to generate a “beam search graph”–a table with tokens and probabilities–like the one you’ve illustrated here. That will help me a ton. Thanks for the idea.

You said:

The context vector is a summary of the source sentence, and the attention vectors give you the relationship between each produced word and the source words.

Can you provide an example or give additional insight? Am I right that the variable attn contains the attention vectors?

Once again, thank you very much. This has been extremely helpful!