Multiple tokens in Source to single token in Target

I’m using OpenNMT-py and it is working very well on EN->ES. Thanks for making this code available!

I have a question: I’m trying to find and understand in the code where multiple tokens in the source (EN) map to either a single token or multiple tokens in the target (ES). For example, if I have the sentence, “John eats ice cream” and a translation in Spanish, “Juan come helado,” where in the code does it map “ice cream” to “helado” (a multiple token concept in the source to a single token concept in the target)?

I understand the NMT concepts about the encoder creating a semantic context vector and the decoder producing a translation from the semantic context vector. I can see in the _build_target_tokens def in Translation.py where tokens are selected from the target vocabulary and combined to create a translation. However, I want to find in the code where “ice cream” gets distilled into the semantic context vector as a single semantic entity (even though it is represented by two tokens in EN). I can also see in the translate_batch def in Translator.py where the encoder is run on the source. Where is the context vector generated and how do I read the context vector?

Hi Duane!

This is an interesting question :slight_smile:
This source-target mapping is captured in the attention vector.
I don’t know the layout of the Python code in much detail, but I think I can give you some hints to find the information you want.

The context vector is generated by the encoder, in fact, it is the output of the encoder after processing the source sentence. This context vector encloses the information of the entire source sentence, and it is given to the decoder for the decoding process.

The decoder uses this context vector from the source sentence, in combination with the previously generated word (and sometimes its previous state), to generate the next translated word.
In particular, this context vector is fed to the attention module (integrated into the decoder), which is in charge of calculating the probability of the current translated word being a translation of each of the source sentence words.
If you look at the attention weights after translating your sentence “John eats ice cream”, you should see that the probability relating “helado” and “ice” is very similar to the one for “helado” and “cream”.
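To make that concrete, here is a minimal, self-contained sketch of global (dot-product) attention in plain PyTorch. It is not the exact OpenNMT-py attention module, and all names and shapes are made up for illustration:

import torch
import torch.nn.functional as F

src_len, hidden = 4, 8                      # "John eats ice cream" -> 4 encoder states
memory_bank = torch.randn(src_len, hidden)  # encoder output, one vector per source token
dec_state = torch.randn(hidden)             # current decoder hidden state

scores = memory_bank @ dec_state            # one score per source token
align = F.softmax(scores, dim=0)            # attention weights over the source, sum to 1
context = align @ memory_bank               # weighted sum of encoder states used by the decoder

print(align)                                # e.g. the weights for John / eats / ice / cream at this step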

Looking at the code, you can see in the translate function from Translator.py how the attention vectors (attn) are obtained from the decoder and then used for the ‘copy’ mechanism during translation (that is, when an unknown word is generated, the source word it comes from is copied instead). You can also find there a Debug attention section with an example of how to access the attention information.
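If you just want to eyeball the numbers, a small hypothetical helper like this (not part of OpenNMT-py) will print one attention matrix, given the source tokens, the predicted tokens, and an attn tensor of shape (tgt_len, src_len):

def print_attention(src_tokens, tgt_tokens, attn):
    # header row with the source tokens
    print(f"{'':>12} " + " ".join(f"{s:>10}" for s in src_tokens))
    # one row per predicted target token, with its weight for every source token
    for tgt_tok, row in zip(tgt_tokens, attn):
        cells = " ".join(f"{float(p):10.4f}" for p in row)
        print(f"{tgt_tok:>12} {cells}")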

Hope this can help.
Good luck!
Eva


Excellent explanation for me too, Eva :slight_smile:


Thank you very much, Eva. This is helpful. :blush:

However, I have an additional question. When you say that the probability relating “helado” and “ice” is very similar to the one for “helado” and “cream”, why wouldn’t that produce “Juan come helado helado”? NMT isn’t really comparing the human-readable words. There’s still the issue of a potential mismatch in the number of tokens between source and target that I’m trying to understand and find in the code. I need to find where the source words “ice” and “cream” are combined into a single semantic object that equates to the target “helado.” I presume that happens in the encoder.

As you’ve described, NMT in general works like this:

src_sentence > encoder > context_vector > decoder > tgt_sentence

Although I can see in the code where the encoder is called, I am having trouble finding the resulting context vector (or which variable or object contains the context vector – there doesn’t seem to be anything called “context vector” in the code). Lots of data are produced by the encoder. Isn’t it true that the decoder looks at the context vector and not the source sentence? So by the time the context vector is produced, it appears that the number-of-tokens issue has already been resolved and the context vector contains semantic objects rather than individual (human) words. I need to be able to watch the creation of the context vector. Can you tell me where specifically in the code that is?

Hi Duane,
sorry for the late response, but these days have been very busy.

Indeed, the decoder only uses the context vector from the encoder as information about the source sentence (apart from the encoder hidden state, which is used at the beginning of decoding to initialize the decoder state).

The encoder is called by the Translator.translate_batch function:

enc_states, memory_bank = self.model.encoder(src, src_lengths)

so, the context vector, if I am not mistaken, is in the memory_bank variable.
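If it helps to see the shapes, here is a standalone sketch in plain PyTorch (not the OpenNMT-py classes) of what that call returns: the memory bank holds one hidden vector per source token, and the final state is what initializes the decoder. All sizes are made up for illustration:

import torch
import torch.nn as nn

emb_size, hidden_size, src_len, batch_size = 16, 32, 4, 1
embedded_src = torch.randn(src_len, batch_size, emb_size)  # "John eats ice cream", already embedded

encoder = nn.LSTM(emb_size, hidden_size)
memory_bank, enc_states = encoder(embedded_src)

print(memory_bank.size())    # torch.Size([4, 1, 32]): one vector per source token; this is what attention reads
print(enc_states[0].size())  # final hidden state, used to initialize the decoder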

Afterwards, the beam search process is performed, in which the decoder is called on the order of beam_size × sentence_length times, and a vector of batch x beam word scores is computed by means of the generator. So, for each beam of the search we have a vector with the probability of each word in the target vocabulary being the next translated word.

Here is where “the magic” happens: when the beam search is done, the probabilities are combined to produce the most probable target sequence.
Recall that the decoder is producing a target sequence rather than translating the source sentence word by word.
So, for your example “John eats ice cream”, we would see something like this when looking at the beam tokens and their probabilities (assume we are running a 5-beam search):

Beam1           Beam2           Beam3           Beam4           Beam5
Juan (0.60)     John (0.50)     él (0.10)       el (0.05)       chico (0.10)
come (0.50)     comes (0.45)    comió (0.49)    comer (0.40)    comerá (0.25)
helado (0.53)   helados (0.50)  polo (0.48)     crema (0.40)    hielo (0.30)
. (0.8)         . (0.8)         . (0.8)         helada (0.10)   crema (0.10)
EOS (0.8)       EOS (0.8)       EOS (0.8)       . (0.8)         . (0.8)

So, from the beam search graph you can see that the most probable sequence is "Juan come helado . " followed by "Juan comió helado . " and so on. The decoder has learned to produce “helado” as the most probable word at the third step when processing “John eats ice cream” in the source sentence and after producing “Juan come” in the previous two steps.
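As a toy illustration of how those per-step probabilities turn into a sequence score (the real beam search also prunes hypotheses at each step and may length-normalize, so this is only the flavor of it):

import math

# Two hypotheses built from the made-up table above; a hypothesis score is the sum of
# the log-probabilities of its tokens, and the highest total wins.
hyp_a = [("Juan", 0.60), ("come", 0.50), ("helado", 0.53), (".", 0.8), ("EOS", 0.8)]
hyp_b = [("Juan", 0.60), ("comió", 0.49), ("helado", 0.53), (".", 0.8), ("EOS", 0.8)]

def score(hyp):
    return sum(math.log(p) for _, p in hyp)

print(score(hyp_a), score(hyp_b))  # hyp_a scores slightly higher, so "Juan come helado ." is preferred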

Summing up:

  • The Encoder is called by the Translator before the Decoder, and the “memory_bank” variable stores the context vector.
  • The context vector is created by the Encoder; it is actually the second output of Encoder.forward. It is called the “memory bank for attention” (the memory_bank variables) in the PyTorch code.
  • The context vector is a summary of the source sentence, and the attention vectors give you the relationship between each produced word and the source words.
  • The Decoder generates a sequence of target words, taking into account the representation of the sequence of source words, inside a beam search process.

Hope that can help! :slight_smile:
Eva


Another great explanation :slight_smile:

Yes, Eva, this is fantastic and helps a lot! Thank you very, very much! Muchísimas gracias! This needs to go into the documentation. I never would have guessed that memory_bank contains the context vector.

I would like to generate a “beam search graph”–a table with tokens and probabilities–like the one you’ve illustrated here. That will help me a ton. Thanks for the idea.

You said:

The context vector is a summary of the source sentence, and the attention vectors give you the relationship between each produced word and the source words.

Can you provide an example or give additional insight? Am I right that the variable attn contains the attention vectors?

Once again, thank you very much. This has been extremely helpful!

Hi Duane,
sorry for the late response, but these have been busy days.

Indeed, the attn variable contains the attention vectors.

The context vector is a dense vector that summarizes the source sentence; it is built up while processing each word in the source sentence.

The attention vectors gather the soft-alignment information between target and source words. And this is important: the alignment goes from target to source, not the other way around.
Notice that the attention vectors have dimensionality target_len x batch_size x source_len; that is, for each word of the target sentence in a batch there is a vector giving its relation to each word in the source sentence. So, for each target word we have the probability of it being translated from each of the words in the source sentence.
Since the attention module computes a probability distribution, the components of the attention vector for each target word sum to one (see the quick check below).
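A quick, hedged check of those two facts, using a random tensor as a stand-in for a real attn (the shape is the one described above):

import torch

attn = torch.softmax(torch.randn(3, 1, 5), dim=-1)  # stand-in: target_len=3, batch_size=1, source_len=5
print(attn.shape)        # torch.Size([3, 1, 5])
print(attn.sum(dim=-1))  # every entry is ~1.0: a distribution over source words for each target word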
Maybe this example clears things up.
If we have this sentence pair:

src: John eats ice cream
tgt: Juan come helado

we would also have this set of attention vectors:

         John    eats    ice     cream   .
Juan     0.80    0.10    0.05    0.025   0.025
come     0.20    0.70    0.035   0.035   0.03
helado   0.03    0.06    0.45    0.45    0.01
.        0.01    0.01    0.01    0.02    0.95

So, it is clear that, for instance, “Juan” comes from “John”, since it is the word with the highest probability.
If we look at “helado”, things get more complicated. Although such a clean tie is unlikely in practice, it is entirely possible for a target word to have two or more source words with the highest probability values; that indicates a many-to-one alignment.

Looking at the attention table the other way round, we can see which target words are related to each source word. For “John” it is clear that it is translated into “Juan”. But if we look at the words “ice” and “cream”, we confirm that there is a many-to-one alignment: “ice cream - helado”.

However, there are fancier ways to visualize the attention vectors than building attention matrices :wink:
like the one indicated in this other post: Extracting and visualizing the decoder attention weights
or in this other one: How to visualize attention?
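That said, even a very small matplotlib sketch will get you a readable heatmap of the matrix above (the values are the example’s, and the plotting choices are just one option):

import matplotlib.pyplot as plt

src = ["John", "eats", "ice", "cream", "."]
tgt = ["Juan", "come", "helado", "."]
attn = [[0.80, 0.10, 0.05, 0.025, 0.025],
        [0.20, 0.70, 0.035, 0.035, 0.03],
        [0.03, 0.06, 0.45, 0.45, 0.01],
        [0.01, 0.01, 0.01, 0.02, 0.95]]

plt.imshow(attn)                      # one row per target word, one column per source word
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.colorbar()
plt.show()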


This thread is really interesting, thank you Eva for your detailed explanations! :slight_smile:

Really great information, Eva. Your example is very clear and helpful. Thank you!

So, like you, I created a visualization to see the attention probabilities and to help me understand what is happening. I then translated the following:

SRC: Allow me to comment.

This produced the following, perfectly good translation:

TGT: Permítanme comentar.

Notice that this sentence pair has two many-to-one alignments. This is a real translation, not a hypothetical example. The following is the attention matrix that I generated, with the values extracted from attn (note that the values are rounded to keep the table readable, so each row may not add up to exactly 1.0, but it’s close enough):

Allow me to comment .
Permítanme 0.1913 0.2262 0.0891 0.2291 0.2645
comentar 0.0349 0.0594 0.0744 0.4771 0.3542
. 0.0253 0.0359 0.0207 0.0764 0.8417

As you can see, “comment” is translated as “comentar” and has the highest attention probability in the second row (0.4771). (This should be “to comment” translated as “comentar,” a many-to-one alignment.) Likewise, the period at the end of the sentence is generated correctly with the highest attention probability in the third row (0.8417). These are both just fine.

However, the first row is baffling. “Permítanme” is a correct translation for “Allow me” (also a many-to-one alignment). The attention probabilities for “Allow” (0.1913) and “me” (0.2262) are vaguely similar, but neither is the highest probability value in the row and they aren’t as similar as other probability values in the same row. In other words, the many-to-one alignment isn’t represented by the highest probability value pair. Can you explain what is happening here? How did this produce a correct translation?

Here is additional information that may be helpful: first, the full attention probabilities, and second, the full beam.

Vector shape: (1, 5, 5)
[0]
0.1913 0.2262 0.0891 0.2291 0.2645
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
[1]
0.0349 0.0594 0.0744 0.4771 0.3542
0.0320 0.0661 0.0891 0.4245 0.3883
0.0431 0.0797 0.0952 0.3851 0.3968
0.0297 0.0506 0.0424 0.5985 0.2789
0.3058 0.0845 0.1661 0.2458 0.1978
[2]
0.0114 0.0398 0.0215 0.5926 0.3348
0.0253 0.0359 0.0207 0.0764 0.8417
0.0255 0.0419 0.0276 0.0783 0.8267
0.0359 0.0429 0.0524 0.5382 0.3305
0.0115 0.0277 0.0216 0.5994 0.3399
[3]
0.0324 0.0409 0.0169 0.0644 0.8454
0.0175 0.0168 0.0071 0.0599 0.8988
0.0342 0.0425 0.0208 0.0740 0.8285
0.0301 0.0116 0.0451 0.6814 0.2318
0.0513 0.0134 0.0142 0.5761 0.3449

The numbers in square brackets in the following full beam table are the indexes of the tokens in the target vocabulary.

Beam1 Beam2 Beam3 Beam4 Beam5
Col0 <s> [2] <blank> [1] <blank> [1] <blank> [1] <blank> [1]
Col1 Permítanme [957] Permítame [3979] Déjenme [13455] Permítaseme [16775] Me [193]
Col2 que [8] comentar [2289] comentar [2289] hacer [92] que [8]
Col3 . [6] comente [15330] . [6] me [73] un [17]
Col4 </s> [3] </s> [3] . [6] comentario [1585] refiera [9494]

The translation produced seems to come from:

  • Beam1, Col0: “<s>”

  • Beam1, Col1: “Permítanme”

  • Beam2/Beam3, Col2: “comentar”

  • Beam1/Beam3, Col3: “.”

  • Beam1/Beam2, Col4: “</s>”

@Scuba Did you happen to get an explanation for this? I’m struggling with the one-to-many thing in my dataset as well.

Mohammed Ayub