Really great information, Eva. Your example is very clear and helpful. Thank you!
So, like you, I created a visualization to see the attention probabilities and to help me understand what is happening. I then translated the following:
SRC: Allow me to comment.
This produced the following, perfectly good translation:
TGT: Permítanme comentar.
Notice that this sentence pair has two many-to-one alignments. This is a real translation, not a hypothetical example. The following is the attention matrix I generated, with the values extracted from attn (the values are rounded to keep the table readable, so each row may not sum to exactly 1.0, but it is close):
| | Allow | me | to | comment | . |
|---|---|---|---|---|---|
| Permítanme | 0.1913 | 0.2262 | 0.0891 | 0.2291 | 0.2645 |
| comentar | 0.0349 | 0.0594 | 0.0744 | 0.4771 | 0.3542 |
| . | 0.0253 | 0.0359 | 0.0207 | 0.0764 | 0.8417 |
As you can see, “comment” is translated as “comentar” and receives the highest attention probability in the second row (0.4771). (Strictly speaking, it is “to comment” that is translated as “comentar,” a many-to-one alignment.) Likewise, the period at the end of the sentence is generated correctly, with the highest attention probability in the third row (0.8417). Both of these look fine.
However, the first row is baffling. “Permítanme” is a correct translation of “Allow me” (also a many-to-one alignment). The attention probabilities for “Allow” (0.1913) and “me” (0.2262) are roughly similar, but neither is the highest value in the row, and they are no more similar to each other than to the other values in that row. In other words, the many-to-one alignment is not captured by the highest-probability pair. Can you explain what is happening here? How did this produce a correct translation?
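For concreteness, a quick NumPy check on the rounded values from the table above (just the per-row argmax) shows the same thing: the first row’s largest weight lands on the final period, not on “Allow” or “me”:

```python
import numpy as np

# Attention matrix from the table above (rows: target tokens, cols: source tokens).
src = ["Allow", "me", "to", "comment", "."]
tgt = ["Permítanme", "comentar", "."]
attn = np.array([
    [0.1913, 0.2262, 0.0891, 0.2291, 0.2645],
    [0.0349, 0.0594, 0.0744, 0.4771, 0.3542],
    [0.0253, 0.0359, 0.0207, 0.0764, 0.8417],
])

# Hard alignment: for each target token, pick the source token with the highest weight.
for t, row in zip(tgt, attn):
    print(f"{t:>12} -> {src[row.argmax()]}  ({row.max():.4f})")

# Output:
#   Permítanme -> .  (0.2645)
#     comentar -> comment  (0.4771)
#            . -> .  (0.8417)
```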
Here is some additional information that may be helpful: first, the full attention probabilities and, second, the full beam.
```
Vector shape: (1, 5, 5)
[0]
0.1913 0.2262 0.0891 0.2291 0.2645
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
0.1757 0.2310 0.0700 0.1987 0.3246
[1]
0.0349 0.0594 0.0744 0.4771 0.3542
0.0320 0.0661 0.0891 0.4245 0.3883
0.0431 0.0797 0.0952 0.3851 0.3968
0.0297 0.0506 0.0424 0.5985 0.2789
0.3058 0.0845 0.1661 0.2458 0.1978
[2]
0.0114 0.0398 0.0215 0.5926 0.3348
0.0253 0.0359 0.0207 0.0764 0.8417
0.0255 0.0419 0.0276 0.0783 0.8267
0.0359 0.0429 0.0524 0.5382 0.3305
0.0115 0.0277 0.0216 0.5994 0.3399
[3]
0.0324 0.0409 0.0169 0.0644 0.8454
0.0175 0.0168 0.0071 0.0599 0.8988
0.0342 0.0425 0.0208 0.0740 0.8285
0.0301 0.0116 0.0451 0.6814 0.2318
0.0513 0.0134 0.0142 0.5761 0.3449
```
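For what it’s worth, this is how I have been slicing the dump (a sketch only; reading each (1, 5, 5) block as batch × beam × source token is my assumption, made by lining the rows up with the beam table below, not something I have confirmed in the code):

```python
import numpy as np

# Step-0 block copied from the dump above, with the leading batch dimension
# squeezed out. Treating each row as one beam's attention over the five source
# tokens is my assumption, not a documented layout.
step0 = np.array([
    [0.1913, 0.2262, 0.0891, 0.2291, 0.2645],
    [0.1757, 0.2310, 0.0700, 0.1987, 0.3246],
    [0.1757, 0.2310, 0.0700, 0.1987, 0.3246],
    [0.1757, 0.2310, 0.0700, 0.1987, 0.3246],
    [0.1757, 0.2310, 0.0700, 0.1987, 0.3246],
])

src = ["Allow", "me", "to", "comment", "."]

# Which source token each beam attends to most at this decoding step.
for beam, row in enumerate(step0, start=1):
    print(f"Beam{beam}: {src[row.argmax()]} ({row.max():.4f})")
```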
The numbers in square brackets in the following full beam table are the indexes of the tokens in the target vocabulary.
| | Beam1 | Beam2 | Beam3 | Beam4 | Beam5 |
|---|---|---|---|---|---|
| Col0 | <s> [2] | <blank> [1] | <blank> [1] | <blank> [1] | <blank> [1] |
| Col1 | Permítanme [957] | Permítame [3979] | Déjenme [13455] | Permítaseme [16775] | Me [193] |
| Col2 | que [8] | comentar [2289] | comentar [2289] | hacer [92] | que [8] |
| Col3 | . [6] | comente [15330] | . [6] | me [73] | un [17] |
| Col4 | </s> [3] | </s> [3] | . [6] | comentario [1585] | refiera [9494] |
The translation produced seems to come from:
- Beam1, Col0: “<s>”
- Beam1, Col1: “Permítanme”
- Beam2/Beam3, Col2: “comentar”
- Beam1/Beam3, Col3: “.”
- Beam1/Beam2, Col4: “</s>”
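To check my own reading, I also sketched how the surviving hypothesis seems to hop between beam slots. The parent links below are my guesses from the token table (the decoder’s actual backpointers are not in the dump), so treat it as an illustration of the backtracking, not as extracted data:

```python
# Illustrative only: the parent links are my guesses from the beam table above
# (Beam numbers match the table); they are not values dumped from the decoder.
# Each entry: column -> (beam slot holding the token on the path, token,
#                        parent beam slot at the previous column).
path = {
    "Col1": (1, "Permítanme", 1),  # extends Beam1's <s> from Col0
    "Col2": (2, "comentar", 1),    # Beam2's "comentar" grows from Beam1's "Permítanme"
    "Col3": (3, ".", 2),           # Beam3's "." grows from that "comentar" hypothesis
    "Col4": (2, "</s>", 3),        # Beam2's "</s>" closes that hypothesis
}

# Walk the columns backwards, following the parent links, then reverse.
tokens, beam = [], 2               # assuming the finished hypothesis sits in Beam2 at Col4
for col in ["Col4", "Col3", "Col2", "Col1"]:
    slot, token, parent = path[col]
    assert slot == beam            # sanity check: we are in the slot we expected
    tokens.append(token)
    beam = parent
print(" ".join(reversed(tokens)))  # Permítanme comentar . </s>
```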