Attention in forced decoding is not adjusting correctly to the tgt file

I have a source sentence in German and a target sentence in English and I’m using the pre-trained DE-EN model that comes with OpenNMT-py.

My objective is to generate the attention matrix only from the tgt file; the model is not allowed to come up with other translations. I just want the (%) confidence for each target word.

src_file:
das daten ist vorbei .

tgt_file:
the date is over .

If I run translate.py, the beam search outputs:

              das      daten        ist     vorbei          . 
   the  0.3053247 *0.4235602  0.0841463  0.0491322  0.1378366 
  data  0.0386287 *0.9214475  0.0232268  0.0106175  0.0060794 
    is  0.0516151  0.0754700 *0.4066976  0.3003595  0.1658577 
  over  0.0139861  0.0257822  0.0448820 *0.8648529  0.0504968 
     .  0.0078235  0.0125427  0.0771239  0.0835686 *0.8189413 
  </s>  0.0402967  0.0504099  0.0682272  0.0769914 *0.7640749 

However, if I add -tgt tgt_file and print attn in the _score_target() method (in /OpenNMT-py/onmt/translate/translator.py), I get:

tensor([[[0.3053, 0.4236, 0.0841, 0.0491, 0.1378]],
        [[0.0386, 0.9214, 0.0232, 0.0106, 0.0061]], <- this row is "date"
        [[0.0631, 0.0851, 0.4130, 0.2787, 0.1601]],
        [[0.0165, 0.0332, 0.0391, 0.8633, 0.0480]],
        [[0.0107, 0.0200, 0.0670, 0.0821, 0.8202]],
        [[0.0448, 0.0610, 0.0644, 0.0758, 0.7540]]])

which is pretty similar to the output from beam search. Note that I’ve suggested “daten” be translated as “date” instead of “data”, but I’m still getting the same attention weights.

Now if I replace my target sentence with something totally wrong, e.g. “the data is new .”, I still get a very similar attention matrix:

tensor([[[0.3053, 0.4236, 0.0841, 0.0491, 0.1378]],
        [[0.0386, 0.9214, 0.0232, 0.0106, 0.0061]],
        [[0.0516, 0.0755, 0.4067, 0.3004, 0.1659]],
        [[0.0140, 0.0258, 0.0449, 0.8649, 0.0505]], <- this row is "new"
        [[0.0206, 0.0637, 0.1246, 0.2603, 0.5308]],
        [[0.0408, 0.0604, 0.0655, 0.0653, 0.7681]]])

What am I doing wrong here?

The first attention vector is actually for <s>.

              das      daten        ist     vorbei          . 
   <s>  0.3053247 *0.4235602  0.0841463  0.0491322  0.1378366 
   the  0.0386287 *0.9214475  0.0232268  0.0106175  0.0060794 
  data  0.0516151  0.0754700 *0.4066976  0.3003595  0.1658577 
    is  0.0139861  0.0257822  0.0448820 *0.8648529  0.0504968 
  over  0.0078235  0.0125427  0.0771239  0.0835686 *0.8189413 
     .  0.0402967  0.0504099  0.0682272  0.0769914 *0.7640749
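
In code form, the same relabelling looks like this (a small standalone sketch in plain PyTorch, not OpenNMT-py code; the values are copied from your forced-decoding print, and each row is assumed to belong to the token that is fed to the decoder at that step):

import torch

src = "das daten ist vorbei .".split()
tgt = "the date is over .".split()   # the forced target from tgt_file
decoder_inputs = ["<s>"] + tgt       # assumed label for each attention row

# values copied from the forced-decoding print above
attn = torch.tensor([
    [0.3053, 0.4236, 0.0841, 0.0491, 0.1378],
    [0.0386, 0.9214, 0.0232, 0.0106, 0.0061],
    [0.0631, 0.0851, 0.4130, 0.2787, 0.1601],
    [0.0165, 0.0332, 0.0391, 0.8633, 0.0480],
    [0.0107, 0.0200, 0.0670, 0.0821, 0.8202],
    [0.0448, 0.0610, 0.0644, 0.0758, 0.7540],
])

# print the matrix with source tokens as columns and decoder inputs as rows
print(" " * 8 + "".join(f"{w:>8}" for w in src))
for label, row in zip(decoder_inputs, attn.tolist()):
    print(f"{label:>8}" + "".join(f"{v:8.4f}" for v in row))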

Can you check if that helps?

Oh. Just to be clear, <s> is the start token and is added only to the target, correct?

This would now make sense, as changing “over”, for example, causes the 5th row to change. However, if that’s the correct ordering of target tokens, then:

data -> ist
is -> vorbei
over -> . 

which wouldn’t make sense. Am I supposed to interpret the matrix in a different way?

Yes.

Mmh, how does it look with another example and also without forced decoding?

Okay, this is strange: I have no <s>, and instead I have </s>.

I thought maybe I was behind on commits, so I did a pull, but the issue still stands.

To replicate my output, download this model: https://s3.amazonaws.com/opennmt-models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt and put it inside available_models folder in your OpenNMT-py folder.

Create a src_file:
du bist da .

Run translation using:
python3 translate.py -model available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -src src_file -replace_unk -verbose -attn_debug

You should get :

SENT 1: ['du', 'bist', 'da', '.']
PRED 1: you are there .
PRED SCORE: -1.0267
                   du       bist         da          . 
       you *0.4256270  0.2274503  0.0907743  0.2561484 
       are  0.0489827 *0.4282851  0.1894807  0.3332514 
     there  0.0291599  0.0926688 *0.6324173  0.2457540 
         .  0.0419731  0.0895388  0.0457374 *0.8227507 
      </s>  0.0450034  0.0496944  0.0449075 *0.8603948 
PRED AVG SCORE: -0.2567, PRED PPL: 1.2926

Now, in /OpenNMT-py/onmt/translate/translator.py at line 807, add:
print(attn)

Then create a file named tgt_file:
you are there .

And run translation using:
python3 translate.py -model available_models/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -src src_file -tgt tgt_file -replace_unk -verbose -attn_debug

You should see:

tensor([[[0.4256, 0.2275, 0.0908, 0.2561]],
        [[0.0490, 0.4283, 0.1895, 0.3333]],
        [[0.0292, 0.0927, 0.6324, 0.2458]],
        [[0.0420, 0.0895, 0.0457, 0.8228]],
        [[0.0450, 0.0497, 0.0449, 0.8604]]])

SENT 1: ['du', 'bist', 'da', '.']
PRED 1: you are there .
PRED SCORE: -1.0267
GOLD 1: you are there .
GOLD SCORE: -1.0267
                   du       bist         da          . 
       you *0.4256270  0.2274503  0.0907743  0.2561484 
       are  0.0489827 *0.4282851  0.1894807  0.3332514 
     there  0.0291599  0.0926688 *0.6324173  0.2457540 
         .  0.0419731  0.0895388  0.0457374 *0.8227507 
      </s>  0.0450034  0.0496944  0.0449075 *0.8603948 
PRED AVG SCORE: -0.2567, PRED PPL: 1.2926
GOLD AVG SCORE: -0.2053, GOLD PPL: 1.2279

As you can see, the tensor from forced decoding is identical to the one from beam search, so I assume the ordering of tokens + </s> is supposed to be the same.
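
Just to sanity-check that numerically, here is a quick comparison with the values copied from the two printouts (the forced-decoding print is rounded to 4 decimals, hence the tolerance):

import torch

# attention rows from the beam-search attn_debug table
beam = torch.tensor([
    [0.4256270, 0.2274503, 0.0907743, 0.2561484],
    [0.0489827, 0.4282851, 0.1894807, 0.3332514],
    [0.0291599, 0.0926688, 0.6324173, 0.2457540],
    [0.0419731, 0.0895388, 0.0457374, 0.8227507],
    [0.0450034, 0.0496944, 0.0449075, 0.8603948],
])

# attention rows printed in _score_target() during forced decoding
forced = torch.tensor([
    [0.4256, 0.2275, 0.0908, 0.2561],
    [0.0490, 0.4283, 0.1895, 0.3333],
    [0.0292, 0.0927, 0.6324, 0.2458],
    [0.0420, 0.0895, 0.0457, 0.8228],
    [0.0450, 0.0497, 0.0449, 0.8604],
])

print(torch.allclose(beam, forced, atol=1e-4))  # True, up to print rounding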

And changing “there” in tgt_file to “here” causes the 4th attention vector to change. Same issue as in my initial post. Help?

The <s> is part of the decoder input, which explains why changing the i-th word changes row i+1. The i-th row is typically formed by computing attention between the i-th decoder input and the source encoding, where the decoder input is, for example, <s> you are there .

Then the i-th decoder input will empirically pay attention to the source word that produces decoder output i+1.
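
In other words, the attention rows line up with the shifted decoder input, something like this (a trivial sketch just to show the pairing):

tgt = "you are there .".split()
decoder_inputs  = ["<s>"] + tgt     # what the decoder reads at each step
decoder_outputs = tgt + ["</s>"]    # what it is asked to produce at each step

for i, (inp, out) in enumerate(zip(decoder_inputs, decoder_outputs)):
    print(f"attention row {i}: input {inp!r:>8} -> output {out!r}")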

Does that make sense?

So if I understood you correctly, <s> is an input to the decoder, which in turn generates “you”.
Thus the matrix would be presented more accurately like this:

 input    output         du       bist         da          . 
   <s>       you *0.4256270  0.2274503  0.0907743  0.2561484 
   you       are  0.0489827 *0.4282851  0.1894807  0.3332514 
   are     there  0.0291599  0.0926688 *0.6324173  0.2457540 
 there         .  0.0419731  0.0895388  0.0457374 *0.8227507 
     .      </s>  0.0450034  0.0496944  0.0449075 *0.8603948 

Is this correct?

Also, I’m still confused, since my end goal is to get the confidence for a word. In this table, “there” (as output) is mapped to “da” with 63% probability. But if I substitute it with “here”, then I’m looking at the 4th row, which is responsible for the “.” output.

Either my assumption is wrong or this is not the right way to do this?

The table is correct.

So to achieve your end goal, you should simply use the output index, as you did initially, even though changing a word only changes the next row.
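
Concretely, that means reading the confidence for output word i from row i of the attention matrix. A minimal sketch (plain PyTorch, with attn standing in for the tensor printed in _score_target() and the values copied from above):

import torch

src_tokens = "du bist da .".split()
out_tokens = "you are there .".split() + ["</s>"]

# the (tgt_len, 1, src_len) tensor printed in _score_target(), copied from above
attn = torch.tensor([
    [[0.4256, 0.2275, 0.0908, 0.2561]],
    [[0.0490, 0.4283, 0.1895, 0.3333]],
    [[0.0292, 0.0927, 0.6324, 0.2458]],
    [[0.0420, 0.0895, 0.0457, 0.8228]],
    [[0.0450, 0.0497, 0.0449, 0.8604]],
])

conf, idx = attn.squeeze(1).max(dim=1)   # strongest source word per output row
for out_tok, c, i in zip(out_tokens, conf.tolist(), idx.tolist()):
    print(f"{out_tok:>6} -> {src_tokens[i]:>6}  ({c:.1%})")

For this example it maps “there” to “da” with ~63%, matching the table above.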

Sorry I also got confused here.