About mask in attention layer and decoder

pytorch

(Wen Zhang) #1

Hello, guys,

I found that you did not apply a mask in the attention layer when computing attention, and that you also skipped the mask operation for the target sequence of a batch in the decoder. Does that make sense? Can anyone explain this for me? Thank you very much.
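For context, the decoder-side mask being asked about is usually a causal ("subsequent positions") mask that stops each target position from attending to later positions during training. A minimal PyTorch sketch (the function name is illustrative, not taken from the repository):

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    # Boolean mask of shape (size, size); True marks positions that must
    # be blocked, i.e. entries strictly above the diagonal (future tokens).
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

# For a length-4 target sequence, position 0 may attend only to itself,
# position 1 to positions {0, 1}, and so on.
mask = subsequent_mask(4)
```

Applied to the decoder self-attention scores before the softmax, this mask guarantees the prediction at step t depends only on tokens up to t.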


(Guillaume Klein) #2

Hello,

Are you referring to the Torch or PyTorch implementation?


(Wen Zhang) #3

The PyTorch one.


(Guillaume Klein) #4

cc @srush.


(Guanlin Li) #5

Yes, I am also still wondering why training ignores the source-side mask. One possible explanation: the attention mechanism may learn to ignore the padded hidden states automatically. In that case, not using a mask during training could even force the attention mechanism to become more discriminative, so that it learns on its own to ignore useless information.
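For comparison, the usual way to apply a source-side padding mask is to set the scores of padded key positions to -inf before the softmax, which forces their attention weights to exactly zero instead of relying on the model to learn to ignore them. A hedged sketch of that standard practice (names and shapes are my own, not from the repository):

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, pad_mask=None):
    # Scaled dot-product attention; pad_mask is a boolean tensor that is
    # True at padded key positions, broadcastable to the score shape.
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    if pad_mask is not None:
        # -inf scores become exactly 0 after the softmax.
        scores = scores.masked_fill(pad_mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# One batch, 4 source positions, the last of which is padding.
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)
pad_mask = torch.tensor([[[False, False, False, True]]])  # (1, 1, src_len)
out, w = masked_attention(q, k, v, pad_mask)
# w[..., 3] is all zeros: the padded position receives no attention weight.
```

Without the mask, the softmax will always assign some nonzero weight to the padded position, so any "learning to ignore padding" has to come from the model driving those scores very negative by itself.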