I found that you did not apply a mask in the attention layer when calculating the attention, and that you also skip the mask operation for the target sequences of a batch in the decoder. Does that make sense? Can anyone explain this to me? Thank you very much.
Are you referring to the Torch or PyTorch implementation?
Yes, I am still wondering why training ignores the source-side mask. Perhaps the reason is that the attention mechanism learns to ignore the padded hidden states automatically? If so, not using a mask during training might actually help train the attention mechanism more discriminatively, so that it learns on its own to ignore uninformative positions.
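For comparison, here is a minimal sketch of what explicit source-side masking usually looks like: padded positions are set to `-inf` before the softmax so they get zero attention weight. This is not the repo's code, just an illustration of the operation being discussed (`masked_attention` and its shapes are my own assumptions).

```python
import torch
import torch.nn.functional as F

def masked_attention(scores, src_mask):
    # scores:   (batch, tgt_len, src_len) raw attention scores
    # src_mask: (batch, src_len) bool, True at real tokens, False at padding
    # Fill padded positions with -inf so softmax gives them zero weight.
    scores = scores.masked_fill(~src_mask.unsqueeze(1), float("-inf"))
    return F.softmax(scores, dim=-1)

# Example: batch of 1, target length 2, source length 4 (last 2 positions padded)
scores = torch.zeros(1, 2, 4)
mask = torch.tensor([[True, True, False, False]])
weights = masked_attention(scores, mask)
# weights over the two real positions sum to 1; padded positions get weight 0
```

Without this step, the softmax still distributes some probability mass onto the padding positions, so the model has to learn to drive those scores down by itself, which is the behavior being debated above.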