I found that you did not apply a mask in the attention layer when calculating the attention, and that you also skip the mask operation for the target sequences of a batch in the decoder. Does that make sense? Can anyone explain this to me? Thank you very much.
Are you referring to the Torch or PyTorch implementation?
Yes, I am still wondering why training ignores the source-side mask. Perhaps the reason is that the attention mechanism learns to ignore the padded hidden states automatically? If so, not using a mask during training might actually help train the attention mechanism more discriminatively, so that it learns on its own to ignore uninformative positions.
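For comparison, here is a minimal sketch of what explicit source-side masking usually looks like: padded positions are set to `-inf` before the softmax so they get zero attention weight. This is not the repo's code, just an illustration of the operation being discussed (`masked_attention` and its shapes are my own assumptions).

```python
import torch
import torch.nn.functional as F

def masked_attention(scores, src_mask):
    # scores:   (batch, tgt_len, src_len) raw attention scores
    # src_mask: (batch, src_len) bool, True at real tokens, False at padding
    # Fill padded positions with -inf so softmax gives them zero weight.
    scores = scores.masked_fill(~src_mask.unsqueeze(1), float("-inf"))
    return F.softmax(scores, dim=-1)

# Example: batch of 1, target length 2, source length 4 (last 2 positions padded)
scores = torch.zeros(1, 2, 4)
mask = torch.tensor([[True, True, False, False]])
weights = masked_attention(scores, mask)
# weights over the two real positions sum to 1; padded positions get weight 0
```

Without this step, the softmax still distributes some probability mass onto the padding positions, so the model has to learn to drive those scores down by itself, which is the behavior being debated above.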