Custom attention mask

Is it possible in either the py or tf version to add a custom attention mask to the input, so that other tokens are masked in addition to the padding tokens?

If it is not currently supported, where should I have a look to implement this myself? I imagine a separate file with 0s and 1s for all tokens, or some sort of word feature appended to each token to indicate which words should get attention and which can be ignored. Perhaps an alternative is swapping out the respective tokens with padding tokens. Would that work? If so, what’s onmt’s padding token?


This sounds possible but you’ll still need to edit the code to generate masks based on the padding id and not the sequence length. See for example the Transformer encoder in OpenNMT-py (the code is similar in the decoder):

The padding token is <blank>.
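To make the difference concrete, here is a minimal PyTorch sketch of the two ways of building the mask. It is not the OpenNMT-py code itself: the helper names are made up for illustration, and the padding index of 1 and the (src_len, batch) layout are assumptions.

```python
import torch

PAD_IDX = 1  # assumption: vocabulary index of <blank>


def mask_from_lengths(lengths, max_len):
    """Hypothetical helper: True for positions at or beyond each length."""
    positions = torch.arange(max_len).unsqueeze(0)   # (1, max_len)
    return positions >= lengths.unsqueeze(1)         # (batch, max_len)


def mask_from_padding(src_tokens, pad_idx=PAD_IDX):
    """Hypothetical helper: True wherever the token is <blank>.

    src_tokens is assumed to be a (src_len, batch) tensor of token ids.
    """
    return src_tokens.transpose(0, 1).eq(pad_idx)    # (batch, src_len)


# Two sentences of length 5 and 3. The first one has a <blank>
# inserted in the middle of the sequence.
src = torch.tensor([[4, 7],
                    [5, 8],
                    [1, 9],
                    [6, 1],
                    [9, 1]])
lengths = torch.tensor([5, 3])

length_mask = mask_from_lengths(lengths, src.size(0))
padding_mask = mask_from_padding(src)
```

The length-based mask only covers the trailing padding, so the <blank> in the middle of the first sentence is missed. The padding-based mask catches it.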

Adding a new input file is generally more painful. If you go in that direction, I believe it is easier to do in OpenNMT-tf as there is no preprocessing step.

Thanks for the reply.

Modifying the source code and our input data does seem easier. If I understand correctly, the steps are the following:

  • replace the respective tokens by <blank>;
  • modify the forward of the encoder so that the mask is determined by the padding tokens rather than lengths alone. Probably want to return the mask here rather than just the lengths;
  • in the model, pass the returned src_mask (previously lengths) to the decoder;
  • in the decoder, directly use the passed src_mask rather than (again) calculating the mask based on sequence length.
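The first step can be done as plain preprocessing on the tokenized source. A small sketch, where `blank_out` is a hypothetical helper and the flag convention (1 keeps the token, 0 replaces it) is an assumption:

```python
PAD_TOKEN = "<blank>"


def blank_out(tokens, keep):
    """Replace every token whose keep flag is falsy with the padding token."""
    if len(tokens) != len(keep):
        raise ValueError("tokens and keep flags must have the same length")
    return [tok if flag else PAD_TOKEN for tok, flag in zip(tokens, keep)]


line = blank_out(["i", "like", "cookies"], [0, 1, 1])
print(" ".join(line))
```

The remaining steps are changes inside the encoder, model and decoder and depend on the OpenNMT-py version in use.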

In onmt, should the masked tokens (e.g. the padding token) be 0 or 1? From reading the code I would guess 0.

This sounds about right.

The padding index is 1 in OpenNMT-py.

I think you misunderstood. I don’t mean the padding index but I wonder what the value for masked tokens in the mask tensor is. There are some differences out there: sometimes 1 is used for the masked tokens, other times 0.

Looks like padding positions should be set to 1:
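A small sketch of that convention, using the same `masked_fill` call that appears in the attention code. The tensor sizes are made up for illustration:

```python
import torch

# Scores for 1 sentence, 1 head, 2 query positions and 4 key positions.
scores = torch.zeros(1, 1, 2, 4)

# 1 (True) marks a position that must NOT receive attention.
mask = torch.tensor([[False, False, True, True]]).view(1, 1, 1, 4)

# Masked positions get a very large negative score ...
masked_scores = scores.masked_fill(mask, -1e18)
# ... so they end up with zero weight after the softmax.
attn = torch.softmax(masked_scores, dim=-1)
```

Here each query spreads its attention over the first two keys only.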

Great, I’m having a quick look now. Could you explain to me the sizes of the inputs in the encoder/decoder forward? For batch size three (as a test) I am getting

src torch.Size([6, 3, 2])
tgt torch.Size([11, 3, 1])

What is that last dimension, and why is there a difference between the encoder (src) and decoder (tgt)?

I found out that this last dimension holds the features, including the tokens themselves. So in my case there are two source features (the token and one other) and one target feature (the token).
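For reference, a sketch of how such tensors can be split per feature. The layout (length, batch, n_feats) matches the sizes above; that the token ids sit at feature index 0 is an assumption.

```python
import torch

# Dummy tensors with the sizes reported above: (length, batch, n_feats).
src = torch.zeros(6, 3, 2, dtype=torch.long)
tgt = torch.zeros(11, 3, 1, dtype=torch.long)

src_tokens = src[:, :, 0]    # (6, 3), assumed to hold the token ids
src_feature = src[:, :, 1]   # (6, 3), the additional source feature
tgt_tokens = tgt[:, :, 0]    # (11, 3), the target only has tokens
```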

I am running into some issues during translation. During training everything works fine, but during translation the multiheadattention in the decoder throws an error about the mask size.

Traceback (most recent call last):
  File "/home/bram/.local/share/virtualenvs/nfr-experiments-3R5lX5O6/bin/onmt_translate", line 11, in <module>
    load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_translate')()
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/bin/", line 48, in main
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/bin/", line 25, in translate
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/translate/", line 361, in translate
    batch_data = self.translate_batch(
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/translate/", line 550, in translate_batch
    return self._translate_batch_with_strategy(batch, src_vocabs,
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/translate/", line 674, in _translate_batch_with_strategy
    log_probs, attn = self._decode_and_generate(
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/translate/", line 589, in _decode_and_generate
    dec_out, dec_attn = self.model.decoder(
  File "/home/bram/.local/share/virtualenvs/nfr-experiments-3R5lX5O6/lib/python3.8/site-packages/torch/nn/modules/", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/decoders/", line 319, in forward
    output, attn, attn_align = layer(
  File "/home/bram/.local/share/virtualenvs/nfr-experiments-3R5lX5O6/lib/python3.8/site-packages/torch/nn/modules/", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/decoders/", line 93, in forward
    output, attns = self._forward(*args, **kwargs)
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/decoders/", line 165, in _forward
    mid, attns = self.context_attn(memory_bank, memory_bank, query_norm,
  File "/home/bram/.local/share/virtualenvs/nfr-experiments-3R5lX5O6/lib/python3.8/site-packages/torch/nn/modules/", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bram/Python/projects/nfr-experiments/OpenNMT-py/onmt/modules/", line 202, in forward
    scores = scores.masked_fill(mask, -1e18)
RuntimeError: The size of tensor a (30) must match the size of tensor b (150) at non-singleton dimension 0

After doing some logging I found that the encoder does not have any issues, but that the decoder cannot simply use the src_mask that was created in the encoder. I am not sure why. Here are the logs, and you can see that in the first decoder layer multi-head attention will fail because there is a shape mismatch in the code snippet that you linked. So the mask should be filled for the padded tokens, but the dimensions do not match.

Entering encoder layer 0
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

Entering encoder layer 1
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

Entering encoder layer 2
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

Entering encoder layer 3
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

Entering encoder layer 4
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

Entering encoder layer 5
score torch.Size([30, 8, 223, 223])
mask torch.Size([30, 1, 1, 223])

decoder_input torch.Size([1, 150, 1])
src_pad_mask torch.Size([30, 1, 223]) # here
tgt_pad_mask torch.Size([150, 1, 1])

Entering decoder layer 0
score torch.Size([150, 8, 1, 223])
mask torch.Size([30, 1, 1, 223]) # here

So it seems that the decoder expects a much larger attention mask than the encoder returns, but I do not understand why. Do you have any thoughts on this? It is especially odd to me since this does work in training.

I only made a few changes to the encoder/decoder, which you can find here:

Most likely you are running beam search. You should repeat each batch in the tensor beam_size times.

For example, you can see the memory outputs and state being repeated in this function:
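A minimal sketch of the repetition in plain PyTorch. The beam size of 5 is inferred from the 30 vs 150 in the logs above, and `repeat_interleave` is used here as a stand-in under the assumption that all beams of one sentence sit next to each other in the batch dimension.

```python
import torch

batch_size, beam_size, src_len, heads = 30, 5, 223, 8

src_pad_mask = torch.zeros(batch_size, 1, src_len, dtype=torch.bool)
src_pad_mask[1, 0, 0] = True  # mark one position of sentence 1 for checking

# Each sentence is repeated beam_size times: (30, 1, 223) -> (150, 1, 223).
tiled_mask = src_pad_mask.repeat_interleave(beam_size, dim=0)

# The tiled mask now broadcasts against the decoder attention scores.
scores = torch.zeros(batch_size * beam_size, heads, 1, src_len)
masked = scores.masked_fill(tiled_mask.unsqueeze(1), -1e18)
```

Rows 5 to 9 of the tiled mask are the five copies of sentence 1.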

Ah that makes sense. I never realised that beam search was enabled by default in translation mode.

So in the Strategy initializer I now tile the mask, like so:

tile(src_mask, self.beam_size)

I also return the tiled value so that it can be used in the decoding loop. Updated in this commit:

Funnily enough, I am still getting a size mismatch:

decoder_input torch.Size([1, 145, 1])

src_pad_mask torch.Size([150, 1, 223])
tgt_pad_mask torch.Size([145, 1, 1])

Entering decoder layer 0
score torch.Size([145, 8, 1, 223])
mask torch.Size([150, 1, 1, 223])

Should the target mask be padded up to the max src length? I don’t think so, or at least I don’t see that happening in the original forward:

Another thing that is happening is that finished translations are removed from the batch. So you should update your tensor accordingly:
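As a sketch of what that update can look like: the mask rows of finished sentences are dropped with `index_select`, which takes 150 rows down to 145 when one sentence (5 beams) finishes. The variable names and the finished sentence index are made up, and the code assumes the beams of one sentence are contiguous rows.

```python
import torch

batch_size, beam_size, src_len = 30, 5, 223
tiled_mask = torch.zeros(batch_size * beam_size, 1, src_len, dtype=torch.bool)

# Hypothetical situation: sentence 7 has finished translating.
finished = torch.zeros(batch_size, dtype=torch.bool)
finished[7] = True
tiled_mask[7 * beam_size:(7 + 1) * beam_size] = True  # tag its rows for checking

# Indices of the sentences that are still being decoded: (29,)
keep_sentences = (~finished).nonzero().view(-1)
# Expand each sentence index to the rows of its beams: (29 * 5,)
keep_rows = (keep_sentences.unsqueeze(1) * beam_size
             + torch.arange(beam_size)).view(-1)

tiled_mask = tiled_mask.index_select(0, keep_rows)
```

After this the mask has the same leading dimension as the decoder input again.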

Ah yes, noted. It seems to work well now.

Is this change useful for upstream? If so, I can open a PR that allows tokens to be inserted anywhere in the source without receiving attention.

@vince62s @francoishernandez Would you accept a PR for that?

This looks like an interesting idea indeed. I thought of implementing such a feature at some point but never got around to doing it.
How would this be specified in the args though? Some additional file with the masks on each line? Or would it be a list of specific tokens to ignore?

I would prefer either

  • an additional file that contains 0s and 1s to mask/unmask tokens. That means that every line should have the same number of 0s and 1s as the corresponding source line has tokens; or, preferably:
  • a flag in args such as --custom_attention_mask, which uses the features pipe format. If the flag is enabled, the first “feature” would indicate whether or not a token is masked. So e.g. i|0|c like|1|l cookies|1|l
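To illustrate the second option, a sketch of how a line in that format could be split into tokens, mask flags and remaining features. `parse_masked_line` is a hypothetical helper, not an existing OpenNMT function, and reading 0 as "masked" is an assumption based on the example above.

```python
def parse_masked_line(line, sep="|"):
    """Split 'token|flag|feat...' items into tokens, mask flags and features."""
    tokens, mask, features = [], [], []
    for item in line.split():
        token, flag, *rest = item.split(sep)
        if flag not in ("0", "1"):
            raise ValueError(f"mask flag must be 0 or 1, got {flag!r}")
        tokens.append(token)
        mask.append(int(flag))   # assumption: 0 = masked, 1 = attended
        features.append(rest)    # whatever regular features follow
    return tokens, mask, features


tokens, mask, features = parse_masked_line("i|0|c like|1|l cookies|1|l")
```

With the first option, the same mask list would simply be read from the matching line of the separate file, with a length check against the number of source tokens.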

Your first proposition is what I meant by my first one. Seems good to me.
(I fear the feature one might be a bit misleading.)
Edit: sorry, everything was clear in your initial post but I read too quickly.

Also, @BramVanroy, may I ask what your use case for such a feature would be? What I had in mind may not be exactly what you imply here.

Would the rationale be something like MASS or BART? Probably not though, since you would use a fixed mask here.

Would be nice if you could elaborate a bit. Thanks!

We are building on previous work on neural fuzzy repair, which integrates fuzzy match targets on the source side to direct the system to a correct translation (so your source input consists of the source text + noisy (fuzzy) target). For a new paper we are trying out a number of different strategies. One of the things that we wanted to try was to see how we can tell the system which parts of the fuzzy match are relevant and which aren’t.

We tried a number of approaches (look forward to the paper, which is in the works), and one of them is to add a feature to fuzzy tokens, telling the system which ones are actually matched and which are just noise. This binary feature works quite well. I personally also wanted to test what would happen if we just block all attention that these irrelevant tokens receive, hence this use case. I found that this does not work well and that a matched/not_matched feature works much better. My explanation would be that the context that these “irrelevant tokens” provide is still quite useful to the system, even though their direct, concrete meaning is not relevant.

But if I understand correctly, the mask does not work well compared to the features in your experiments, so it might not be that useful as a feature after all.
Anyway, if it’s not too much of a hassle you can always send a PR and we might merge it. It could be useful if some people want to experiment further down this path.