Guided alignment with multi-source

I found a related question here, but OP decided to use multi-feature instead.

I just read the paper on guided alignment and I think it could benefit our set-up. However, I am not sure how to configure this in the YAML file. Should I just split `train_alignments` up into two files? E.g.:

  - alignments_1.txt
  - alignments_2.txt
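Concretely, the configuration I have in mind would look something like this (I am guessing that `train_alignments` accepts a list for multi-source, which is the whole question; the weight value follows my reading of the paper below):

```yaml
data:
  train_alignments:
    - alignments_1.txt
    - alignments_2.txt
params:
  guided_alignment_type: ce
  guided_alignment_weight: 0.5
```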

Three additional questions:

  • from the paper, it seems the best results were obtained using CE loss and a 2:1 weight ratio. If I read the paper correctly, that means 2 for the decoder loss. However, I can't find a decoder weight in the config file. I therefore assume the decoder weight is always 1 and that I should use 0.5 for `guided_alignment_weight`. Is that correct?
  • is null alignment allowed? In other words, can a significant amount of data be unaligned (i.e. have indices missing from the GIZA alignment string)?
  • is it possible to only use guided alignment on one source?
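To make the 2:1 question concrete, here is how I understand the weighting (a sketch with hypothetical names, assuming the decoder cross-entropy term is implicitly weighted 1):

```python
def total_loss(ce_loss, alignment_loss, guided_alignment_weight):
    """Combined training loss: the decoder cross-entropy counts with an
    implicit weight of 1, so a 2:1 decoder-to-alignment ratio corresponds
    to guided_alignment_weight = 0.5."""
    return ce_loss + guided_alignment_weight * alignment_loss

# With equal raw losses, the decoder term contributes twice as much
# as the alignment term when the weight is 0.5.
print(total_loss(1.0, 1.0, 0.5))  # 1.5
```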


Guided alignment is applied on the attention vector returned by the decoder. For a multi-source Transformer model, this vector is the first attention head from the last attention layer applied to the first source.

Currently, it is mostly designed to make Transformer models return usable alignments, not as a way to improve overall performance. Is that what you are looking for?


There can be some words that are not aligned. The alignment string is used as indices in a sparse matrix.
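To illustrate how the alignment string maps to a matrix, here is a small sketch (hypothetical helper name; a dense matrix stands in for the sparse one used internally). Words that appear in no pair simply stay unaligned, and an empty string yields an all-zero matrix:

```python
import numpy as np

def alignment_matrix(pharaoh, src_len, tgt_len):
    """Build a 0/1 alignment matrix from a Pharaoh/GIZA-style string
    such as "0-0 1-2 2-1", where each pair is src_index-tgt_index."""
    matrix = np.zeros((src_len, tgt_len), dtype=np.float32)
    for pair in pharaoh.split():
        s, t = map(int, pair.split("-"))
        matrix[s, t] = 1.0
    return matrix

m = alignment_matrix("0-0 2-1", 3, 2)
# Source word 1 appears in no pair, so its row stays all zeros.
```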

As mentioned above, it only supports one source at the moment.

I forgot to reply to you directly, so reposting:

I am indeed looking for a way to improve over our baseline system. For some of our sources, when provided, we know that not all of their information is useful. So I am wondering whether hinting to the attention layer which parts are relevant to, and aligned with, the target could help the model distinguish between relevant and irrelevant information. Would you say that makes a sensible use case for guided alignment?

However, that raises the question of what to do when that source is empty (in which case we use a custom token, as you suggested). Having `0-0` is not semantically correct, but I suppose the alignment string cannot be empty either?

Important edit: I mistakenly said I want to tell the model which parts in one of the sources is relevant to the source, but I meant of course to the target.

The concern is that the current implementation focuses on a very small part of all attention vectors (the first head of the last layer). I don't expect this to have a visible impact on model performance. Maybe we could guide the first head of all layers, but that would require some experimentation. We did not really explore this area.

The alignment string can be empty. It will just produce an alignment matrix full of zeros.

That sounds like a fair point indeed. Any reason why it is implemented only for the first head? In other words, what are the possible downsides of using it for all heads? I don't see such an explicit statement in the paper, but perhaps that is because the paper predates multi-head attention (if I'm not mistaken).

This was inspired by the implementation of the same feature in Marian.

8 posts were split to a new topic: Guided alignment and weighted datasets