Questions about word alignment with SentencePiece

I am trying to get the word alignment of a translated sentence and have run into some questions.

Let’s say I want to translate the following sentence: "I love my dog." SentencePiece tokenizes it to "▁I ▁love ▁my ▁dog .".
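For reference, this is roughly what the tokenization looks like through the SentencePiece Python API (the model path is a placeholder for your own trained model):

```python
import sentencepiece as spm

# Load a trained SentencePiece model (the path is a placeholder).
sp = spm.SentencePieceProcessor(model_file="source.model")

print(sp.encode("I love my dog.", out_type=str))
# ['▁I', '▁love', '▁my', '▁dog', '.']
```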

The translation and word alignment are: ▁I ch ▁ liebe ▁me inen ▁Hund . ||| 0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7. However, this alignment looks inaccurate to me. For example, I (Ich) should be not only 0-0 but also 0-1, since Ich is split into the pieces ▁I and ch.
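To make the mapping concrete, here is a small sketch that expands this Pharaoh-format alignment string into piece-to-piece pairs (token lists copied from above):

```python
src = "▁I ▁love ▁my ▁dog .".split()
tgt = "▁I ch ▁ liebe ▁me inen ▁Hund .".split()
align = "0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7"

# Each "s-t" pair links source piece s to target piece t (0-indexed).
for pair in align.split():
    s, t = map(int, pair.split("-"))
    print(f"{src[s]} -> {tgt[t]}")
```

Printed out like this, the questionable links are easy to spot, e.g. 4-1 maps "." to "ch".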

So I am looking into improving the alignment accuracy. The documentation indicates that providing reference alignment files can help improve the results here.

Would the following approach be reasonable here to get better word alignment results?

  1. Tokenize the source and target files
  2. Run them through fast_align
  3. Add the resulting file as path_align to the config file
  4. Retrain the model
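For steps 1 and 2, a minimal sketch of how I imagine preparing fast_align's input from the tokenized files (the filenames src.sp / tgt.sp are just examples; the fast_align invocation follows its README):

```python
# Join the SentencePiece-tokenized source/target files line by line into
# fast_align's "source ||| target" input format (filenames are examples).
with open("src.sp") as fs, open("tgt.sp") as ft, \
        open("corpus.src-tgt", "w") as out:
    for s, t in zip(fs, ft):
        out.write(f"{s.strip()} ||| {t.strip()}\n")

# Then, per the fast_align README:
#   fast_align -i corpus.src-tgt -d -o -v > corpus.align
# corpus.align is the Pharaoh-format file I would point path_align at.
```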

However, I am a little unsure: how does fast_align know the correct alignment of the SentencePiece tokens? I don’t quite see how this would work.

Hello!

It does not matter whether these are words, subwords, or otherwise. The tool trains a probabilistic model on the bilingual corpus you provide, and then aligns each sentence accordingly. It falls into the same category of statistical IBM word alignment models as Giza++, which was used in statistical phrase-based machine translation. You can find more details in their paper.
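To illustrate the idea, here is a toy IBM Model 1 EM sketch (not fast_align's actual model, which is a reparameterized IBM Model 2): the aligner only ever sees whitespace-separated units, so subword pieces are handled exactly like words, and translation probabilities are learned purely from co-occurrence across sentence pairs. The two-sentence corpus is made up for the example.

```python
from collections import defaultdict

# Toy parallel corpus of subword pieces: the aligner only sees
# whitespace-separated units, so subwords work exactly like words.
corpus = [
    ("▁I ▁love ▁my ▁dog .".split(), "▁I ch ▁liebe ▁mein en ▁Hund .".split()),
    ("▁I ▁love ▁my ▁cat .".split(), "▁I ch ▁liebe ▁mein e ▁Katze .".split()),
]

# IBM Model 1 EM: start uniform, then iterate expected counts.
t = defaultdict(lambda: 1.0)  # t(tgt | src), unnormalized uniform start
for _ in range(10):
    counts = defaultdict(float)
    totals = defaultdict(float)
    for src, tgt in corpus:
        for w_t in tgt:
            norm = sum(t[(w_s, w_t)] for w_s in src)
            for w_s in src:
                delta = t[(w_s, w_t)] / norm  # expected count of this link
                counts[(w_s, w_t)] += delta
                totals[w_s] += delta
    # M-step: renormalize per source piece.
    t = defaultdict(float, {k: v / totals[k[0]] for k, v in counts.items()})

# After a few EM iterations, probability mass for '▁dog' concentrates on
# the pieces unique to its sentence pair (here '▁Hund' and the suffix 'en').
dog = sorted(((w, p) for (s, w), p in t.items() if s == "▁dog"),
             key=lambda x: -x[1])
print(dog[:3])
```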

Recently, researchers have also been working on neural word alignment tools, some of which you can find on my list here:

Nevertheless, I would say that even with their traditional approaches, fast_align and eflomal/efmaral are still fast, solid choices for many scenarios.

All the best,
Yasmin
