I am trying to get the word alignment of a translated sentence and have run into some questions.
Let’s say I want to translate the following sentence: I love my dog. SentencePiece tokenizes it to: ▁I ▁love ▁my ▁dog .
The translation and word alignment are: ▁I ch ▁ liebe ▁me inen ▁Hund . ||| 0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7
However, this alignment looks inaccurate to me. For example, I (Ich) should be aligned not only as 0-0 but also as 0-1.
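To make the problem easier to see, here is a minimal sketch that pairs each alignment link with its tokens. It assumes the links are written as source-target indices (0-based) and that the output line has the form "target ||| links"; the function name parse_alignment is just for illustration.

```python
def parse_alignment(source_tokens, output_line):
    """Pair up aligned tokens from a 'target ||| links' line.

    Assumes each link is 'sourceIndex-targetIndex', both 0-based.
    """
    target_text, alignment_text = output_line.split(" ||| ")
    target_tokens = target_text.split()
    pairs = []
    for link in alignment_text.split():
        src_idx, tgt_idx = (int(i) for i in link.split("-"))
        pairs.append((source_tokens[src_idx], target_tokens[tgt_idx]))
    return pairs


source = "▁I ▁love ▁my ▁dog .".split()
output = "▁I ch ▁ liebe ▁me inen ▁Hund . ||| 0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7"

pairs = parse_alignment(source, output)
for src, tgt in pairs:
    print(f"{src} -> {tgt}")
```

Read this way, the link 4-1 attaches the target piece ch to the source full stop rather than to ▁I, which is the inaccuracy described above.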
So I guess I am looking to improve the alignment accuracy. The documentation indicates that reference alignment files can help improve the results here.
Would the following approach be a reasonable way to get better word alignment results?
- Tokenize the source and target files
- Run them through fast_align
- Add the resulting file as path_align to the config file
- Retrain the model
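As a rough sketch of the second step: as far as I understand, fast_align expects one sentence pair per line in the form "source ||| target", so the tokenized files would first have to be merged into that format. The helper name to_fast_align_lines is made up for illustration, and the command in the comment is taken from the fast_align README as I remember it, so it should be checked against the installed version.

```python
def to_fast_align_lines(src_lines, tgt_lines):
    """Merge tokenized source/target lines into 'source ||| target' lines.

    Raises instead of skipping bad pairs, so the output stays
    line-parallel with the training corpus.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    lines = []
    for n, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            raise ValueError(f"empty sentence on line {n}")
        lines.append(f"{src} ||| {tgt}")
    return lines


# Already tokenized with SentencePiece (example from this post).
src_lines = ["▁I ▁love ▁my ▁dog ."]
tgt_lines = ["▁I ch ▁ liebe ▁me inen ▁Hund ."]

lines = to_fast_align_lines(src_lines, tgt_lines)
print("\n".join(lines))

# Assumed next step (verify the flags for your fast_align build):
#   fast_align -i corpus.src-tgt -d -o -v > forward.align
```

The resulting alignment file would then have one line of links per training sentence, which is presumably what path_align expects.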
However, I am a little unsure: how does fast_align know the correct alignment of the SentencePiece tokens? I don’t quite understand how this would work.