I am trying to get the word alignment of a translated sentence and have run into some questions.
Let’s say I want to translate the following sentence:
I love my dog.
SentencePiece tokenizes it to:
▁I ▁love ▁my ▁dog .
The translation and word alignment are:
▁I ch ▁ liebe ▁me inen ▁Hund . ||| 0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7
This alignment, however, looks inaccurate to me. For example, I (Ich) should be aligned not only as 0-0 but also as 0-1.
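To make the index pairs easier to inspect, here is a minimal sketch that maps each pair back to its tokens. It assumes the pairs are 0-based source-target token indices (Pharaoh-style); the helper name decode_alignment is hypothetical and not part of any library.

```python
def decode_alignment(src_tokens, tgt_tokens, alignment):
    """Map 'i-j' index pairs to (source token, target token) pairs.

    Assumption: i indexes the source tokens and j the target tokens, both 0-based.
    """
    pairs = []
    for item in alignment.split():
        src_idx, tgt_idx = (int(x) for x in item.split("-"))
        pairs.append((src_tokens[src_idx], tgt_tokens[tgt_idx]))
    return pairs


# Tokens and alignment copied from the example above.
src = "▁I ▁love ▁my ▁dog .".split()
tgt = "▁I ch ▁ liebe ▁me inen ▁Hund .".split()
alignment = "0-0 1-2 1-3 2-4 2-5 3-6 4-1 4-7"

pairs = decode_alignment(src, tgt, alignment)
for s, t in pairs:
    print(f"{s} -> {t}")
```

Read this way, the pair 4-1 links the source period to the target piece ch, which is the questionable link in this example.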
So I guess I am looking to improve the alignment accuracy. The documentation indicates that reference alignment files can help improve the results here.
Would the following approach be reasonable here to get better word alignment results?
- Tokenize the source and target files
- Run them through fast_align
- Add the resulting file as path_align to the config file
- Retrain the model
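As a rough sketch of the preparation step, the snippet below joins already-tokenized source and target files into the "source ||| target" one-pair-per-line format that fast_align reads. The function name and all file names are hypothetical, and it assumes both files are line-parallel and already SentencePiece-tokenized.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path


def write_fast_align_input(src_path, tgt_path, out_path):
    """Write 'src ||| tgt' lines and return how many pairs were written.

    Raises instead of skipping bad pairs, so the output stays
    line-parallel with the training corpus.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src_f, \
         open(tgt_path, encoding="utf-8") as tgt_f, \
         open(out_path, "w", encoding="utf-8") as out_f:
        for line_no, (s, t) in enumerate(zip_longest(src_f, tgt_f), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if not s or not t:
                raise ValueError(f"empty side at line {line_no}; filter the corpus first")
            out_f.write(f"{s} ||| {t}\n")
            written += 1
    return written


# Tiny demo with hypothetical file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    (tmp / "train.sp.en").write_text("▁I ▁love ▁my ▁dog .\n", encoding="utf-8")
    (tmp / "train.sp.de").write_text("▁I ch ▁ liebe ▁me inen ▁Hund .\n", encoding="utf-8")
    n = write_fast_align_input(tmp / "train.sp.en", tmp / "train.sp.de", tmp / "train.en-de")
    demo_output = (tmp / "train.en-de").read_text(encoding="utf-8")

print(demo_output, end="")
```

The fast_align README shows an invocation along the lines of `fast_align -i train.en-de -d -o -v > forward.align`, which produces one line of i-j pairs per sentence pair. Because the input here consists of subword pieces, the resulting indices would refer to those pieces rather than to whole words.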
I am, however, a little unsure: how does fast_align know the correct alignment of the SentencePiece tokens? I don’t quite see how this would work.