I have trained a Spanish-to-English model with alignments. Since I used SentencePiece for tokenization, my translation model accepts text tokenized by the SentencePiece model and outputs results that I later de-tokenize with the same model to get normal text.
The alignments I'm getting are based on the tokenized text (since the alignments were generated by running fast_align on tokenized text, as mentioned by guillaumekln). How do I map these tokenized alignments to tokens in the de-tokenized text?
De-tokenized source/translated text:
-
es:
'Cómo evitar que Facebook publique malos recuerdos en tu muro'
-
en:
'How to avoid Facebook publishing bad memories in your wall'
model output:
{'alignments': [0, 0], 'mapping': ['▁Cómo', '▁How']}
{'alignments': [1, 1], 'mapping': ['▁evitar', '▁to']}
{'alignments': [1, 2], 'mapping': ['▁evitar', '▁avoid']}
{'alignments': [3, 3], 'mapping': ['▁Fa', '▁F']}
{'alignments': [5, 4], 'mapping': ['book', 'ace']}
{'alignments': [5, 5], 'mapping': ['book', 'book']}
{'alignments': [6, 6], 'mapping': ['▁publique', '▁publishing']}
{'alignments': [7, 7], 'mapping': ['▁malos', '▁bad']}
{'alignments': [8, 8], 'mapping': ['▁recuer', '▁memories']}
{'alignments': [10, 9], 'mapping': ['▁en', '▁in']}
{'alignments': [11, 10], 'mapping': ['▁tu', '▁your']}
{'alignments': [12, 11], 'mapping': ['▁muro', '▁wall']}
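One possible approach (a minimal sketch, not a confirmed answer): since SentencePiece marks the start of each word with '▁', every piece index can be mapped to the index of the word it belongs to, and then each subword alignment pair can be converted to a word-level pair. The piece lists below are assumptions reconstructed from the output above; the helper names are mine.

```python
# Sketch: collapse SentencePiece subword alignments to word-level alignments.
# Assumption: a piece starting with the '▁' meta symbol begins a new word.

def piece_to_word_index(pieces):
    """Map each piece index to the index of the de-tokenized word it belongs to."""
    index_map = []
    word_idx = -1
    for piece in pieces:
        if piece.startswith('▁'):
            word_idx += 1
        # max() guards against a first piece that lacks the '▁' marker
        index_map.append(max(word_idx, 0))
    return index_map

def word_alignments(src_pieces, tgt_pieces, subword_alignments):
    """Convert (src_piece_idx, tgt_piece_idx) pairs to unique word-level pairs."""
    src_map = piece_to_word_index(src_pieces)
    tgt_map = piece_to_word_index(tgt_pieces)
    return sorted({(src_map[i], tgt_map[j]) for i, j in subword_alignments})

# Hypothetical piece lists based on the example above:
src = ['▁Cómo', '▁evitar', '▁que', '▁Fa', 'ce', 'book']
tgt = ['▁How', '▁to', '▁avoid', '▁F', 'ace', 'book']
align = [(0, 0), (1, 1), (1, 2), (3, 3), (5, 4), (5, 5)]
print(word_alignments(src, tgt, align))
# the three 'Facebook' subword pairs collapse into one word pair
```

With this mapping, all subword pairs that fall inside 'Fa/ce/book' and 'F/ace/book' collapse into a single word-level link, which is what the de-tokenized text needs.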