How to map alignments from tokenized text to de-tokenized text?

I have trained a Spanish-to-English model with alignments. Since I used SentencePiece for tokenization, my translation model accepts text tokenized by the SentencePiece model and outputs results that I later have to de-tokenize with the SentencePiece model to get normal text.

The alignments I’m getting are based on the tokenized text (since they were generated by running fast_align on tokenized text, as mentioned by guillaumekln). How do I map these tokenized alignments to tokens in the de-tokenized text?

de-tokenized source/translated text:

  • es: 'Cómo evitar que Facebook publique malos recuerdos en tu muro'
  • en: 'How to avoid Facebook publishing bad memories in your wall'

model output:

{'alignments': [0, 0], 'mapping': ['▁Cómo', '▁How']}
{'alignments': [1, 1], 'mapping': ['▁evitar', '▁to']}
{'alignments': [1, 2], 'mapping': ['▁evitar', '▁avoid']}
{'alignments': [3, 3], 'mapping': ['▁Fa', '▁F']}
{'alignments': [5, 4], 'mapping': ['book', 'ace']}
{'alignments': [5, 5], 'mapping': ['book', 'book']}
{'alignments': [6, 6], 'mapping': ['▁publique', '▁publishing']}
{'alignments': [7, 7], 'mapping': ['▁malos', '▁bad']}
{'alignments': [8, 8], 'mapping': ['▁recuer', '▁memories']}
{'alignments': [10, 9], 'mapping': ['▁en', '▁in']}
{'alignments': [11, 10], 'mapping': ['▁tu', '▁your']}
{'alignments': [12, 11], 'mapping': ['▁muro', '▁wall']}

Oops! Looks like I forgot one obvious detail: every token that begins a word starts with the '▁' symbol, so I can just look for that to group subwords back into words: '▁F', 'ace', 'book' = Facebook
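For anyone landing here later, this is a minimal sketch of that idea in plain Python. It assumes SentencePiece-style tokens where '▁' marks the start of a word. The function names are made up for illustration. The source tokens at indices 2, 4 and 9 ('▁que', 'ce', 'dos') do not appear in the model output above, so they are guesses based on the sentence.

```python
SPACER = "\u2581"  # the SentencePiece word-boundary marker "▁"


def token_to_word_ids(tokens):
    """Map each subword index to the index of the word it belongs to."""
    word_ids = []
    word = -1
    for i, token in enumerate(tokens):
        # A token starting with the spacer opens a new word. The first
        # token always opens one, even if the spacer is missing.
        if token.startswith(SPACER) or i == 0:
            word += 1
        word_ids.append(word)
    return word_ids


def word_alignments(pairs, src_tokens, tgt_tokens):
    """Convert (src_token, tgt_token) pairs to (src_word, tgt_word) pairs."""
    src_ids = token_to_word_ids(src_tokens)
    tgt_ids = token_to_word_ids(tgt_tokens)
    # The set drops duplicates produced when several subwords of the
    # same word are aligned to each other.
    return sorted({(src_ids[s], tgt_ids[t]) for s, t in pairs})


# Source tokens at indices 2, 4 and 9 are assumed (not shown in the output).
src_tokens = ["▁Cómo", "▁evitar", "▁que", "▁Fa", "ce", "book", "▁publique",
              "▁malos", "▁recuer", "dos", "▁en", "▁tu", "▁muro"]
tgt_tokens = ["▁How", "▁to", "▁avoid", "▁F", "ace", "book", "▁publishing",
              "▁bad", "▁memories", "▁in", "▁your", "▁wall"]
pairs = [(0, 0), (1, 1), (1, 2), (3, 3), (5, 4), (5, 5), (6, 6),
         (7, 7), (8, 8), (10, 9), (11, 10), (12, 11)]

src_words = "Cómo evitar que Facebook publique malos recuerdos en tu muro".split()
tgt_words = "How to avoid Facebook publishing bad memories in your wall".split()

for s, t in word_alignments(pairs, src_tokens, tgt_tokens):
    print(s, t, src_words[s], tgt_words[t])
```

The word indices line up with a whitespace split of the de-tokenized text only as long as every word starts with a '▁' token, which is the case in this example.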

You might be interested in the detokenize_with_ranges function from the OpenNMT Tokenizer.

It returns a mapping between token ids and character ranges in the detokenized text.
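To illustrate what such a mapping looks like, here is a rough pure-Python sketch that builds the de-tokenized string and records a character range for each token. It does not call the OpenNMT Tokenizer, and the function name, the half-open `(start, end)` convention and the handling of '▁' are assumptions made for this example; the library's actual signature and range conventions may differ, so check its documentation.

```python
SPACER = "\u2581"  # the SentencePiece word-boundary marker "▁"


def char_ranges_sketch(tokens):
    """Return (text, ranges), where ranges maps a token index to the
    half-open (start, end) character span it covers in text.

    Simplification: only a leading spacer is handled; a spacer in the
    middle of a token is not treated specially.
    """
    pieces = []
    ranges = {}
    pos = 0
    for i, token in enumerate(tokens):
        if token.startswith(SPACER):
            surface = token[len(SPACER):]
            # Words are separated by one space, except at the very start.
            if pos > 0:
                pieces.append(" ")
                pos += 1
        else:
            surface = token
        ranges[i] = (pos, pos + len(surface))
        pieces.append(surface)
        pos += len(surface)
    return "".join(pieces), ranges


tgt_tokens = ["▁How", "▁to", "▁avoid", "▁F", "ace", "book", "▁publishing",
              "▁bad", "▁memories", "▁in", "▁your", "▁wall"]
text, ranges = char_ranges_sketch(tgt_tokens)

# Character span covered by the three subwords of "Facebook".
start, end = ranges[3][0], ranges[5][1]
print(text[start:end])
```

With ranges like these, an alignment on token indices can be turned into an alignment on character spans, and from there onto whatever word segmentation the de-tokenized text uses.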