Translation models with alignment?

I’m trying to figure out which translation models support word alignment, but I can’t find any documentation on it. Do Transformer, TransformerAAN, TransformerBig, TransformerBigFP16, or TransformerFP16 support word alignments at prediction time if trained with the appropriate alignment flags? Also, I’m assuming a model that supports word alignment will generate the alignments alongside the translation at prediction time, which I can then use with TensorFlow Serving, right?

All single-source Transformer models support returning an alignment vector, including during serving. For the alignments to be usable, the model should be trained with guided alignment:

https://opennmt.net/OpenNMT-tf/alignments.html
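Roughly, the training configuration would include something like the following (shown here as a Python dict with placeholder file names; the same keys go in the YAML configuration file, see the page above for details):

    # Sketch of an OpenNMT-tf configuration enabling guided alignment.
    # File names are placeholders; "train.align" holds the "i-j" pairs
    # produced by an external aligner such as fast_align.
    config = {
        "data": {
            "train_features_file": "en.train.txt",
            "train_labels_file": "es.train.txt",
            "train_alignments": "train.align",
        },
        "params": {
            "guided_alignment_type": "ce",  # cross-entropy alignment loss
            "guided_alignment_weight": 1,
        },
    }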

About the alignments file: I’m planning to use fast_align to get the alignments. Since the input files to fast_align should be tokenized, how should I proceed? I already have the OpenNMT vocab, the SentencePiece tokenizer/vocab, and the raw data files.

I first tokenized the raw files with the SentencePiece tokenizer and plan to use the result as input to fast_align to get the required alignments. But I’m not sure what exactly my input to fast_align should be; which of the two options below is correct? I would really appreciate any clarification on this.

  1. With “▁” in the fast_align input?
raw files sample: 
    en.txt: 
        this is a sample test!  
    es.txt: 
        Esta es una prueba de muestra!

SentencePiece tokenized files sample:
    en.sp.txt: 
        ▁this ▁is ▁a ▁sample ▁test !
    es.sp.txt: 
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

fast_align input file to get alignments:
    ▁this ▁is ▁a ▁sample ▁test ! ||| ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !
  2. Without “▁” in the fast_align input?
raw files sample: 
    en.txt: 
        this is a sample test!  
    es.txt: 
        Esta es una prueba de muestra!

SentencePiece tokenized files sample:
    en.sp.txt: 
        ▁this ▁is ▁a ▁sample ▁test !
    es.sp.txt: 
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

fast_align input file to get alignments:
    this is a sample test ! ||| Esta es una prueba de muestra !

You should feed the tokenized data to fast_align, so including the ▁ tokens (option 1).
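For example, a small script along these lines (file names follow the samples above; the fast_align flags are the usual ones from its README) can build the combined input and produce the alignments:

    # Merge the SentencePiece-tokenized files into fast_align's "src ||| tgt" format.
    with open("en.sp.txt") as src, open("es.sp.txt") as tgt, open("corpus.en-es", "w") as out:
        for src_line, tgt_line in zip(src, tgt):
            out.write("%s ||| %s\n" % (src_line.strip(), tgt_line.strip()))

    # Then, for example:
    #   fast_align -i corpus.en-es -d -o -v > corpus.align
    # corpus.align contains one line of "i-j" pairs per sentence, which is
    # what guided alignment training expects.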


Just to make sure: I also have to use the tokenized data with the ▁ tokens during training, right?

train:
    en.train.txt:
        ▁this ▁is ▁a ▁sample ▁test !
    es.train.txt:
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

Yes, that is right.


Hi spatel, after getting the alignments for the ▁-tokenized text, how are you getting the alignments at word level for post-processing?


You can try using the function detokenize_with_ranges from OpenNMT/Tokenizer. It provides a basic way to map tokens to ranges in the detokenized text.

Hi,
Is there any example of an implementation to get word-level alignments from subword-level alignments using detokenize_with_ranges?

It seems it is not performing correctly when a joiner occurs after a subword:

¿■ ⦅mrk_case_modifier_C⦆ es posible ■?
¿Es posible? {0: (0, 1), 2: (2, 3), 3: (5, 11), 4: (12, 12)}

For ¿ it is also counting the ■. See (0, 1).

The ranges are in bytes, and indeed the character ¿ is composed of 2 bytes.

That seems to be the reason, as some accented characters are also being counted twice.
How can we know when a character is counted as 2 bytes?

You should view the text as a sequence of bytes if you want to extract the ranges:

>>> "¿Es posible?".encode("utf-8")[0:1+1].decode("utf-8")
'¿'
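
For example, with the byte ranges shown above (the end index is inclusive):

>>> data = "¿Es posible?".encode("utf-8")
>>> data[5:11 + 1].decode("utf-8")  # range (5, 11) for the token "posible"
'posible'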

We could add an option to return ranges on Unicode characters.

Thanks.
That would help avoid dealing with encoding issues in code.

For reference, the new argument was added in:

which is available in version 1.19.0.

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)  # example options
>>> tokens = ["¿■", "⦅mrk_case_modifier_C⦆", "es", "posible", "■?"]
>>> _, ranges = tokenizer.detokenize_with_ranges(tokens, unicode_ranges=True)
>>> ranges
{0: (0, 0), 2: (1, 2), 3: (4, 10), 4: (11, 11)}
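
With character-based ranges the text can be sliced directly (the end index is still inclusive):

>>> "¿Es posible?"[1:2 + 1]  # range (1, 2) for the token "es"
'Es'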
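Building on this, here is a rough sketch of projecting subword-level alignments to word level with these character ranges (the tokenizer options and the helper functions below are only illustrative, not part of the library):

    import pyonmttok

    tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)  # example options

    def token_to_word(tokens):
        # Map each token index to the index of the whitespace-separated word
        # it belongs to in the detokenized text, using character ranges.
        text, ranges = tokenizer.detokenize_with_ranges(tokens, unicode_ranges=True)
        # Compute the character span of each word in the detokenized text.
        word_spans, start = [], None
        for i, char in enumerate(text):
            if char == " ":
                if start is not None:
                    word_spans.append((start, i - 1))
                    start = None
            elif start is None:
                start = i
        if start is not None:
            word_spans.append((start, len(text) - 1))
        # Assign each token to the word whose span contains its range.
        # Placeholder tokens with no range (e.g. ⦅mrk_case_modifier_C⦆) are skipped.
        mapping = {}
        for tok_idx, (s, e) in ranges.items():  # ranges are inclusive
            for word_idx, (ws, we) in enumerate(word_spans):
                if ws <= s and e <= we:
                    mapping[tok_idx] = word_idx
                    break
        return mapping

    def subword_to_word_alignments(src_tokens, tgt_tokens, pairs):
        # pairs is an "i-j i-j ..." string at the subword level
        # (e.g. from fast_align or from the model's alignment output).
        src_map = token_to_word(src_tokens)
        tgt_map = token_to_word(tgt_tokens)
        word_pairs = set()
        for pair in pairs.split():
            i, j = (int(x) for x in pair.split("-"))
            if i in src_map and j in tgt_map:
                word_pairs.add((src_map[i], tgt_map[j]))
        return sorted(word_pairs)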