Translation models with alignment?

spatel · March 18, 2019, 5:07pm

I’m trying to figure out exactly which translation models support word alignment, and I can’t find any documentation. Do Transformer, TransformerAAN, TransformerBig, TransformerBigFP16, or TransformerFP16 support word alignments in prediction if trained with the appropriate alignment flags? Also, I’m assuming that a model that supports word alignment will generate alignments alongside the translation in prediction, which I can then use with TensorFlow Serving, right?

guillaumekln · March 18, 2019, 5:17pm

All single source Transformer models support returning an alignment vector, including during serving. For it to be usable, the model should be trained with guided alignment:

https://opennmt.net/OpenNMT-tf/alignments.html

spatel · March 19, 2019, 4:44pm

About the alignments file: I’m planning to use fast_align to get the alignments. Since the input files to fast_align should be tokenized, how should I proceed? I already have the OpenNMT vocab, the SentencePiece tokenizer/vocab, and the raw data files.

I first tokenized the raw files with the SentencePiece tokenizer, and I’m planning to use the result as input to fast_align to get the required alignments. But I’m not sure what exactly my input to fast_align should be. I would really appreciate any clarification on this one.

With “▁” in the fast_align input?

raw files sample: 
    en.txt: 
        this is a sample test!  
    es.txt: 
        Esta es una prueba de muestra!

SentencePiece tokenized files sample:
    en.sp.txt: 
        ▁this ▁is ▁a ▁sample ▁test !
    es.sp.txt: 
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

fast_align input file to get alignments:
    ▁this ▁is ▁a ▁sample ▁test ! ||| ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

Without “▁” in the fast_align input?

raw files sample: 
    en.txt: 
        this is a sample test!  
    es.txt: 
        Esta es una prueba de muestra!

SentencePiece tokenized files sample:
    en.sp.txt: 
        ▁this ▁is ▁a ▁sample ▁test !
    es.sp.txt: 
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

fast_align input file to get alignments:
    this is a sample test ! ||| Esta es una prueba de muestra !

guillaumekln · March 19, 2019, 4:53pm

You should feed the tokenized data to fast_align, so with the ▁ token.
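
A minimal sketch of building that input file, assuming the two SentencePiece-tokenized files are line-aligned (the function names and file paths are hypothetical, not part of fast_align or OpenNMT). fast_align reads one sentence pair per line, with source and target separated by ` ||| `:

```python
def to_fast_align_format(src_lines, tgt_lines):
    """Yield one 'source ||| target' line per sentence pair.

    Assumes both inputs are already tokenized and line-aligned.
    The "▁" markers are kept, since they are part of the tokens.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        # strip() removes only surrounding whitespace and newlines;
        # "▁" (U+2581) is not whitespace, so it is preserved.
        yield f"{src.strip()} ||| {tgt.strip()}"


def write_fast_align_input(src_path, tgt_path, out_path):
    """Write the merged file (paths are placeholders for illustration)."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in to_fast_align_format(src, tgt):
            out.write(line + "\n")
```

For example, `write_fast_align_input("en.sp.txt", "es.sp.txt", "en-es.sp.txt")` would produce a file whose lines look like the first sample above.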

spatel · March 19, 2019, 5:20pm

Just to make sure, I also have to use tokenized data with ▁ token during training, right?

train:
    en.train.txt:
        ▁this ▁is ▁a ▁sample ▁test !
    es.train.txt:
        ▁Esta ▁es ▁una ▁prueba ▁de ▁muestra !

guillaumekln · March 19, 2019, 5:24pm

Yes, that is right.

soumya.cbr · July 2, 2020, 11:48am

Hi spatel, after getting the alignments for the ▁ token text, how are you getting the alignments at word level for post-processing?

guillaumekln · July 6, 2020, 7:37am

You can try using the function detokenize_with_ranges from OpenNMT/Tokenizer. It provides a basic way to map tokens to ranges in the detokenized text.

anderleich · July 22, 2020, 12:25pm

Hi,
Is there any example of an implementation that gets word-level alignments from subword-level alignments using detokenize_with_ranges?

It seems it is not performing correctly when a joiner occurs after a subword:

¿￭ ｟mrk_case_modifier_C｠ es posible ￭?
¿Es posible? {0: (0, 1), 2: (2, 3), 3: (5, 11), 4: (12, 12)}

For ¿ it seems to be counting ￭ as well: see (0, 1).

guillaumekln · July 22, 2020, 12:29pm

The ranges are in bytes, and indeed the character ¿ is encoded as 2 bytes in UTF-8.

anderleich · July 22, 2020, 12:31pm

That seems to be the reason, as some accented characters are also being counted twice.
How can we know when a character is counted as 2 bytes?

guillaumekln · July 22, 2020, 12:41pm

You should view the text as a sequence of bytes if you want to extract the ranges:

>>> "¿Es posible?".encode("utf-8")[0:1+1].decode("utf-8")
'¿'

We could add an option to return ranges on Unicode characters.
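
Without such an option, the byte ranges can be converted to character ranges by decoding prefixes of the UTF-8 bytes. This is a minimal sketch; the helper name is hypothetical, and it assumes the ranges are inclusive (as in the example above) and fall on character boundaries:

```python
def byte_ranges_to_char_ranges(text, byte_ranges):
    """Convert inclusive UTF-8 byte ranges to inclusive character ranges.

    `byte_ranges` maps a token index to a (start, end) pair of byte
    offsets into text.encode("utf-8").
    """
    data = text.encode("utf-8")
    char_ranges = {}
    for token_index, (start, end) in byte_ranges.items():
        # Characters before the range start give the first character index.
        char_start = len(data[:start].decode("utf-8"))
        # Characters up to and including the last byte, minus one,
        # give the last character index.
        char_end = len(data[:end + 1].decode("utf-8")) - 1
        char_ranges[token_index] = (char_start, char_end)
    return char_ranges
```

Applied to "¿Es posible?" with the byte ranges reported above, the first range (0, 1) becomes (0, 0), since ¿ is a single character.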

anderleich · July 22, 2020, 12:45pm

Thanks.
That would be helpful for coding issues

guillaumekln · September 2, 2020, 9:17am

For reference, the new argument was added in:

which is available in version 1.19.0.

>>> tokens = ["¿￭", "｟mrk_case_modifier_C｠", "es", "posible", "￭?"]
>>> _, ranges = tokenizer.detokenize_with_ranges(tokens, unicode_ranges=True)
>>> ranges
{0: (0, 0), 2: (1, 2), 3: (4, 10), 4: (11, 11)}
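
With character ranges like these, one possible way to get word-level alignments is to map each token to the whitespace-separated word that contains it, then project the subword alignment pairs through that mapping. The sketch below is only an illustration under those assumptions: all function names are hypothetical, the alignment is assumed to be in the `i-j` Pharaoh format that fast_align writes, and tokens without a range (such as the case modifier at index 1) are simply dropped.

```python
def token_to_word_index(text, ranges):
    """Map each token index to the index of the word containing it.

    `ranges` holds inclusive character ranges into `text`, as returned
    with unicode_ranges=True. Words are whitespace-separated.
    """
    spans = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word) - 1
        spans.append((start, end))
        pos = end + 1
    mapping = {}
    for token_index, (start, _end) in ranges.items():
        for word_index, (w_start, w_end) in enumerate(spans):
            if w_start <= start <= w_end:
                mapping[token_index] = word_index
                break
    return mapping


def parse_pharaoh(line):
    """Parse a line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def project_alignment(pairs, src_map, tgt_map):
    """Project subword-level pairs to sorted, deduplicated word-level pairs."""
    word_pairs = set()
    for src_token, tgt_token in pairs:
        # Skip tokens that have no range (for example, marker tokens).
        if src_token in src_map and tgt_token in tgt_map:
            word_pairs.add((src_map[src_token], tgt_map[tgt_token]))
    return sorted(word_pairs)
```

For the example above, tokens 0 and 2 both map to the first word ("¿Es") and tokens 3 and 4 to the second ("posible?"). The same mapping has to be built separately for the source and target sentences before projecting.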