Guided Alignment Training

Hello,

I am trying to use the alignment feature (with on-the-fly tokenization).

I have prepared my alignments using fast_align (in Pharaoh format).

In the YAML file, I have configured the parameters below:

data:
  train_alignments: train-alignment.txt

params:
  guided_alignment_type: ce
  guided_alignment_weight: 1

infer:
  with_alignments: hard

Then, I ran the training with guided alignment enabled.

After training the model, I ran the translation command. However, the result I get is not accurate at all. In fact, I notice that there are more alignments than expected (meaning that not all alignments correspond to existing tokens).

Do you know what is happening and how I can solve this?

Thank you.

Hi,

You are using on-the-fly tokenization, so the translation output is detokenized. However, the returned alignments correspond to the output tokens before detokenization. I think that's the issue.
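To make this concrete, here is a minimal sketch (not your exact setup) showing how joiner-annotated tokens differ in count from the detokenized words, which is why the alignments can look longer than the visible output:

import pyonmttok

# Minimal sketch: with joiner annotation, one detokenized word can be
# produced from several tokens, so token-level alignments do not map
# one-to-one onto the detokenized output.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

tokens, _ = tokenizer.tokenize("Hello World!")
print(tokens)                        # ['Hello', 'World', '￭!'] -> 3 tokens
print(tokenizer.detokenize(tokens))  # 'Hello World!' -> 2 words

# An illustrative Pharaoh alignment such as "0-0 1-1 2-1" refers to the
# 3 tokens above, so after detokenization it looks like there are more
# alignments than tokens in the sentence.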

Hello,

Thank you very much for your prompt reply.

Yes, indeed. How could I solve this?

Moreover, I did not tokenize the data before running fast_align. Do I need to tokenize before fast_align?

Thank you.

That’s what I thought. Yes, you should tokenize the data before using fast_align, and then disable on-the-fly tokenization (since the data will already be tokenized).
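For example, a minimal sketch of that preprocessing step, assuming the parallel files are named train.en and train.de (adjust to your own paths):

import pyonmttok

# Minimal sketch with hypothetical file names: tokenize both sides of
# the parallel corpus to disk, then run fast_align on the tokenized
# files so the Pharaoh alignments refer to the same tokens the model
# will see during training.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

tokenizer.tokenize_file("train.en", "train.en.tok")
tokenizer.tokenize_file("train.de", "train.de.tok")

# fast_align is then run on train.en.tok / train.de.tok, and the
# resulting alignment file is set as data/train_alignments in the YAML
# configuration.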

Perfect! Thank you very much! :)

Hello @guillaumekln

I would like to ask you something more. If I don’t use on-the-fly tokenization and instead tokenize the data before training, how can I detokenize the output at the end of the process?

Thank you

You can use the tool of your choice. The onmt-detokenize-text script installed by OpenNMT-tf can help.

Hello,

For the tokenization, I am using pyonmttok in Python.

import pyonmttok
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
tokens, _ = tokenizer.tokenize("Hello World!")
tokens
['Hello', 'World', '￭!']

tokenizer.detokenize(tokens)
'Hello World!'

However, the output of tokenization and the input of detokenization is a list of tokens, not a string. If I join the tokens into one string in order to use them for training, how do I detokenize them afterwards? Do I split them again?

Thank you

Maybe you can use the tokenize_file method instead?
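A minimal sketch of that approach, with hypothetical file names: tokenize_file (and its detokenize_file counterpart) work directly on files, one sentence per line, so there is no need to join or split token lists yourself.

import pyonmttok

# Minimal sketch with hypothetical file names.
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# Before training: write the tokenized corpus to disk.
tokenizer.tokenize_file("corpus.txt", "corpus.tok.txt")

# After translation: turn the tokenized model output back into plain text.
tokenizer.detokenize_file("predictions.tok.txt", "predictions.txt")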