I am trying to use the alignment feature (with on-the-fly tokenization).
I have prepared my alignments using fast_align (in Pharaoh format).
In the YAML file, I have configured the following parameters:
```yaml
data:
  train_alignments: train-alignment.txt

params:
  guided_alignment_type: ce
  guided_alignment_weight: 1

infer:
  with_alignments: hard
```
Then I ran guided alignment training.
After training the model, I ran the translation command. However, the result I get is not accurate at all. In fact, I notice that there are more alignment pairs than expected, i.e. not all of them correspond to existing tokens in the output.
Do you know what is happening and how I can solve this?
You are using on-the-fly tokenization, so the translation output is detokenized. However, the returned alignment corresponds to the output tokens before detokenization. I think that's the issue?
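To illustrate, here is a minimal sketch with pyonmttok (the tokenizer options are hypothetical; the point holds for any joiner-annotated tokenization):

```python
import pyonmttok

# Hypothetical settings -- substitute the ones from your own config.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

tokens, _ = tokenizer.tokenize("Hello, World!")
print(tokens)                         # ['Hello', '￭,', 'World', '￭!'] -> 4 tokens
print(tokenizer.detokenize(tokens))   # 'Hello, World!' -> 2 whitespace-separated words

# A Pharaoh pair like 0-3 refers to target token index 3 ('￭!'),
# which no longer exists as a separate unit in the detokenized string.
```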
That’s what I thought. Yes, you should tokenize the data before running fast_align, and then disable on-the-fly tokenization (since the data will already be tokenized).
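For example, here is a minimal preprocessing sketch with pyonmttok (file names and tokenizer options are hypothetical; adapt them to your setup):

```python
import pyonmttok

# Hypothetical tokenizer options -- use the same ones you had configured
# for on-the-fly tokenization, so training sees identical tokens.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

def tokenize_file(path_in, path_out):
    # Write one space-separated line of tokens per input sentence.
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            tokens, _ = tokenizer.tokenize(line.strip())
            fout.write(" ".join(tokens) + "\n")

# Tokenize both sides once, then train on these files with
# on-the-fly tokenization disabled.
tokenize_file("train.src", "train.src.tok")
tokenize_file("train.tgt", "train.tgt.tok")

# Build the "source ||| target" input that fast_align expects, so the
# Pharaoh alignments it produces refer to the same token indices
# the model will see during training.
with open("train.src.tok") as fsrc, open("train.tgt.tok") as ftgt, \
        open("train.fastalign", "w") as fout:
    for src, tgt in zip(fsrc, ftgt):
        fout.write(f"{src.strip()} ||| {tgt.strip()}\n")
```

The Pharaoh alignments produced from `train.fastalign` then point at exactly the tokens in `train.src.tok` / `train.tgt.tok` that the model is trained on.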
I would like to ask something more. If I don't use on-the-fly tokenization and instead tokenize the data before training, how will I detokenize the output at the end of the process?
However, the output of tokenization and the input of detokenization is a list of tokens, not a string. If I join all the tokens into one string in order to build the training files, how will I detokenize them afterwards? Will I split them again?
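For example, the round trip I have in mind looks something like this (assuming pyonmttok with joiner annotations):

```python
import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Before training: join the tokens with spaces to get one line per sentence.
tokens, _ = tokenizer.tokenize("Hello, World!")
line = " ".join(tokens)              # 'Hello ￭, World ￭!'

# After translation: split on spaces to recover the token list, then
# detokenize; the joiner marks (￭) tell the tokenizer which tokens
# to glue back together.
restored = tokenizer.detokenize(line.split())
print(restored)                      # 'Hello, World!'
```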