I want to use guided alignment to handle <unk> tokens, and I am using multiple sources. I should set the alignment file for my first training source. Is that correct?
In that case, do the line numbers or sentence lengths in the alignment file influence how <unk> is handled? I mean, does it make any difference which source file I choose for train_alignments?
You should choose the source file for which the alignments were generated. Does it answer your question?
That being said, I would not recommend using guided alignment to handle UNK. It can work but this makes the training procedure more complex and error-prone.
Subword encoding is a first step if you did not get to it yet. You can also look into the --byte_fallback option from SentencePiece to further reduce unknowns.
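To make the byte fallback idea concrete, here is a toy sketch in plain Python. It is not SentencePiece's implementation, and the character-level vocabulary is an assumption made for illustration. It only shows the principle: a symbol missing from the vocabulary is split into UTF-8 byte pieces written as <0xNN> instead of becoming <unk>, so the original text can always be recovered.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy character-level encoder (illustration only).

    In-vocabulary characters are kept as pieces; any other character is
    split into its UTF-8 bytes instead of being mapped to <unk>.
    """
    pieces = []
    for char in text:
        if char in vocab:
            pieces.append(char)
        else:
            # One <0xNN> piece per UTF-8 byte of the unknown character.
            pieces.extend(f"<0x{byte:02X}>" for byte in char.encode("utf-8"))
    return pieces


def decode_pieces(pieces):
    """Invert the toy encoding by turning byte pieces back into raw bytes."""
    raw = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            raw.append(int(piece[3:5], 16))
        else:
            raw.extend(piece.encode("utf-8"))
    return raw.decode("utf-8")


if __name__ == "__main__":
    vocab = set("abc ")  # hypothetical tiny vocabulary
    pieces = encode_with_byte_fallback("ab \u00e9", vocab)
    print(pieces)
    print(decode_pieces(pieces))
```

Because every byte value has its own piece, nothing in the input is left without a representation, which is why this option removes most unknowns without any alignment data.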
I think I may have explained my intention poorly, so just to be sure we are on the same page: I use subwords with SentencePiece, but there are still some <unk> tokens during translation. I want to use train_alignments to put source tokens in the translation (instead of <unk>). Do you still recommend not using that?
Again, I think I'm doing something wrong. I set only the first source in train_alignments, but I got this error:
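For readers following along, the replacement described here can be sketched as below. This is a rough illustration and not how OpenNMT performs it internally. It assumes alignments in Pharaoh format ("source_index-target_index" pairs separated by spaces), and the function names are hypothetical. It also shows why the alignment file has to belong to the source it was generated from: the indices are token positions in that specific source sentence.

```python
def parse_alignment(line):
    """Parse one Pharaoh-format line such as "0-0 1-2" into (src, tgt) pairs."""
    return [tuple(int(index) for index in pair.split("-")) for pair in line.split()]


def replace_unknowns(source_tokens, target_tokens, alignment, unk="<unk>"):
    """Copy the aligned source token over each unknown token in the target."""
    tgt_to_src = {}
    for src_index, tgt_index in alignment:
        # Keep the first source position aligned to each target position.
        tgt_to_src.setdefault(tgt_index, src_index)

    output = []
    for tgt_index, token in enumerate(target_tokens):
        src_index = tgt_to_src.get(tgt_index)
        if token == unk and src_index is not None and src_index < len(source_tokens):
            output.append(source_tokens[src_index])
        else:
            # Unaligned unknowns are left untouched.
            output.append(token)
    return output


if __name__ == "__main__":
    source = ["Ich", "wohne", "in", "Teheran"]
    target = ["I", "live", "in", "<unk>"]
    alignment = parse_alignment("0-0 1-1 2-2 3-3")
    print(replace_unknowns(source, target, alignment))
```

If the alignment came from a different source file, the same indices would point at unrelated tokens, or past the end of the sentence, and the copied word would be wrong.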
ValueError: 1 alignment files were provided, but 8 were expected to match the number of data files
There are ways to avoid most (if not all) UNK using SentencePiece (see for example --byte_fallback). For this reason, alignments are no longer frequently used to handle UNK. Again, it can work but it is more work and the results are sometimes unexpected. You should try both I guess.
Are you using multi-source, or weighted datasets, or both?
When using multi-source, you should pass the alignment file of the first source with the target.
When using weighted datasets, you should pass one alignment file per dataset.
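The error message quoted earlier is consistent with the weighted datasets case: 8 data files were configured but only 1 alignment file. The sketch below is not the library's code. It reproduces that kind of check with hypothetical file names, so you can verify a configuration before launching training.

```python
def check_alignment_files(data_files, alignment_files):
    """Pair each data file with its alignment file, or fail on a count mismatch.

    Illustration only: one alignment file is expected per (weighted) dataset.
    """
    if isinstance(alignment_files, str):
        alignment_files = [alignment_files]
    if len(alignment_files) != len(data_files):
        raise ValueError(
            "%d alignment files were provided, but %d were expected "
            "to match the number of data files"
            % (len(alignment_files), len(data_files))
        )
    return list(zip(data_files, alignment_files))


if __name__ == "__main__":
    # Hypothetical names: 8 weighted datasets, each with its own alignment file.
    datasets = ["train.%d.src" % i for i in range(8)]
    alignments = ["train.%d.align" % i for i in range(8)]
    for data_file, alignment_file in check_alignment_files(datasets, alignments):
        print(data_file, "->", alignment_file)
```

With multi-source only, there is a single dataset, so a single alignment file (first source against the target) passes the same check.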