I want to use guided alignment to handle <unk> tokens. I am using multiple sources, so I should set the alignment file for my first training source. Is that correct?
In that case, do the line numbers or sentence lengths in the alignment file influence how <unk> is handled? I mean, does it make any difference which source file I choose for train_alignments?
I think I may have explained my intention poorly, so just to be sure we are on the same page: I use subwords with SentencePiece, but there are still some <unk> tokens during translation. I want to use train_alignments to put source tokens in the translation (instead of <unk>). Do you still recommend not using that?
Again, I think I’m doing something wrong. I set only the first source in train_alignments, but I got this error:
ValueError: 1 alignment files were provided, but 8 were expected to match the number of data files
There are ways to avoid most (if not all) UNK tokens with SentencePiece (see, for example, the --byte_fallback training option). For this reason, alignments are no longer frequently used to handle UNK. Again, it can work, but it takes more effort and the results are sometimes unexpected. You should try both, I guess.
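To make the alignment-based idea concrete, here is a minimal sketch of what "putting the source token in place of <unk>" means. It is only an illustration and not the library's implementation: the function name `replace_unknowns` and the list-of-pairs alignment input are assumptions made for this example.

```python
def replace_unknowns(source_tokens, target_tokens, alignment, unk="<unk>"):
    """Replace each unk target token with the source token aligned to it.

    alignment is an iterable of (source_index, target_index) pairs.
    This helper is hypothetical and only illustrates the idea.
    """
    # Map each target position to the first source position aligned to it.
    target_to_source = {}
    for src_idx, tgt_idx in alignment:
        target_to_source.setdefault(tgt_idx, src_idx)

    output = []
    for tgt_idx, token in enumerate(target_tokens):
        src_idx = target_to_source.get(tgt_idx)
        if token == unk and src_idx is not None and src_idx < len(source_tokens):
            # Copy the aligned source token instead of emitting the unk token.
            output.append(source_tokens[src_idx])
        else:
            # Keep the token: it is not unk, or no alignment is available.
            output.append(token)
    return output


source = ["Das", "ist", "Zürich"]
target = ["This", "is", "<unk>"]
alignment = [(0, 0), (1, 1), (2, 2)]
print(replace_unknowns(source, target, alignment))
```

Note that an <unk> without an aligned source position is left as-is, and a wrong alignment copies the wrong source token, which is one way the results can be unexpected.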
Are you using multi-source, or weighted datasets, or both?
When using multi-source, you should pass the alignment file of the first source with the target.
When using weighted datasets, you should pass one alignment file per dataset.
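Regarding line numbers: an alignment file has to correspond line by line to the source and target files it was computed from. Below is a rough sanity check you could run on each (first source, target, alignment) triple before training. It assumes the alignments are in Pharaoh format (space-separated `i-j` pairs of 0-based source and target token positions) and that the text files are already tokenized the same way as for training. The function name `check_alignments` is made up for this sketch.

```python
def check_alignments(source_lines, target_lines, alignment_lines):
    """Return a list of problems; an empty list means the files look consistent.

    Assumes Pharaoh-format alignments ("0-0 1-2 ...") with 0-based indices.
    """
    problems = []
    counts = (len(source_lines), len(target_lines), len(alignment_lines))
    if len(set(counts)) != 1:
        problems.append(
            "line counts differ: source=%d target=%d alignments=%d" % counts
        )
        return problems

    rows = zip(source_lines, target_lines, alignment_lines)
    for line_no, (src, tgt, ali) in enumerate(rows, start=1):
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        for pair in ali.split():
            try:
                i, j = (int(x) for x in pair.split("-"))
            except ValueError:
                problems.append("line %d: malformed pair %s" % (line_no, pair))
                continue
            # Each index must point at an existing token on that line.
            if i >= src_len or j >= tgt_len:
                problems.append("line %d: pair %s is out of range" % (line_no, pair))
    return problems


source = ["a b c", "d e"]
target = ["x y", "z"]
alignments = ["0-0 2-1", "0-0 1-1"]
print(check_alignments(source, target, alignments))
```

If you use weighted datasets, you would run a check like this once per dataset, since each dataset needs its own alignment file.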