Guided alignment and weighted datasets

Hi,

I want to use guided alignment for handling &lt;unk&gt;, and I use multiple sources. Should I set the alignment file of my first training source? Is that correct?
Also, do the line numbers or sentence lengths in the alignment file influence how &lt;unk&gt; is handled? I mean, does it make any difference which source file I choose to set for train_alignments?

Yes.

You should choose the source file for which the alignments were generated. Does that answer your question?


That being said, I would not recommend using guided alignment to handle UNK. It can work but this makes the training procedure more complex and error-prone.

Subword encoding is a first step if you did not get to it yet. You can also look into the --byte_fallback option from SentencePiece to further reduce unknowns.

Thank you for your answer.

I think I might have explained my intention wrong, so just to be sure that we are on the same page: I use subwords and SentencePiece, but there are still some &lt;unk&gt; tokens during translation. I want to use train_alignments to put source tokens in the translation (instead of &lt;unk&gt;). Do you still recommend not using that?
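To make the idea concrete, here is a rough sketch of what I mean (a hypothetical helper, not the actual OpenNMT implementation; it assumes whitespace-split tokens and 0-based (source, target) alignment pairs such as those produced by fast_align):

```python
def replace_unk(src_tokens, hyp_tokens, alignments, unk="<unk>"):
    """Replace each <unk> in the hypothesis with its aligned source token.

    alignments: list of (src_index, hyp_index) pairs, 0-based.
    """
    # Map each hypothesis position to its (last) aligned source position.
    hyp_to_src = {}
    for src_i, hyp_i in alignments:
        hyp_to_src[hyp_i] = src_i

    out = []
    for i, tok in enumerate(hyp_tokens):
        if tok == unk and i in hyp_to_src:
            out.append(src_tokens[hyp_to_src[i]])
        else:
            out.append(tok)
    return out

src = "das Haus ist Zughafen".split()
hyp = "the house is <unk>".split()
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(replace_unk(src, hyp, align))  # ['the', 'house', 'is', 'Zughafen']
```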

Again, I think I’m doing something wrong. I just set the first source in train_alignments, but I got this error:

ValueError: 1 alignment files were provided, but 8 were expected to match the number of data files

There are ways to avoid most (if not all) UNK using SentencePiece (see for example --byte_fallback). For this reason, alignments are no longer frequently used to handle UNK. Again, it can work but it is more work and the results are sometimes unexpected. You should try both I guess.

Are you using multi-source, or weighted datasets, or both?

  • When using multi-source, you should pass the alignment file of the first source with the target.
  • When using weighted datasets, you should pass one alignment file per dataset.
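For the weighted datasets case, the data section of the YAML configuration might look like this (a sketch; the file names are placeholders):

```yaml
data:
  train_features_file:
    - corpus_1.src
    - corpus_2.src
  train_labels_file:
    - corpus_1.tgt
    - corpus_2.tgt
  train_alignments:
    - corpus_1.align
    - corpus_2.align
```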

Thank you for your suggestion. I can’t find any reference about --byte_fallback and how to use it. Could you also help me with this? :pray: :pray:

I checked that again. I don’t have any weighted datasets, but if I just set the first source I get the ValueError.

--byte_fallback is an option of the SentencePiece training: https://github.com/google/sentencepiece/blob/master/doc/options.md
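It is passed when training the SentencePiece model, for example (a sketch; the file names and vocabulary size are placeholders):

```shell
spm_train \
  --input=train.txt \
  --model_prefix=spm \
  --vocab_size=32000 \
  --model_type=bpe \
  --byte_fallback=true
```

With byte fallback enabled, characters not covered by the vocabulary are decomposed into byte pieces instead of being mapped to &lt;unk&gt;.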

I don’t know if they have another documentation for this option or other examples.

Do you have 8 sources? That’s a lot. Can you post your YAML configuration and the model definition you are using?

I use the Transformer model and my config file is this:

The whole training dataset is around 17,000,000 sentences. Do you think that is a lot?

So you are using weighted datasets and not multi-source. Multi-source refers to specific model architectures; see for example:

So here’s what you should do:

When using weighted datasets, you should pass one alignment file per dataset.

Thank you for your help, and sorry for my misunderstanding.