Guided alignment and weighted datasets

jalesiyan-hadis · November 26, 2020, 10:19am

Hi,

I want to use guided alignment to handle <unk>, and I use multiple sources, so I should set the alignment file of my first training source. Is that correct?
In that case, do the line numbers or sentence lengths (in the alignment file) influence how <unk> is handled? I mean, does it make any difference which source file I choose for train_alignments?

guillaumekln · November 26, 2020, 8:29pm

Yes.

You should choose the source file for which the alignments were generated. Does it answer your question?

That being said, I would not recommend using guided alignment to handle UNK. It can work but this makes the training procedure more complex and error-prone.

Subword encoding is a first step if you did not get to it yet. You can also look into the --byte_fallback option from SentencePiece to further reduce unknowns.

jalesiyan-hadis · November 27, 2020, 8:21am

Thank you for your answer.

I think I might have explained my intention wrong, so just to be sure that we are on the same page: I use subwords and SentencePiece, but there are still some <unk> tokens during translation. I want to use train_alignments to put source tokens in the translation (instead of <unk>). Do you still recommend not using that?

Again, I think I’m doing something wrong. I only set the first source in train_alignments, but I got this error:

ValueError: 1 alignment files were provided, but 8 were expected to match the number of data files

guillaumekln · November 27, 2020, 10:49am

There are ways to avoid most (if not all) UNK using SentencePiece (see for example --byte_fallback). For this reason, alignments are no longer frequently used to handle UNK. Again, it can work but it is more work and the results are sometimes unexpected. You should try both I guess.
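For reference, the alignment-based approach amounts to copying the aligned source token wherever the output contains <unk>. The following is only a minimal sketch of that idea, not the OpenNMT-tf implementation: the function names are hypothetical, and it assumes the alignment is already available as Pharaoh-style "i-j" pairs (source index, target index), whereas at inference time a model would derive it from its attention.

```python
UNK = "<unk>"


def parse_pharaoh(line):
    """Parse 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def replace_unknowns(source_tokens, target_tokens, alignment):
    """Replace each <unk> target token with the source token aligned to it."""
    # Keep the first source position aligned to each target position.
    target_to_source = {}
    for src_idx, tgt_idx in alignment:
        target_to_source.setdefault(tgt_idx, src_idx)

    output = []
    for tgt_idx, token in enumerate(target_tokens):
        src_idx = target_to_source.get(tgt_idx)
        if token == UNK and src_idx is not None and src_idx < len(source_tokens):
            output.append(source_tokens[src_idx])
        else:
            # Known tokens, and <unk> tokens without an alignment, are kept.
            output.append(token)
    return output


source = "Ich wohne in Zürich".split()
target = "I live in <unk>".split()
alignment = parse_pharaoh("0-0 1-1 2-2 3-3")
print(" ".join(replace_unknowns(source, target, alignment)))
```

The quality of the result depends entirely on the alignment: a wrong or missing pair copies the wrong token or leaves the <unk> in place, which is one way the results can be unexpected.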

Are you using multi-source, or weighted datasets, or both?

When using multi-source, you should pass the alignment file of the first source with the target.
When using weighted datasets, you should pass one alignment file per dataset.
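A quick sanity check before training can catch the mismatch behind the ValueError above. This is a sketch with a hypothetical helper name, assuming the data section of the configuration has been loaded into a Python dict (for example with a YAML parser) and that each alignment file must have exactly one line per training example.

```python
def as_list(value):
    """Normalize a single path or a list of paths to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def count_lines(path):
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_alignment_files(data_config):
    """Return a list of human-readable problems (empty if none were found)."""
    features = as_list(data_config.get("train_features_file"))
    labels = as_list(data_config.get("train_labels_file"))
    alignments = as_list(data_config.get("train_alignments"))

    if len(alignments) != len(features):
        # Files cannot be paired up, so stop here.
        return [
            f"{len(alignments)} alignment files were provided, "
            f"but {len(features)} data files are listed"
        ]

    problems = []
    for feat, lab, ali in zip(features, labels, alignments):
        n_feat, n_lab, n_ali = count_lines(feat), count_lines(lab), count_lines(ali)
        if not (n_feat == n_lab == n_ali):
            problems.append(
                f"line count mismatch: {feat} ({n_feat}), {lab} ({n_lab}), {ali} ({n_ali})"
            )
    return problems
```

An empty list means every dataset has its own alignment file with a matching number of lines.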

jalesiyan-hadis · November 27, 2020, 12:10pm

Thank you for your suggestion. I can’t find any reference about --byte_fallback and how to use it. Could you also help me with this?

I checked again: I don’t have any weighted datasets, but if I only set the first source I get the ValueError.

guillaumekln · November 27, 2020, 12:52pm

--byte_fallback is an option of the SentencePiece training: https://github.com/google/sentencepiece/blob/master/doc/options.md
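As a rough illustration of how the option is passed, the sketch below builds an spm_train command line with byte fallback enabled. The helper name, the corpus path, the model prefix, and the vocabulary size are placeholders chosen for the example, not values from this thread.

```python
import shlex


def build_spm_train_command(corpus_path, model_prefix, vocab_size=32000):
    """Build an spm_train command that enables byte fallback."""
    return [
        "spm_train",
        f"--input={corpus_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # Characters missing from the vocabulary are split into UTF-8 byte
        # pieces instead of being mapped to <unk>.
        "--byte_fallback=true",
    ]


if __name__ == "__main__":
    # "train.de" and "spm_de" are hypothetical names for this example.
    command = build_spm_train_command("train.de", "spm_de")
    print(shlex.join(command))
    # To actually train, run the printed command in a shell where the
    # SentencePiece command line tools are installed.
```

The model has to be retrained with this option and the data re-encoded with the new model; an existing SentencePiece model does not gain byte pieces after the fact.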

I don’t know if they have another documentation for this option or other examples.

Do you have 8 sources? That’s a lot. Can you post your YAML configuration and the model definition you are using?

jalesiyan-hadis · November 27, 2020, 1:24pm

I use a Transformer model and this is my config file:

eval:
  batch_size: 32
  batch_type: examples
  early_stopping:
    metric: loss
    min_improvement: 0.01
    steps: 20
  export_format: saved_model
  export_on_best: bleu
  external_evaluators: BLEU
  steps: 5000
gpu_allow_growth: true
infer:
  batch_size: 32
  batch_type: examples
params:
  beam_width: 2
  length_penalty: 0.2
  optimizer: Adam
score: null
train:
  batch_type: tokens
  keep_checkpoint_max: 8
  max_step: 300000
  sample_buffer_size: 200000
  save_checkpoints_steps: 500
data:
  eval_features_file: /Projects/ONMT/Models-deen/Data/training/general/tok/general.test.tok.de
  eval_labels_file: /Projects/ONMT/Models-deen/Data/training/general/tok/general.test.tok.en
  source_embedding:
    case_insensitive: false
    path: /Projects/Downloads/Word-Embedding/BPEmb/de/50000/de.wiki.bpe.vs50000.d300.w2v.txt
    trainable: true
    with_header: true
  source_vocabulary: /Projects/ONMT/Models-deen/Data/training/general/tok/vocab_32.de
  target_embedding:
    case_insensitive: false
    path: /Projects/Downloads/Word-Embedding/BPEmb/en/en.wiki.bpe.vs50000.d300.w2v.txt
    trainable: true
    with_header: true
  target_vocabulary: /Projects/ONMT/Models-deen/Data/training/general/tok/vocab_32.en
  train_alignments: /Projects/ONMT/Models-deen/Data/training/general/tok/aligned/EUbookshop.de-en.train.tok.fast_align
  train_features_file:
    - /Projects/ONMT/Models-deen/Data/training/general/tok/EUbookshop.de-en.train.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/computer.train.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/QED.de-en.train.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/TED2013.de-en.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/TildeMODEL.de-en.train.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/WMT-News.de-en.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/Tatoeba.de-en.tok.de
    - /Projects/ONMT/Models-deen/Data/training/general/tok/Wikipedia.de-en.train.tok.de
  train_labels_file:
    - /Projects/ONMT/Models-deen/Data/training/general/tok/EUbookshop.de-en.train.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/computer.train.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/QED.de-en.train.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/TED2013.de-en.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/TildeMODEL.de-en.train.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/WMT-News.de-en.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/Tatoeba.de-en.tok.en
    - /Projects/ONMT/Models-deen/Data/training/general/tok/Wikipedia.de-en.train.tok.en

The whole training dataset is around 17,000,000 sentences. Do you think that is a lot?

guillaumekln · November 27, 2020, 1:32pm

So you are using weighted datasets and not multi-source. Multi-source refers to specific model architectures, read for example:

So here’s what you should do:

When using weighted datasets, you should pass one alignment file per dataset.
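Since the existing alignment file carries a .fast_align extension, one way to get the remaining alignment files is to prepare a separate aligner input for each source/target pair. The sketch below writes the "source ||| target" format that fast_align reads; the helper name is hypothetical, and it assumes the tokenized files are exactly the ones used for training so that alignment indices match the training tokens.

```python
from itertools import zip_longest


def write_fast_align_input(source_path, target_path, output_path):
    """Write one 'source ||| target' line per sentence pair.

    Returns the number of pairs written. No line is skipped, so the
    resulting alignment file keeps one line per training example.
    """
    count = 0
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt, \
            open(output_path, "w", encoding="utf-8") as out:
        for src_line, tgt_line in zip_longest(src, tgt):
            if src_line is None or tgt_line is None:
                raise ValueError(
                    f"{source_path} and {target_path} have different line counts"
                )
            out.write(f"{src_line.strip()} ||| {tgt_line.strip()}\n")
            count += 1
    return count
```

Running the aligner once per dataset then yields one alignment file per entry, and train_alignments should list them in the same order as train_features_file and train_labels_file.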

jalesiyan-hadis · November 27, 2020, 1:37pm

Thank you for your help,
and sorry for my misunderstanding.