Different subword tokenization for the same word pattern

jalesiyan-hadis · November 20, 2020, 11:44am

Hi guys,
I tokenized my data with the following configuration and trained my model.

case_markup: true
joiner_annotate: true
mode: aggressive
preserve_placeholders: true
segment_alphabet_change: true
segment_case: true
sp_model_path: de.wiki.bpe.vs25000.model

Now, in the translation step, I noticed something. Both my training text and my translation text contain words like these, shown with their tokenization:

E1391 tokenized to: ｟mrk_case_modifier_C｠ e ￭1391
E1392 tokenized to: ｟mrk_case_modifier_C｠ e ￭139 ￭2
E1393 tokenized to: ｟mrk_case_modifier_C｠ e ￭1393

So my model learned to translate *e ￭139 ￭2* correctly, but it translates the ones with a single four-digit token incorrectly.
Do you have any idea why this happens?
Since my texts contain this kind of combination (letters + numbers), do you think it would be a better idea to use segment_numbers: true in my tokenization?

Thanks for your help.

guillaumekln · November 20, 2020, 1:06pm

Hi,

You can often find the explanation in your training data. For example, one pattern could be more frequent than the others.

Yes, using segment_numbers: true is a good idea here.
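
As a rough way to check the frequency explanation, you could count how often each joiner-prefixed digit piece appears in the tokenized training file. The snippet below is only a sketch: the function name count_number_pieces and the sample lines are made up for illustration (they reuse the tokens from the question, not real corpus counts), and it assumes one tokenized sentence per line with space-separated tokens.

```python
from collections import Counter

JOINER = "￭"  # joiner mark produced by joiner_annotate


def count_number_pieces(tokenized_lines):
    """Count joiner-prefixed tokens that consist only of digits.

    Hypothetical helper: assumes tokens are separated by spaces.
    """
    counts = Counter()
    for line in tokenized_lines:
        for token in line.split():
            if token.startswith(JOINER) and token[len(JOINER):].isdigit():
                counts[token] += 1
    return counts


# Toy input for illustration only, not real training data.
sample = [
    "｟mrk_case_modifier_C｠ e ￭1391",
    "｟mrk_case_modifier_C｠ e ￭139 ￭2",
    "｟mrk_case_modifier_C｠ e ￭139 ￭2",
    "｟mrk_case_modifier_C｠ e ￭1393",
]

counts = count_number_pieces(sample)
for piece, n in counts.most_common():
    print(piece, n)
```

If pieces such as ￭139 and ￭2 turn out to be much more frequent than whole tokens such as ￭1391, that would be consistent with the model handling the split form better.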

jalesiyan-hadis · November 20, 2020, 1:11pm

Thank you for your answer.

So if that is the reason, should using segment_numbers: true solve this problem? Or do I also need to do something else about it?

guillaumekln · November 20, 2020, 1:15pm

segment_numbers should solve this issue. The model will learn to copy a sequence of digits.
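
To illustrate why this helps, here is a small sketch of the effect of splitting numbers into single digits. It is not the Tokenizer's actual implementation, only an approximation written for this example (split_digits and segment_numbers_sketch are hypothetical names), and it assumes joiners are attached on the left as in the tokens above.

```python
JOINER = "￭"  # joiner mark produced by joiner_annotate


def split_digits(token):
    """Split an all-digit token into one token per digit.

    The first digit keeps the token's original joiner (if any);
    every following digit is marked as joined to the previous one.
    """
    prefix = JOINER if token.startswith(JOINER) else ""
    body = token[len(prefix):]
    if not body.isdigit():
        return [token]  # leave non-numeric tokens untouched
    return [prefix + body[0]] + [JOINER + d for d in body[1:]]


def segment_numbers_sketch(tokens):
    """Apply split_digits to every token in a list."""
    out = []
    for token in tokens:
        out.extend(split_digits(token))
    return out


for last in ("1391", "139", "1393"):
    print(" ".join(segment_numbers_sketch(["e", JOINER + last])))
```

With this kind of segmentation, E1391, E1392 and E1393 all become the same shape (a letter followed by single digits), so the model no longer sees a rare four-digit token for some of them and a more common split for others.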