Different subword tokenization for the same word pattern

Hi guys,
I tokenized my data with the following configuration and trained my model.

case_markup: true
joiner_annotate: true
mode: aggressive
preserve_placeholders: true
segment_alphabet_change: true
segment_case: true
sp_model_path: de.wiki.bpe.vs25000.model

Now, in the translation part, I noticed something. In both my training and translation texts there are words like these, with their tokenizations:

E1391 is tokenized to: ⦅mrk_case_modifier_C⦆ e ■1391
E1392 is tokenized to: ⦅mrk_case_modifier_C⦆ e ■139 ■2
E1393 is tokenized to: ⦅mrk_case_modifier_C⦆ e ■1393

So my model learned to translate *e ■139 ■2* correctly, but it translates the other patterns, the ones kept as a single four-digit piece, wrong.
Do you have any idea why this happens?
Since this kind of alphabet+number combination occurs in my texts, do you think it might be a better idea to use segment_numbers: true in my tokenization?
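
For reference, here is roughly how these tokenizations can be reproduced with pyonmttok (a minimal sketch, assuming the options above and that tokenize returns the token list plus optional features; the model path is just my local file):

```python
import pyonmttok

# Sketch: build a tokenizer with the same options as in my config above.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    sp_model_path="de.wiki.bpe.vs25000.model",
    joiner_annotate=True,
    case_markup=True,
    preserve_placeholders=True,
    segment_alphabet_change=True,
    segment_case=True,
)

for word in ["E1391", "E1392", "E1393"]:
    tokens, _ = tokenizer.tokenize(word)
    print(word, "->", " ".join(tokens))
```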

Thanks for your help.

Hi,

You can often find the explanation in your training data. For example, one pattern could be more frequent than the others.

Yes.

Thank you for your answer.

So if that is the reason, should using segment_numbers: true solve this problem, or do I also need to do something else about it?

segment_numbers should solve this issue. The model will learn to copy a sequence of digits.
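
For illustration, here is what that would look like with segment_numbers enabled (a sketch reusing the configuration from the question; the exact subwords depend on your SentencePiece model, so the output in the comment is only the expected shape):

```python
import pyonmttok

# Same options as in the question, plus segment_numbers.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    sp_model_path="de.wiki.bpe.vs25000.model",
    joiner_annotate=True,
    case_markup=True,
    preserve_placeholders=True,
    segment_alphabet_change=True,
    segment_case=True,
    segment_numbers=True,
)

tokens, _ = tokenizer.tokenize("E1391")
print(" ".join(tokens))
# Expected shape: each digit becomes its own joiner-annotated token, e.g.
# ⦅mrk_case_modifier_C⦆ e ■1 ■3 ■9 ■1
# so E1391, E1392, and E1393 all share the same digit-by-digit pattern,
# and the model only has to learn to copy digits.
```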
