Hi guys,
I tokenized my data with the following configuration and trained my model:
case_markup: true
joiner_annotate: true
mode: aggressive
preserve_placeholders: true
segment_alphabet_change: true
segment_case: true
sp_model_path: de.wiki.bpe.vs25000.model
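
For reference, this is roughly how that configuration maps onto the pyonmttok Python bindings (a minimal sketch under the assumption that pyonmttok is used; my real pipeline may load these options from YAML instead):

```python
import pyonmttok

# Same options as the configuration above (sketch, not my exact script).
tokenizer = pyonmttok.Tokenizer(
    mode="aggressive",
    sp_model_path="de.wiki.bpe.vs25000.model",
    joiner_annotate=True,
    case_markup=True,
    segment_case=True,
    segment_alphabet_change=True,
    preserve_placeholders=True,
)

tokens, _ = tokenizer.tokenize("E1391")
print(tokens)  # expected to give something like: ['⦅mrk_case_modifier_C⦆', 'e', '■1391']
```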
Now, at translation time, I noticed something. Both my training and translation texts contain words like these, which are tokenized as follows:
E1391 tokenized to: ⦅mrk_case_modifier_C⦆ e ■1391
E1392 tokenized to: ⦅mrk_case_modifier_C⦆ e ■139 ■2
E1393 tokenized to: ⦅mrk_case_modifier_C⦆ e ■1393
So my model learned to translate *e ■139 ■2* correctly, but it gets the other cases, where the four digits stay together as one token (e.g. *e ■1391*), wrong.
Do you have any idea why this happens?
Since my texts contain many of these alphabet+number combinations, do you think it would be a better idea to use segment_numbers: true in my tokenization?
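
Something like this is what I have in mind (again a sketch with pyonmttok; the output in the comment is only my expectation from the docs, since segment_numbers should split numbers into single digits in aggressive mode, and I have not verified it on my data or retrained yet):

```python
# Sketch: same options as before, plus segment_numbers (not verified on my data).
tokenizer = pyonmttok.Tokenizer(
    mode="aggressive",
    sp_model_path="de.wiki.bpe.vs25000.model",
    joiner_annotate=True,
    case_markup=True,
    segment_case=True,
    segment_alphabet_change=True,
    segment_numbers=True,
    preserve_placeholders=True,
)

tokens, _ = tokenizer.tokenize("E1391")
print(tokens)
# My expectation: each digit becomes its own joined token,
# e.g. ['⦅mrk_case_modifier_C⦆', 'e', '■1', '■3', '■9', '■1'],
# so E1391, E1392 and E1393 would all follow the same pattern.
```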
Thanks for your help.