I tokenized my data based on this configuration and trained my model.
Now, in the translation part, I noticed something. In both my training and translation texts there are words like these, with their tokenizations:
E1391 tokenised to: ｟mrk_case_modifier_C｠ e ￭1391
E1392 tokenised to: ｟mrk_case_modifier_C｠ e ￭139 ￭2
E1393 tokenised to: ｟mrk_case_modifier_C｠ e ￭1393
So, my model learned to translate *e ￭139 ￭2* correctly, but for the others, where the four-digit number is kept as a single token, it translates them wrong.
Do you have any idea why this happens?
Since my texts contain this kind of combination (letter + number), do you think it might be a better idea to use
`segment_numbers: true` in my tokenization?
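For context, here is a minimal sketch of what I mean. The surrounding options (`mode`, `joiner_annotate`, `case_markup`) are assumptions on my part, inferred from the joiner `￭` and case-modifier markers shown above; only the `segment_numbers` line is the change in question:

```yaml
# Hypothetical tokenization config sketch; the first three
# options are assumed from the markers visible in the output.
mode: aggressive
joiner_annotate: true
case_markup: true
segment_numbers: true   # split numbers into single digits, e.g. E1391 -> e ￭1 ￭3 ￭9 ￭1
```

My understanding is that with `segment_numbers: true` every digit becomes its own token, so translating a number no longer depends on whether that exact digit sequence (like 1391) was frequent enough in the training data to be kept as one subword.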
Thanks for your help.