Problem in tokenize and detokenize during translation

jalesiyan-hadis · September 16, 2020, 1:40pm

Hi
I trained my DE-EN model, but during translation there is a problem with tokenization and detokenization.

source tokenize config:

type: OpenNMTTokenizer
params:
  mode: none
  joiner_annotate: true
  preserve_placeholders: true
  segment_numbers: true
  segment_alphabet_change: true
  case_feature: false
  segment_case: true
  case_markup: true
  sp_model_path: de.wiki.bpe.vs50000.model

target tokenize config:

type: OpenNMTTokenizer
params:
  mode: none
  joiner_annotate: true
  preserve_placeholders: true
  segment_numbers: true
  segment_alphabet_change: true
  case_feature: false
  segment_case: true
  case_markup: true
  sp_model_path: en.wiki.bpe.vs50000.model

During translation, there is no case markup in the tokenized source. For example, the following sentence:

Hochleistungs-Mähaufbereiter BiG M 450
tokenized to: ￭hoch ￭leistungs ￭- ￭mä ￭h ￭auf ￭bereiter big m 45 ￭0

Detokenization also seems to do nothing. For example:

■performance big m 45 ■0
detokenized to: ■performance big m 45 ■0

I detokenized with both detokenize.lua and pyonmttok, and the result was the same.
Thank you for your help.

guillaumekln · September 16, 2020, 2:31pm

Hi,

Unfortunately case_markup does not work with the “none” tokenization mode. Here the sentence is simply lowercased and passed as-is to SentencePiece. The casing information of each token is lost.

There is definitely something to improve here. At least the documentation.
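
To illustrate why plain lowercasing loses information while case markup does not, here is a toy sketch in pure Python. It is not the OpenNMT Tokenizer implementation: the "<C>" and "<U>" markers and all function names are hypothetical stand-ins, and mixed-case tokens such as "BiG" are deliberately not handled.

```python
# Toy illustration only, NOT the OpenNMT Tokenizer implementation.
# "<C>" and "<U>" are hypothetical stand-ins for case markup tokens.

def lowercase_only(tokens):
    """Lowercase every token; the original casing cannot be recovered."""
    return [tok.lower() for tok in tokens]


def with_case_markup(tokens):
    """Lowercase tokens, emitting a marker before capitalized or uppercase ones."""
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            out.append("<U>")  # the whole token was uppercase
        elif tok[:1].isupper():
            out.append("<C>")  # only the first letter was uppercase
        out.append(tok.lower())
    return out


def restore_case(tokens):
    """Invert with_case_markup by applying each marker to the next token."""
    out, pending = [], None
    for tok in tokens:
        if tok in ("<C>", "<U>"):
            pending = tok
            continue
        if pending == "<U>":
            tok = tok.upper()
        elif pending == "<C>":
            tok = tok.capitalize()
        out.append(tok)
        pending = None
    return out


words = ["Hochleistungs", "Mähaufbereiter", "M", "450"]
print(lowercase_only(words))                   # casing is gone for good
print(with_case_markup(words))                 # casing is kept as extra tokens
print(restore_case(with_case_markup(words)))   # casing can be restored
```

With markup, the model sees the casing as separate tokens and can learn to produce them; with lowercasing alone there is nothing left to learn from.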

■performance big m 45 ■0

Was this produced by the model? The big squares typically mean that the tokenized text was passed again to the tokenization script.

jalesiyan-hadis · September 16, 2020, 3:08pm

Thank you for your answer.
Yes, I checked it again to be sure. I used the OpenNMT Docker image to run detokenize.lua, but target_file.en and final_file.en are exactly the same.

th tools/detokenize.lua < target_file.en > final_file.en

I also called detokenize directly from the Python CLI, but the result is the same:

opennmt.tokenizers.Tokenizer.detokenize(str_list)

I don’t know what could be the problem

But my training corpora were tokenized perfectly with case_markup and ‘none’ mode!
In this case, is there any way to use case annotation in ‘none’ mode? I need my model to learn to handle uppercase.

guillaumekln · September 16, 2020, 4:29pm

During training, did you pass tokenized files and set the tokenization configuration?

I mean, it did not crash but it did not work as you expected because the case markup was not generated.

Currently there is no case annotation in this mode. If you need case markup you should use the default tokenization modes such as “conservative” or “aggressive”.

jalesiyan-hadis · September 16, 2020, 5:21pm

Yes, I set the tokenization configuration in my config file (it is the same one I posted in my first message) and passed tokenized files for training.

guillaumekln · September 16, 2020, 5:27pm

OK, this is the issue. If you tokenize the data before training, there is no need to tell OpenNMT-tf about your tokenization options. Here the training tokenized your data a second time!

You probably want to remove the tokenization options from the YAML file, and retrain.
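
A quick sanity check before training can catch this kind of double tokenization. The sketch below is only a heuristic, and the helper names are hypothetical: it flags lines that already contain the joiner marker "￭" or the big square "■" shown earlier in this thread, which suggests the file was tokenized beforehand.

```python
# Heuristic sketch: flag lines that already look tokenized so they are not
# tokenized a second time. Helper names are hypothetical, not a library API.

JOINER = "\uFFED"      # "￭", the joiner marker seen in the tokenized source
BIG_SQUARE = "\u25A0"  # "■", the big square seen in the detokenization example


def looks_pretokenized(line):
    """Return True if the line contains a joiner marker or a big square."""
    return JOINER in line or BIG_SQUARE in line


def count_pretokenized(lines):
    """Count how many lines look as if they were already tokenized."""
    return sum(1 for line in lines if looks_pretokenized(line))


sample = [
    "Hochleistungs-Mähaufbereiter BiG M 450",
    "\uFFEDhoch \uFFEDleistungs \uFFED- \uFFEDmä",
    "\u25A0performance big m 45 \u25A00",
]
flagged = count_pretokenized(sample)
print(f"{flagged} of {len(sample)} lines look pre-tokenized")
```

If most lines of a training file are flagged, the file is probably tokenized already and the tokenization options should not also be set in the YAML file.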

jalesiyan-hadis · September 17, 2020, 8:41am

What a silly mistake.
Thank you for your help.