OpenNMT-tf tokenizer: issue using case_feature

yms9654 · June 28, 2019, 1:03am

This is my command and config file.

$ onmt-build-vocab --tokenizer_config ./agg.yml --size 50000 --save_vocab test.tok test.txt

agg.yml

mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true
case_feature: true

test.txt

WiFi
korea
KOREA

test.tok

<blank>
<s>
</s>
korea
wifi

I expected both korea and KOREA to appear in test.tok because case_feature: true is set,
but only korea is there.
Where does the case_feature information go?

guillaumekln · June 28, 2019, 8:03am

When case_feature is used in OpenNMT-tf, the tokens are lowercased but the case feature itself is ignored. So this vocabulary file is the expected result.
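
To illustrate why korea and KOREA collapse into a single vocabulary entry, here is a minimal pure-Python sketch. It is not the Tokenizer's actual implementation: the helper name case_feature and the one-letter labels (L, U, C, M, N) are assumptions chosen for illustration. It only shows the idea that the token is lowercased while the case is stored in a separate label, which is then dropped when the vocabulary is built.

```python
def case_feature(token):
    """Return (lowercased token, case label) for one token.

    Hypothetical helper: the labels are illustrative, not the library's API.
    """
    if not any(ch.isalpha() for ch in token):
        label = "N"  # no letters at all
    elif token.islower():
        label = "L"  # all lowercase
    elif token.isupper():
        label = "U"  # all uppercase
    elif token[0].isupper() and token[1:].islower():
        label = "C"  # capitalized
    else:
        label = "M"  # mixed case
    return token.lower(), label


words = ["WiFi", "korea", "KOREA"]  # the contents of test.txt
pairs = [case_feature(w) for w in words]

# The vocabulary only sees the lowercased tokens; the labels are discarded.
vocab = sorted({token for token, _ in pairs})

print(pairs)
print(vocab)
```

Once the labels are ignored, only the lowercased surface forms remain, so the two spellings of korea are indistinguishable.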

jalesiyan-hadis · March 2, 2020, 4:04pm

Hi,
In this case, is there any way to use case_feature in OpenNMT-tf?

guillaumekln · March 2, 2020, 4:18pm

Hi,

If you want a case “feature” in the target, you should look into the alternative case_markup flag:

https://github.com/OpenNMT/Tokenizer/blob/master/docs/options.md#case_markup-boolean-default-false
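
As a rough picture of what case markup means, the pure-Python sketch below lowercases each word and injects the case as extra tokens in the same sequence, so the information ends up in the vocabulary instead of in a side feature. This is a simplified illustration, not the Tokenizer's code: the function name add_case_markup is hypothetical, the marker strings are written as I understand them from the Tokenizer documentation and should be checked against the link above, and mixed-case words such as WiFi are out of scope here and passed through unchanged.

```python
# Marker tokens, assumed to match the ones the Tokenizer emits.
CAPITALIZED = "｟mrk_case_modifier_C｠"
UPPER_BEGIN = "｟mrk_begin_case_region_U｠"
UPPER_END = "｟mrk_end_case_region_U｠"


def add_case_markup(words):
    """Lowercase words and record their case as separate marker tokens."""
    out = []
    for word in words:
        lowered = word.lower()
        if len(word) > 1 and word.isupper():
            # Fully uppercase word: wrap the lowercased form in a region.
            out += [UPPER_BEGIN, lowered, UPPER_END]
        elif word[:1].isupper() and word[1:] == lowered[1:]:
            # Capitalized word: one modifier token before the lowercased form.
            out += [CAPITALIZED, lowered]
        else:
            # Already lowercase, or mixed case (not handled in this sketch).
            out.append(word)
    return out


tokens = add_case_markup(["Korea", "korea", "KOREA"])
print(tokens)
```

With this scheme the vocabulary still holds a single korea entry, plus a handful of marker tokens that let the model read or produce the original casing.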

jalesiyan-hadis · June 25, 2020, 9:40am

Hi.
It might be a silly question, but I’m a little bit confused.
You mentioned:

if I want a case feature in target

Did you mean I just have to use the case_markup flag for my target text?
I want to train a model from German to English. In German, the first letter of
every noun is capitalized, so I’m not sure what the best approach to tokenization is.

Thank you for your help.

guillaumekln · June 25, 2020, 10:02am

Hi,

You could use case_markup only for the source if you want to.