OpenNMT tokenizer

(Maxim Khalilov) #1

Hi all,
We noticed that the standard tokenizer (tools/tokenize.lua) cannot restore the casing of mixed-case words such as WiFi. We found a workaround: split such words into parts with consistent casing (WiFi -> Wi Fi) and restore the correct casing at the sub-word level.

`echo "Abc ABC ABc abC AbC" | th tools/tokenize.lua -joiner_annotate -case_feature -mode aggressive -segment_case | th tools/detokenize.lua -case_feature`

Abc ABC ABc abC AbC
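
To illustrate the idea behind the workaround, here is a minimal Python sketch. It is not the Lua code from tools/tokenize.lua, and the function names, the case labels (L, U, C, N) and the tuple-based token format are assumptions made up for this example. It also handles ASCII letters only. Each word is split into segments whose casing is consistent (all lowercase, all uppercase, or capitalized), each segment is lowercased and stored with a case label, and detokenization reapplies the label segment by segment.

```python
import re

# Hypothetical sketch, ASCII letters only. A segment is one of:
#   - a run of uppercase letters not followed by a lowercase letter ("ABC", "A")
#   - an optional uppercase letter followed by lowercase letters ("Wi", "abc")
#   - a run of non-letter characters
_SEGMENT = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[^A-Za-z]+")


def segment_case(word):
    """Split a word into parts that each have consistent casing."""
    return _SEGMENT.findall(word)


def case_feature(part):
    """Label a segment: L (lower), U (upper), C (capitalized), N (none)."""
    if part.islower():
        return "L"
    if part.isupper():
        return "U"
    if part[:1].isupper() and part[1:].islower():
        return "C"
    return "N"


def restore_case(lowered, feature):
    """Reapply the casing described by the label to a lowercased segment."""
    if feature == "U":
        return lowered.upper()
    if feature == "C":
        return lowered.capitalize()
    return lowered


def tokenize(text):
    """Return (lowercased segment, case label, join_left) tuples.

    join_left marks segments that must be glued to the previous one,
    playing the role that a joiner annotation plays in a real tokenizer.
    """
    tokens = []
    for word in text.split():
        for i, part in enumerate(segment_case(word)):
            tokens.append((part.lower(), case_feature(part), i > 0))
    return tokens


def detokenize(tokens):
    """Rebuild the text, restoring casing one segment at a time."""
    words = []
    for lowered, feature, join_left in tokens:
        piece = restore_case(lowered, feature)
        if join_left and words:
            words[-1] += piece
        else:
            words.append(piece)
    return " ".join(words)


text = "Abc ABC ABc abC AbC WiFi"
print(segment_case("WiFi"))        # ['Wi', 'Fi']
print(detokenize(tokenize(text)))  # Abc ABC ABc abC AbC WiFi
```

Because every segment carries a single, unambiguous case label, the round trip is lossless for the examples above, which is the property the command is meant to demonstrate.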

Would it be possible to include support for mixed casing in the tokenizer/detokenizer?

Link to the original request (GitHub):

(Guillaume Klein) #2

Hi and welcome to the OpenNMT forum!

This has been merged thanks to @kovalevfm. :thumbsup: