We noticed that the standard tokenizer (tools/tokenize.lua) has no feature to restore the casing of words with mixed casing (e.g. WiFi). We found a workaround: split such words into two parts with consistent casing (WiFi -> Wi Fi) and restore the correct casing on a sub-word basis.
```
echo "Abc ABC ABc abC AbC" | th tools/tokenize.lua -joiner_annotate -case_feature -mode aggressive -segment_case | th tools/detokenize.lua -case_feature
Abc ABC ABc abC AbC
```
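The splitting part of the workaround can be sketched roughly as follows (a minimal illustration in Python, not the actual Lua implementation in OpenNMT; the helper name `split_mixed_case` is hypothetical):

```python
import re

def split_mixed_case(token):
    # Insert a break at every lowercase->uppercase boundary so that each
    # resulting piece has consistent casing, e.g. "WiFi" -> ["Wi", "Fi"].
    # Tokens without such a boundary are returned unchanged.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', token).split(' ')

print(split_mixed_case("WiFi"))   # ['Wi', 'Fi']
print(split_mixed_case("Abc"))    # ['Abc']
```

Each piece then carries a single case value (upper, lower, or capitalized), so the existing `-case_feature` mechanism can restore it on detokenization.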
Would it be possible to include support for mixed casing in the tokenizer/detokenizer?
Link to the original request (GitHub): https://github.com/OpenNMT/OpenNMT/pull/305