OpenNMT tokenizer

maxkhalilov · June 1, 2017, 11:55am

Hi all,
We noticed that the standard tokenizer (tools/tokenize.lua) cannot restore the casing of mixed-case words such as WiFi. We found a workaround: split such words into two or more parts with consistent casing (WiFi -> Wi Fi) and restore the correct casing on a sub-word basis.

Example:
`echo "Abc ABC ABc abC AbC" | th tools/tokenize.lua -joiner_annotate -case_feature -mode aggressive -segment_case | th tools/detokenize.lua -case_feature`

Abc ABC ABc abC AbC
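
To make the idea concrete, here is a minimal Python sketch of the split-and-restore approach described above. It is not the code in tools/tokenize.lua. The function names and the single-letter labels (L, U, C, N) are assumptions chosen for illustration, the exact split points may differ from what -segment_case produces, and only ASCII letters are handled.

```python
import re

# Hypothetical helper pattern: a capitalized run, an all-uppercase run,
# an all-lowercase run, or a run of anything else (ASCII letters only).
_SEGMENT = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|[^A-Za-z]+")


def segment_case(word):
    """Split a word into pieces that each have consistent casing."""
    return _SEGMENT.findall(word)


def case_label(segment):
    """Return an illustrative case label for one segment."""
    if not segment[0].isalpha():
        return "N"  # no case information (digits, punctuation)
    if segment.islower():
        return "L"  # all lowercase
    if len(segment) > 1 and segment.isupper():
        return "U"  # all uppercase
    return "C"      # capitalized (including a single uppercase letter)


def encode(word):
    """Lowercase every segment and pair it with its case label."""
    return [(seg.lower(), case_label(seg)) for seg in segment_case(word)]


def restore(pairs):
    """Rebuild the original word from (lowercased segment, label) pairs."""
    parts = []
    for text, label in pairs:
        if label == "U":
            parts.append(text.upper())
        elif label == "C":
            parts.append(text.capitalize())
        else:
            parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    for word in "Abc ABC ABc abC AbC WiFi".split():
        print(word, encode(word), restore(encode(word)))
```

With this sketch, `encode("WiFi")` gives `[('wi', 'C'), ('fi', 'C')]`, and `restore` joins the pieces back into `WiFi`. Because each piece has a single, unambiguous case label, the round trip works for all five words in the example above.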

Would it be possible to include support for mixed casing in the tokenizer/detokenizer?

Link to the original request (GitHub): https://github.com/OpenNMT/OpenNMT/pull/305

guillaumekln · June 2, 2017, 1:45pm

Hi and welcome to the OpenNMT forum!

This has been merged thanks to @kovalevfm.