OpenNMT tokenizer


(Maxim Khalilov - Booking.com) #1

Hi all,
We noticed that the standard tokenizer (tools/tokenize.lua) cannot restore the casing of mixed-case words (e.g. WiFi). We found a workaround: split such words into two parts with consistent casing (WiFi -> Wi Fi) and restore the correct casing on a sub-word basis.

Example:
`echo "Abc ABC ABc abC AbC" | th tools/tokenize.lua -joiner_annotate -case_feature -mode aggressive -segment_case | th tools/detokenize.lua -case_feature`

Abc ABC ABc abC AbC
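
For readers who want to see the idea outside the tokenizer, here is a minimal Python sketch of the workaround. It is not the OpenNMT implementation: the function names (`split_case`, `annotate`, `restore`), the feature labels (`L`, `U`, `C`, `N`) and the ASCII-only regular expression are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: cut a word into runs with consistent casing.
# Handles ASCII letters only; anything else is kept as its own run.
_SEGMENT = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|[^A-Za-z]+")


def split_case(word):
    """Split a word wherever the casing pattern changes."""
    return _SEGMENT.findall(word)


def case_feature(segment):
    """Return an illustrative case label for one segment."""
    if segment.islower():
        return "L"  # all lowercase
    if segment.isupper():
        return "U"  # all uppercase
    if segment[0].isupper() and segment[1:].islower():
        return "C"  # capitalized
    return "N"      # no usable casing (digits, punctuation, ...)


def annotate(word):
    """Turn a word into (token, feature) pairs, lowercasing cased tokens."""
    pairs = []
    for segment in split_case(word):
        feature = case_feature(segment)
        token = segment if feature == "N" else segment.lower()
        pairs.append((token, feature))
    return pairs


def restore(pairs):
    """Rebuild the original word from (token, feature) pairs."""
    restored = []
    for token, feature in pairs:
        if feature == "U":
            restored.append(token.upper())
        elif feature == "C":
            restored.append(token.capitalize())
        else:
            restored.append(token)
    return "".join(restored)


for word in "Abc ABC ABc abC AbC WiFi".split():
    print(word, annotate(word), restore(annotate(word)))
```

Because every segment carries a single unambiguous label, restoring the casing per sub-word and concatenating the results gives back the original word.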

Would it be possible to include support for mixed casing in the tokenizer/detokenizer?

Link to the original request (GitHub): https://github.com/OpenNMT/OpenNMT/pull/305


(Guillaume Klein) #2

Hi and welcome to the OpenNMT forum!

This has been merged thanks to @kovalevfm. :thumbsup: