In Moses and various forks like ModernMT, “punctuation normalization” and “tokenization” are very language-specific.
For instance, there is a punctuation normalization step that converts French quotes (guillemets) to standard quotes.
The same goes for the various apostrophe characters.
Also, the apostrophe is not tokenized the same way in English and in French.
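To make the point concrete, here is a rough sketch of what I mean by language-specific handling. It is a simplified illustration, not the actual Moses scripts: the function names `normalize_punctuation` and `split_apostrophe` and the rule set are my own assumptions, and the real scripts cover many more cases.

```python
import re

# Hypothetical, simplified rules for illustration only.
CURLY_QUOTES = {"“": '"', "”": '"', "‘": "'", "’": "'"}


def normalize_punctuation(text):
    """Map French guillemets and curly quotes/apostrophes to ASCII ones."""
    # French typography puts a space inside guillemets; drop it when converting.
    text = re.sub(r"«\s*", '"', text)
    text = re.sub(r"\s*»", '"', text)
    for src, tgt in CURLY_QUOTES.items():
        text = text.replace(src, tgt)
    return text


def split_apostrophe(text, lang):
    """Split on an apostrophe between two word characters, depending on language."""
    if lang == "fr":
        # French: the apostrophe stays with the left part (l'ami -> l' ami).
        return re.sub(r"(\w)'(\w)", r"\1' \2", text)
    # English-style: the apostrophe goes with the right part (don't -> don 't).
    return re.sub(r"(\w)'(\w)", r"\1 '\2", text)


if __name__ == "__main__":
    print(split_apostrophe(normalize_punctuation("« l’ami »"), "fr"))
    print(split_apostrophe("don't", "en"))
```

The same input character ends up attached to a different side of the split depending on the language, which is why I am asking how this is handled.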
How do you guys compare to the Moses-style tokenizer/detokenizer Perl scripts?
Also, I find the way the casing options are presented here a little confusing:
-case_feature: generate case feature - and convert all tokens to lowercase
N: not defined (for instance tokens without case)
L: token is lowercased (opennmt)
U: token is uppercased (OPENNMT)
C: token is capitalized (Opennmt)
M: token case is mixed (OpenNMT)
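For reference, this is how I read the list above, written as a small sketch. It is only my interpretation of the documented labels, not the actual OpenNMT implementation; the function name `case_feature` is hypothetical, and treating a single uppercase letter as U rather than C is my own guess.

```python
def case_feature(token):
    """Return (lowercased token, case label) following the N/L/U/C/M list."""
    if token.lower() == token.upper():
        label = "N"  # no cased characters, e.g. digits or punctuation
    elif token.islower():
        label = "L"  # opennmt
    elif token.isupper():
        label = "U"  # OPENNMT
    elif token[0].isupper() and token[1:].islower():
        label = "C"  # Opennmt
    else:
        label = "M"  # OpenNMT
    return token.lower(), label


if __name__ == "__main__":
    for tok in ["opennmt", "OPENNMT", "Opennmt", "OpenNMT", "42"]:
        print(tok, case_feature(tok))
```

If that reading is right, the token itself is always lowercased and the original casing survives only in the feature.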
What is the default behavior?
Can we train in “truecasing” mode, i.e. leave the casing as it is in the corpus, except for the first word of each sentence, which is converted to its most likely form?
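By “truecasing” I mean something along these lines. This is a minimal sketch under my own assumptions (the names `train_truecaser` and `truecase` are hypothetical, and the model is just a frequency table learned from non-initial positions), not the Moses truecaser itself.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Learn the most frequent surface form of each word from tokenized sentences."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        # Skip the first token: its capitalization is not informative.
        for tok in tokens[1:]:
            counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(tokens, model):
    """Rewrite only the sentence-initial token; leave everything else as is."""
    if not tokens:
        return []
    first = model.get(tokens[0].lower(), tokens[0])  # unknown words are kept
    return [first] + list(tokens[1:])


if __name__ == "__main__":
    corpus = [
        ["The", "cat", "saw", "Paris"],
        ["We", "like", "the", "cat"],
        ["I", "saw", "the", "dog"],
    ]
    model = train_truecaser(corpus)
    print(truecase(["The", "dog", "barks"], model))
    print(truecase(["Paris", "is", "big"], model))
```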