Here’s another opinion: if there aren’t any serious vocabulary, syntax, or terminology errors that need to be fixed, don’t over-normalize your text. Just clean out the junk that comes in large quantities and could hurt NMT learning (lots of HTML tags, chopped segments, etc.). Assuming you have a large corpus (which should be the case anyway in order to train a good model), a bit of “noise” in your training data (minor misspellings, a few extra spaces, etc.) is not a problem and can even make the model more robust.
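For illustration, here is a minimal sketch of that kind of coarse cleaning over a parallel corpus; the file names, the tag regex, and the length thresholds are my own assumptions, not a prescribed recipe:

```python
import re

# Assumed input files: one segment per line, source/target aligned (hypothetical names).
SRC_IN, TGT_IN = "train.src", "train.tgt"
SRC_OUT, TGT_OUT = "train.clean.src", "train.clean.tgt"

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML/XML tag matcher
MIN_TOKENS, MAX_TOKENS = 3, 150        # assumed limits to drop chopped or runaway segments

def is_clean(src: str, tgt: str) -> bool:
    """Keep a pair only if neither side contains markup or is suspiciously short/long."""
    for seg in (src, tgt):
        if TAG_RE.search(seg):
            return False
        n = len(seg.split())
        if n < MIN_TOKENS or n > MAX_TOKENS:
            return False
    return True

with open(SRC_IN, encoding="utf-8") as fs, open(TGT_IN, encoding="utf-8") as ft, \
     open(SRC_OUT, "w", encoding="utf-8") as out_s, open(TGT_OUT, "w", encoding="utf-8") as out_t:
    for src, tgt in zip(fs, ft):
        if is_clean(src.strip(), tgt.strip()):
            out_s.write(src)
            out_t.write(tgt)
```

The point is to remove only segments that are clearly junk in bulk, while leaving the ordinary small imperfections of the corpus alone.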
Concerning case handling, I have always considered this a major flaw of previous MT technologies. Casing is an important linguistic feature in most languages and affects syntax and meaning. If you lowercase everything, the MT system cannot learn this feature. So, for example, the sentence
“The animal care institute is located in the old ship district.”
would be translated differently by a translator than this sentence:
“The Animal Care Institute is located in the Old Ship district.”
A human translator will translate everything in the first sentence, but in the second sentence they will be alerted by the title case and correctly assume these are proper names, so they will not translate them.
BPE alleviates the vocabulary sparsity problem, and NMT systems actually learn to make the above distinction based on casing.
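To make the mechanism concrete, here is a minimal sketch using the Hugging Face tokenizers library (that choice is an assumption on my part; subword-nmt or SentencePiece would show the same effect). The toy corpus and vocab_size are made up purely for illustration:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Tiny, made-up true-cased corpus standing in for real training data.
corpus = [
    "The animal care institute is located in the old ship district.",
    "The Animal Care Institute is located in the Old Ship district.",
    "Our institute takes care of every animal in the district.",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# The cased and lowercased variants are encoded as different token sequences,
# and neither falls back to [UNK]; on a realistic corpus, rarer cased forms
# would simply be split into smaller subword pieces.
for sent in ("the animal care institute", "The Animal Care Institute"):
    print(sent, "->", tokenizer.encode(sent).tokens)
```

Because the cased forms are either kept whole or decomposed into subwords rather than mapped to an unknown token, the model sees both the word content and the casing signal, which is exactly what lets it make the distinction in the two example sentences above.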
Thank you so much for all your recommendations. I am very new to NLP and I really appreciate it. I will also take a look at the link to the “Best Practices” you provided.
Thanks for your contribution. I am very new to NLP and I really appreciate it. Regarding the case question, I think I am going to train a model where every proper name has its first letter uppercased and see what happens.