Do not delete them, especially if they might be different between source and target.
Actually, the common practice is to lowercase everything; otherwise, your MT model will treat words with different casing as different tokens. Later, you can use a truecaser library to restore casing in the final translation. You can play with my NMT demo to see what I mean: select the French-English model from the drop-down menu, translate this sentence (with wrong casing), and check the output: "nous remercions le commissaire de l'union africaine, m lamamra, de sa déclaration."
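To make the lowercase-then-truecase idea concrete, here is a minimal sketch of a frequency-based truecaser, roughly in the spirit of the Moses `truecase.perl` approach. Everything here (function names, the training sentences) is illustrative, not a real library API: it learns each token's most frequent casing from non-sentence-initial positions, then restores it after translation.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # Count casing variants of each token, skipping the sentence-initial
    # position, where casing is forced and therefore ambiguous.
    counts = defaultdict(Counter)
    for sent in sentences:
        for tok in sent.split()[1:]:
            counts[tok.lower()][tok] += 1
    # Keep the most frequent surface form per lowercased token.
    return {low: c.most_common(1)[0][0] for low, c in counts.items()}

def truecase(sentence, model):
    # Restore learned casing; unknown tokens stay lowercase.
    toks = [model.get(t, t) for t in sentence.lower().split()]
    if toks:
        # Capitalize the sentence start as a fallback.
        toks[0] = toks[0][0].upper() + toks[0][1:]
    return " ".join(toks)
```

A real truecaser handles more cases (ambiguous tokens, punctuation-aware sentence starts), but this shows why lowercasing the training data is recoverable: the casing statistics can be learned separately from the monolingual target side.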
Actually, keep them. Numbers are a complicated topic, but at the very least do not delete them if you want the model to learn them.
I do not know your case, but short sentences help if they include parts of longer sentences.
You might also want to consider:
removing segments where the source, the target, or both are empty;
removing segments whose source is much too long or too short compared to the target;
removing tags that are not important for translation.
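The filtering steps above can be sketched as a single pass over the parallel corpus. This is a minimal illustration, not a production cleaner: the function name, thresholds (`max_ratio`, `max_len`), and the simple tag regex are all my own assumptions, and real pipelines often use tools like Moses' `clean-corpus-n.perl` instead.

```python
import re

def clean_parallel(src_lines, tgt_lines, max_ratio=3.0, max_len=200):
    # Strip simple XML/HTML-style tags that carry no translation content.
    tag_re = re.compile(r"</?[a-zA-Z][^>]*>")
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src = tag_re.sub("", src).strip()
        tgt = tag_re.sub("", tgt).strip()
        if not src or not tgt:
            continue  # drop empty source or target segments
        ls, lt = len(src.split()), len(tgt.split())
        if ls > max_len or lt > max_len:
            continue  # drop overly long segments
        if ls / lt > max_ratio or lt / ls > max_ratio:
            continue  # drop segments with implausible length ratios
        kept.append((src, tgt))
    return kept
```

Length-ratio filtering is a cheap proxy for alignment quality: a 10-word source paired with a 1-word target is almost certainly misaligned, and such pairs actively hurt training.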
Here’s another opinion: if there aren’t any serious vocabulary, syntax, or terminology errors that need to be fixed, do not over-normalize your text. Just clean out the junk that comes in large quantities and could negatively affect NMT learning (lots of HTML tags, chopped segments, etc.). Assuming you have a large corpus (which should be the case anyway in order to train a good model), having a bit of “noise” (minor misspellings, a few extra spaces, etc.) in your training data is not a problem.
Concerning case handling, I have always considered this a major flaw of previous MT technologies. Casing is an important linguistic feature for most languages that affects syntax and meaning. If you lowercase everything, then the MT system cannot learn this feature. So, for example, the sentence
“The animal care institute is located in the old ship district.”
would be translated differently by a translator than this sentence:
“The Animal Care Institute is located in the Old Ship district.”
A human translator will translate everything in the first sentence, but in the second sentence they will be alerted by the title case, correctly assume these are proper names, and leave them untranslated.
BPE alleviates the vocabulary sparsity problem, and NMT systems do in fact learn to make the above distinction based on casing.
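For anyone curious how BPE keeps the vocabulary compact without lowercasing, here is a minimal sketch of the merge-learning step, following the algorithm from Sennrich et al.'s subword-nmt work. The toy vocabulary below is made up (words are space-separated symbols ending in an end-of-word marker `</w>`); a real system would learn tens of thousands of merges from corpus frequencies.

```python
import re
from collections import defaultdict

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary.
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the pair with its concatenation.
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    # Repeatedly merge the most frequent adjacent pair.
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges, vocab
```

Because cased and lowercase variants share most of their subword units, the model can keep casing in the data and still generalize across "animal" and "Animal", which is what lets it learn the proper-name distinction above.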
Thanks for your contribution; I am very new to NLP and I really appreciate it. Regarding the casing question, I think I am going to train a model where every proper name has its first letter uppercased and see what happens.