What kind of rules should I follow in order to clean my texts before training?
I am currently doing these things:
- Removing all punctuation and non-alphanumeric characters
- Lowercasing the first letter of each sentence
- Deleting duplicated translations, identical sentences in source and target
- Normalizing whitespace to a single space (i.e., replacing two or more spaces with one)
- Deleting enumeration markers such as 1.1, 1.2, etc. or a), b), etc.
- Deleting short sentences, as in my case they do not contain relevant information.
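
For illustration, the steps above might be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the names `clean_sentence` and `clean_corpus`, the `min_tokens` threshold, and the exact regular expressions are hypothetical choices, not my actual pipeline. It also assumes the corpus is a list of (source, target) string pairs, and it treats "duplicated translations" as both repeated pairs and pairs where source equals target.

```python
import re

# Leading enumeration such as "1.1 ", "1. ", "2) " or "a) ".
# Caveat: a sentence that starts with a decimal ("2.5 liters ...") is
# indistinguishable from a section number and would lose it too.
ENUM_RE = re.compile(r"^\s*(?:\d+(?:\.\d+)+\.?|\d+[.)]|[a-zA-Z]\))\s+")
# Anything that is not a letter, digit or whitespace (underscore included).
NON_ALNUM_RE = re.compile(r"[^\w\s]|_")
SPACES_RE = re.compile(r"\s+")


def clean_sentence(text):
    """Apply the per-sentence steps from the list above."""
    text = ENUM_RE.sub("", text)             # drop leading enumeration
    text = NON_ALNUM_RE.sub(" ", text)       # drop punctuation and symbols
    text = SPACES_RE.sub(" ", text).strip()  # collapse runs of whitespace
    if text:
        text = text[0].lower() + text[1:]    # lowercase the first letter only
    return text


def clean_corpus(pairs, min_tokens=3):
    """Clean (source, target) pairs, then filter and deduplicate them."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = clean_sentence(src), clean_sentence(tgt)
        if src == tgt:
            continue  # source and target are identical
        if len(src.split()) < min_tokens or len(tgt.split()) < min_tokens:
            continue  # too short to be useful (threshold is an assumption)
        if (src, tgt) in seen:
            continue  # duplicated translation
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("1.1 The house is red.", "La casa es roja."),
        ("The house is red.", "La casa es roja."),
        ("Yes.", "Sí."),
    ]
    for pair in clean_corpus(sample):
        print(pair)
```

Deduplication runs after cleaning here, so pairs that differ only in punctuation or numbering count as duplicates; running it before cleaning would keep them.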
Any more ideas? Should I remove decimal numbers? Should I remove parts between brackets? Are there any of the previous steps I shouldn't do?
Thanks in advance