How should I clean the texts before training?

Hi,

What kind of rules should I follow in order to clean my texts before training?

I am currently doing these things (roughly sketched in the snippet after the list):

  • Removing all punctuation and non-alphanumeric characters
  • Lowercasing the first letter of each sentence
  • Deleting duplicated translations and pairs where source and target are identical
  • Normalizing whitespace to a single space (i.e., replacing two or more spaces with one)
  • Deleting numeration such as 1.1, 1.2, etc. or a), b), etc.
  • Deleting short sentences, as in my case they do not contain relevant info.
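
For reference, something like this is what my cleaning script currently does (a simplified sketch, not my exact code; the regexes and the length threshold are just illustrative):

```python
import re

def clean_line(line: str) -> str:
    """Apply my current cleaning rules to one segment (simplified)."""
    line = line.strip()
    # Drop leading numeration such as "1.1", "1.2." or "a)", "b)"
    line = re.sub(r"^(\d+(\.\d+)*\.?|[A-Za-z]\))\s+", "", line)
    # Remove punctuation and non-alphanumeric characters (keep letters, digits, spaces)
    line = re.sub(r"[^\w\s]", "", line)
    # Lowercase the first letter of the sentence
    if line:
        line = line[0].lower() + line[1:]
    # Normalize two or more spaces to a single space
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(pairs, min_len=4):
    """pairs: iterable of (source, target) strings; yields cleaned, filtered pairs."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = clean_line(src), clean_line(tgt)
        if not src or not tgt:
            continue
        if src == tgt:                    # identical source and target
            continue
        if len(src.split()) < min_len:    # short sentences
            continue
        if (src, tgt) in seen:            # duplicated translations
            continue
        seen.add((src, tgt))
        yield src, tgt
```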

Any more ideas? Should I remove decimal numbers? Should I remove parts between brackets? Are there any of the previous steps I shouldn’t do?

Thanks in advance

Dear Ana,

Regarding decimal numbers and parts between brackets: do not delete them, especially if they might be different between source and target.

As for casing: actually, the common practice is to lowercase everything. Otherwise, your MT model will treat the same word with different casing as two unrelated words. Later, you can use a true-caser library to restore casing in the final translation. You can play with my NMT demo to see what I mean. For example, select the French-English model from the drop-down menu, try to translate this sentence (with wrong casing), and check the translation.
nous remercions le commissaire de l'union africaine, m lamamra, de sa déclaration.
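
To make the lowercase-then-truecase idea concrete, here is a toy sketch. The "truecaser" below is just a word-frequency table standing in for a real tool (e.g. the Moses truecaser); it only illustrates the principle:

```python
from collections import Counter, defaultdict

def train_truecaser(cased_sentences):
    """Learn the most frequent surface casing of each word from a cased corpus."""
    counts = defaultdict(Counter)
    for sent in cased_sentences:
        for tok in sent.split():
            counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def truecase(lowercased_sentence, model):
    """Restore the most likely casing of a lowercased MT output."""
    return " ".join(model.get(tok, tok) for tok in lowercased_sentence.split())

# The truecaser is trained on the original (cased) target-side corpus.
model = train_truecaser(
    ["We thank the Commissioner of the African Union , Mr Lamamra , for his statement ."]
)

# The MT model itself is trained on (and produces) lowercased text.
mt_output = "we thank the commissioner of the african union , mr lamamra , for his statement ."
print(truecase(mt_output, model))
# -> We thank the Commissioner of the African Union , Mr Lamamra , for his statement .
```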

As for numeration such as 1.1 or a): actually, keep it. Numbering is a complicated topic, but at least do not delete it if you want the model to learn it.

I do not know your case, but short sentences help if they include parts of longer sentences.

You might also want to consider (sketched in the snippet after this list):

  • removing empty source/target/both segments.
  • removing segments whose source is far too long or too short compared to the target.
  • removing tags that are not important for translation.
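
Something along these lines, for example (a minimal sketch; the tag regex and the length ratio of 3 are arbitrary choices you would tune for your data):

```python
import re

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0):
    """Return a cleaned (src, tgt) pair, or None if the pair should be dropped."""
    # Strip simple XML/HTML-like tags that are not important for translation
    src = re.sub(r"<[^>]+>", " ", src).strip()
    tgt = re.sub(r"<[^>]+>", " ", tgt).strip()
    # Remove empty source/target/both segments
    if not src or not tgt:
        return None
    # Remove segments whose source is far too long or too short compared to the target
    s_len, t_len = len(src.split()), len(tgt.split())
    if s_len / t_len > max_ratio or t_len / s_len > max_ratio:
        return None
    return src, tgt
```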

You can also check some Best Practices in Translation Memory Management, which might give you some cleaning ideas.

Apart from this, you must tokenize, whether it is word tokenization or sub-word tokenization.
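
For example, a typical sub-word setup with the sentencepiece library looks roughly like this (file names and the vocabulary size are placeholders):

```python
import sentencepiece as spm

# Train a BPE model on the cleaned training text ("train.clean.txt" is a placeholder path).
spm.SentencePieceTrainer.train(
    input="train.clean.txt",
    model_prefix="bpe",
    vocab_size=8000,        # placeholder; tune for your corpus size
    model_type="bpe",
)

# Segment sentences into sub-word units before feeding them to the NMT toolkit.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("We thank the Commissioner of the African Union.", out_type=str))
# e.g. ['▁We', '▁thank', '▁the', '▁Commission', 'er', '▁of', '▁the', '▁African', '▁Union', '.']
```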

All the best!
Yasmin

Hi,

Here’s another opinion: if there aren’t any serious vocabulary, syntax, or terminology errors that need to be fixed, do not over-normalize your text. Just clean it of junk that comes in large quantities and could negatively affect NMT learning (lots of HTML tags, chopped segments, etc.). Assuming you have a large corpus (which should be the case anyway in order to train a good model), having a bit of “noise” (minor misspellings, a few extra spaces, etc.) in your training data is actually fine and can even make the model more robust.
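
By “junk in large quantities” I mean things you can catch with a couple of crude heuristics, for example (a rough sketch; the thresholds are arbitrary):

```python
import re

def looks_like_junk(segment: str, max_tag_ratio: float = 0.3) -> bool:
    """Flag segments that are mostly markup or look chopped; minor noise is left alone."""
    tokens = segment.split()
    if not tokens:
        return True
    # Mostly HTML/XML tags?
    tag_tokens = sum(1 for t in tokens if re.fullmatch(r"<[^>]+>", t))
    if tag_tokens / len(tokens) > max_tag_ratio:
        return True
    # Long segment with no sentence-final punctuation at all: possibly chopped
    if len(tokens) > 15 and segment.rstrip()[-1] not in ".!?…":
        return True
    return False
```
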
Concerning case handling, I have always considered this a major flaw of previous MT technologies. Casing is an important linguistic feature for most languages that affects syntax and meaning. If you lowercase everything, then the MT system cannot learn this feature. So, for example, the sentence

“The animal care institute is located in the old ship district.”

would be translated differently by a translator than this sentence:

“The Animal Care Institute is located in the Old Ship district.”

A human translator will translate everything in the first sentence, but in the second sentence they will be alerted by the title case and correctly assume these are proper names, so they will not translate them.

BPE alleviates the vocabulary sparsity problem and NMT systems actually learn to make the above distinction based on casing.

Yasmin,

thank you so much for all your recommendations. I am very new to NLP and I really appreciate it. I will also take a look at the link to the “Best Practices” you provided.

Ana

Hi @panosk,

thanks for your contribution; I am very new to NLP and I really appreciate it. Regarding the case question, I think I am going to train a model where every proper name keeps its first letter uppercased and see what happens.

Regards,
Ana