As the corpus for training may come from various kinds of data, the quality of that data will affect the training result. Do you have any suggestions for cleaning up a corpus before training?
E.g.: Remove garbage characters
Remove XML tags
…
Hi @liluhao1982 - writing a tutorial on cleaning up a corpus for training is on my todo list - it is quite a long topic, but here are a few rules of thumb:
normalize the usage of inner tags/placeholders (it is not necessarily a good idea to remove them completely if you foresee inner tags in your translation task)
clean/normalize your target corpus as much as you can (punctuation, case, quotes, spacing, spelling variants, possibly even locale),
and, on the other hand, keep your source corpus as “unclean” as you can - or, if you do normalize it, make sure you can apply the same process at translation time (a rough normalization sketch follows below)
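To give an idea of what I mean by normalizing the target side, here is a minimal sketch in Python; the file names and the exact replacement rules are illustrative assumptions, not OpenNMT requirements:

```python
# Rough sketch of target-side normalization; the rules below (ASCII quotes,
# single spaces) are illustrative choices only.
import re
import unicodedata

def normalize_line(line):
    # Merge canonically equivalent character sequences.
    line = unicodedata.normalize("NFC", line)
    # Unify curly/angle quotes and curly apostrophes to plain ASCII ones.
    line = re.sub(r"[\u201c\u201d\u00ab\u00bb]", '"', line)
    line = re.sub(r"[\u2018\u2019]", "'", line)
    # Collapse whitespace runs and trim the ends.
    return re.sub(r"\s+", " ", line).strip()

# Hypothetical file names, just for illustration.
with open("target.txt", encoding="utf-8") as fin, \
     open("target.norm.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(normalize_line(line) + "\n")
```

Whatever rules you pick, keep the script around: the same normalization has to be applied to anything you send to the model at translation time.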
Did you have time to look into it? I am now trying to figure out how best to prepare corpora for ONMT training, and I am unsure which steps I should take. As an experiment, I followed this procedure for TMX files:
Removed all tags to get plain-text segments (I understand that NMT does not provide any flexible means of having markup inside the segments - having segments with different tags, as plain text, in the training corpus is not a good idea). A sketch of this step is shown after this list.
Ran a search-and-replace script to insert spaces around punctuation, removed leading/trailing whitespace, and collapsed runs of whitespace into single spaces.
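For step 1, what I did is roughly the following (only a sketch: the function name, file names, and language codes are placeholders, and it only handles the plain tu/tuv/seg structure of a TMX file):

```python
# Sketch of the tag-stripping step: pull plain-text segment pairs out of a TMX
# file, dropping inline markup.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_path, src_lang="en", tgt_lang="de"):
    tree = ET.parse(tmx_path)
    for tu in tree.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() drops inline tags (bpt/ept/ph) and keeps only the text.
                segs[lang[:2]] = " ".join("".join(seg.itertext()).split())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

with open("corpus.en", "w", encoding="utf-8") as f_src, \
     open("corpus.de", "w", encoding="utf-8") as f_tgt:
    for src, tgt in extract_pairs("memory.tmx"):
        f_src.write(src + "\n")
        f_tgt.write(tgt + "\n")
```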
What is the best way to prepare TMX files for NMT translation? How do I best normalize tags - is there a ready-made script for that, and if not, what transformation should be applied to the tags (native TMX tags or deserialized XML ones): remove them, or replace them with something? Should I keep a map of the tags and their placeholders so that I can restore the real tags after ONMT training/translation? And where exactly is it best to insert spaces - in other words, what are the optimal tokenization rules?
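To make the placeholder question concrete, something like the following is what I have in mind; the TAG_n token format and the regex are purely illustrative assumptions, not an OpenNMT convention:

```python
# Illustration of the placeholder idea: replace each inline tag with a numbered
# token before training/translation and restore it afterwards.
import re

TAG_RE = re.compile(r"<[^>]+>")

def protect_tags(segment):
    mapping = {}
    def repl(match):
        token = "TAG_%d" % len(mapping)
        mapping[token] = match.group(0)
        return " %s " % token
    return TAG_RE.sub(repl, segment), mapping

def restore_tags(translated, mapping):
    # Replace longer tokens first so TAG_10 is not clobbered by TAG_1.
    for token in sorted(mapping, key=len, reverse=True):
        translated = translated.replace(token, mapping[token])
    return translated

text, tags = protect_tags("Click <b>Save</b> to continue.")
# text -> 'Click  TAG_0 Save TAG_1  to continue.'
# restore_tags(text, tags) puts <b> and </b> back in place of the tokens.
```

Is this the right direction, or is there a better-established way to handle inline markup with ONMT?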
What is the way to start improving NMT output quality after trying the basic Quickstart tutorial? Should I download the WMT16 parallel and monolingual files (and, in general, collect and compile parallel/monolingual corpora) for a “generic” model? (Note that I found europarl-v7.de-en.de and europarl-v7.de-en.en have mismatching numbers of lines, so I am not sure I can trust those - a trivial check is sketched below.) I understand I should collect whatever corpora I can get, tokenize them, create a generic model, and then use a customer’s corpus to “tune” the generic model; when I use only a customer’s corpus, I see lots of <unk> in my translations (even when I take the segments for translation from the validation part of the corpus).
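By a line-count check I mean nothing more than something like this (any equivalent of wc -l does the same job):

```python
# Trivial sanity check that a parallel file pair lines up; the Europarl file
# names are the ones mentioned above, any pair of plain-text files works.
def count_lines(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)

n_de = count_lines("europarl-v7.de-en.de")
n_en = count_lines("europarl-v7.de-en.en")
print(n_de, n_en)
if n_de != n_en:
    print("Line counts differ: the pair cannot be used as-is for training.")
```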
Have you written this tutorial already? If yes, can you please share?
I have a question about normalization and punctuation.
Is it good practice to remove punctuation altogether? Should you lower-case all text in both source and target? How will the neural network learn to render the correct punctuation or casing if it doesn’t see it in the training data?