Hi @liluhao1982 - writing a tutorial on cleaning up a corpus for training is on my todo list - it is quite a long topic, but here are a few rules of thumb:
- normalize usage of inner tags/placeholders (it is not necessarily a good idea to completely remove them, if you foresee inner tags in your translation task)
- clean/normalize your target corpus as much as you can (punctuation, case, quotes, spacing, spelling variants, possibly even locale),
- and on the other hand, "unclean" your source corpus as much as you can - or if you do normalize it, make sure you can apply the same process at translation time
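
To make these rules a bit more concrete, here is a rough Python sketch (not an official tool, just an illustration). The function names `normalize_tags` and `normalize_text`, the `<phN>` placeholder format and the choice of which quotes to map are all my own assumptions; adapt them to your data. The tag step is applied to both sides of a sentence pair so the placeholders line up, while the text cleanup is applied to the target only.

```python
import re
import unicodedata

# Matches simple XML-like inline tags such as <b>, </b>, <a href="...">.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")

# Curly quotes mapped to straight ones (assumed policy, extend as needed).
QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def normalize_tags(source, target):
    """Replace inline tags by numbered placeholders shared by both sides."""
    mapping = {}

    def repl(match):
        tag = match.group(0)
        if tag not in mapping:
            mapping[tag] = "<ph%d>" % (len(mapping) + 1)
        return mapping[tag]

    # Source is processed first, so numbering follows source order.
    return TAG_RE.sub(repl, source), TAG_RE.sub(repl, target)


def normalize_text(text):
    """Target-side cleanup: Unicode form, quotes, spacing."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTES)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    src = "Click <b>Save</b>  now"
    tgt = "Cliquez sur <b>\u201cEnregistrer\u201d</b>  maintenant"
    src, tgt = normalize_tags(src, tgt)
    tgt = normalize_text(tgt)  # the source is left "unclean" on purpose
    print(src)
    print(tgt)
```

Since `normalize_tags` changes the source side, the same function has to be run on the input at translation time (and the placeholders mapped back to the original tags afterwards).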
I will try to write a dedicated tutorial soon...