Cleaning and Normalizing WMT Dataset for French to English

Rishi · June 16, 2020, 7:48pm

I am trying to replicate the French-English results reported in the paper “Attention Is All You Need”. I am new to NLP, so I have a few questions.

The WMT-14 dataset for English-French consists of 5 different corpora (Europarl v7, Common Crawl corpus, UN corpus, News Commentary, and the 10^9 French-English corpus), with around 40M parallel pairs in total.
Is this right? Please correct me if I am wrong.
Since the dataset was collected from multiple sources, I tried removing duplicate sentence pairs, and the dataset shrank to 34M pairs. Is this expected? Does the WMT-14 dataset really contain exact duplicates?
Looking through the data, I found that it contains some odd tokens like " " (it seems to be some kind of delimiter inside sentences). Do I need to normalize the dataset first, or can I directly apply the OpenNMT tokenizer, then run preprocessing and training?
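
For concreteness, the kind of cleanup and exact-duplicate removal I am asking about could be sketched as below. This is only an illustrative sketch, not necessarily what the paper or OpenNMT does: the function names `normalize_line` and `dedup_pairs` are made up for this example, and the choice of NFKC normalization and of dropping invisible characters is an assumption on my part.

```python
import unicodedata


def normalize_line(text: str) -> str:
    """Normalize one sentence (assumed cleanup policy, not a standard recipe)."""
    # NFKC folds compatibility characters, e.g. a no-break space becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cc":
            # Control characters (tab, newline, ...) become a space.
            chars.append(" ")
        elif category == "Cf":
            # Invisible format characters (zero-width space, soft hyphen, ...) are dropped.
            continue
        else:
            chars.append(ch)
    # Collapse runs of whitespace and strip both ends.
    return " ".join("".join(chars).split())


def dedup_pairs(pairs):
    """Yield normalized (source, target) pairs, skipping empty sides and exact repeats."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize_line(src), normalize_line(tgt)
        if not src or not tgt:
            continue
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt


if __name__ == "__main__":
    sample = [
        ("Bonjour\u00a0le monde", "Hello world"),
        ("Bonjour le monde", "Hello  world"),  # duplicate after normalization
        ("Au revoir", "Goodbye"),
    ]
    for src, tgt in dedup_pairs(sample):
        print(src, "|||", tgt)
```

Note that this keeps every unique pair in memory, so for tens of millions of pairs it may be worth storing a hash of each pair instead of the strings themselves.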

Many thanks in advance for the help.