OpenNMT Forum

Cleaning and Normalizing WMT Dataset for French to English

I am trying to replicate the French-English results reported in the paper “Attention Is All You Need”. I am new to NLP, so I have a few questions.

  1. The WMT-14 dataset for English-French consists of 5 different corpora (Europarl v7, Common Crawl corpus, UN corpus, News Commentary, and the 10^9 French-English corpus), with around 40M parallel sentence pairs in total.
    Is this right? Please correct me if I am wrong.

  2. Since the dataset was collected from multiple sources, I tried removing duplicate sentence pairs, and the dataset shrank to 34M pairs. Is this expected? Does the WMT-14 dataset really contain exact duplicates?

  3. While inspecting the data, I found that it contains some odd tokens like " " (it seems to be a kind of delimiter inside sentences). Do I need to normalize the dataset first, or can I directly apply the OpenNMT tokenizer and then run preprocessing followed by training?
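
To make points 2 and 3 concrete, here is a minimal Python sketch of the kind of cleanup being asked about. It uses only the standard library and is independent of OpenNMT. The function names `normalize_line` and `dedup_pairs` are hypothetical, and the choice of NFKC normalization plus removal of control and format characters is an assumption about what a reasonable cleanup might look like, not the procedure used in the paper. Note that it removes duplicates after normalization, so the resulting count may differ from the 34M obtained with exact matching.

```python
import unicodedata


def normalize_line(text: str) -> str:
    """Illustrative cleanup for one sentence (assumed policy, not a standard)."""
    # NFKC folds compatibility characters, e.g. a no-break space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cc":
            # Control characters (tabs, stray line breaks) become a space.
            chars.append(" ")
        elif category == "Cf":
            # Invisible format characters (zero-width space, soft hyphen) are dropped.
            continue
        else:
            chars.append(ch)
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(chars).split())


def dedup_pairs(pairs):
    """Yield each normalized (source, target) pair once, skipping empty sides."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize_line(src), normalize_line(tgt)
        if not src or not tgt:
            continue
        key = (src, tgt)
        if key in seen:
            continue
        seen.add(key)
        yield src, tgt


if __name__ == "__main__":
    sample = [
        ("Bonjour\u00a0le monde", "Hello world"),
        ("Bonjour le monde", "Hello  world"),  # duplicate after normalization
        ("Au revoir", "Goodbye"),
    ]
    for src, tgt in dedup_pairs(sample):
        print(src, "|||", tgt)
```

Keeping every pair in an in-memory set is simple but may need a lot of RAM at the scale of 40M pairs; storing a hash of each pair instead of the strings themselves is one possible way to reduce that.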

Many thanks in advance for the help.