I am a bit confused about the data preprocessing required to replicate the results reported in the paper "Attention Is All You Need" for English-French translation on the WMT dataset.
Do I need to apply word tokenization first and then BPE, or is applying BPE alone to the sentence pairs enough?
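For context, my current understanding of what the BPE step does is roughly the merge loop described by Sennrich et al. — this is just a toy sketch on a hypothetical mini-vocabulary to check my understanding, not the script I would run on the WMT data:

```python
import re
import collections

def get_stats(vocab):
    # Count frequencies of adjacent symbol pairs across the vocabulary.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the chosen pair with its merged symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        v_out[pattern.sub(''.join(pair), word)] = v_in[word]
    return v_out

# Hypothetical word-frequency vocabulary; </w> marks end of word.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

for _ in range(10):  # 10 merge operations for this toy example
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)

print(vocab)
```

If this is the right picture, my plan would be to run the equivalent learn/apply steps with subword-nmt on the tokenized corpus before the OpenNMT preprocessing below.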
After applying BPE, I want to use this script:
python3 preprocess.py -train_src data/BPE/train.src -train_tgt data/BPE/train.tgt -valid_src data/BPE/val.src -valid_tgt data/BPE/val.tgt -save_data data/BPE/en_fr -src_vocab_size 100000 -tgt_vocab_size 100000
After that, I will train the Transformer and run translation.
Am I doing this right?
Many thanks in advance for your help.