BPE vs Word tokenization for Preprocessing

Rishi · June 13, 2020, 1:55pm

I am a bit confused about the data preprocessing required to replicate the results reported in the paper “Attention Is All You Need” for English-French translation on the WMT dataset.

Do I need to apply word tokenization and then BPE, or is applying BPE alone to the sentence pairs enough?
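To make the two options concrete, here is a minimal toy sketch in Python. It is not the preprocessing used in the paper or in OpenNMT: `whitespace_split`, `word_tokenize`, and `learn_bpe` are hypothetical helpers written only for illustration, and the regex tokenizer is an assumption standing in for a real word tokenizer. It only shows that the units BPE learns from differ depending on whether punctuation has been split off first.

```python
import re
from collections import Counter


def whitespace_split(sentence):
    # Option "BPE alone": the BPE learner sees whitespace-separated chunks.
    return sentence.split()


def word_tokenize(sentence):
    # Option "tokenize, then BPE": toy tokenizer (an assumption, not a real
    # tool) that separates runs of word characters from punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def learn_bpe(words, num_merges):
    # Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


sentence = "Hello, world! Hello, again."
raw_units = whitespace_split(sentence)  # punctuation stays glued to words
tok_units = word_tokenize(sentence)     # punctuation becomes its own unit

raw_merges = learn_bpe(raw_units, 3)
tok_merges = learn_bpe(tok_units, 3)
print(raw_units)
print(tok_units)
```

In the first case the learner starts from chunks such as `Hello,`, so merges can involve punctuation attached to letters; in the second, `Hello` and `,` are separate units from the start.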

After applying BPE, I want to use this script:

python3 preprocess.py -train_src data/BPE/train.src -train_tgt data/BPE/train.tgt -valid_src data/BPE/val.src -valid_tgt data/BPE/val.tgt -save_data data/BPE/en_fr -src_vocab_size 100000 -tgt_vocab_size 100000

After this, I will train the Transformer and use it to translate.

Am I doing this right?

Many thanks in advance for the help.

francoishernandez · June 15, 2020, 9:32am