I have been running multiple experiments with BPE, but it seems to always decrease performance on French-to-English translation. I have calculated both tokenized and detokenized scores, but either way I get a lower BLEU than the baseline.
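For reference when comparing scores, here is a minimal sketch of how the subword markers could be stripped from the translated output before computing BLEU. It assumes the `@@` separator that apply_bpe.py uses by default; `undo_bpe` is a hypothetical helper name, not something shipped with OpenNMT-py.

```python
import re


def undo_bpe(line: str, separator: str = "@@") -> str:
    """Join subword pieces back into whole words.

    Assumes pieces are marked with a trailing separator ("@@" by default),
    e.g. "trans@@ lation" -> "translation".
    """
    # Remove the separator when it is followed by a space or ends the line.
    pattern = re.escape(separator) + r"( |$)"
    return re.sub(pattern, "", line)


if __name__ == "__main__":
    print(undo_bpe("the trans@@ lation is go@@ od"))
```

If the hypothesis still contains `@@` markers when it is scored against a reference without them, BLEU will look worse than it really is, so this step is worth checking.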
The pipeline I am using is the following:
- Tokenize the data using the script tools/tokenizer.perl
- Learn BPE codes (I tried both shared codes and separate codes):
OpenNMT-py-master/tools/learn_bpe.py -s 32000 < data.txt.tok > bpe-codes
- Apply BPE codes:
OpenNMT-py-master/tools/apply_bpe.py -c bpe-codes < src-train.txt.tok > train.src.bpe
- Then, run OpenNMT-py preprocess and train
I am using 2.5M sentences and a 2-layer encoder and decoder.
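To make the learn/apply steps above concrete, here is a toy sketch of what they do conceptually. It is not the code in learn_bpe.py or apply_bpe.py; the function names, the `</w>` end-of-word marker and the tie-breaking are assumptions made for illustration. `num_merges` plays the role of the `-s 32000` option.

```python
from collections import Counter

END = "</w>"  # hypothetical end-of-word marker used only in this sketch


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges, separator="@@"):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    pieces = list(symbols)
    # Drop the end-of-word marker, whether or not it was merged into a piece.
    if pieces[-1] == END:
        pieces.pop()
    else:
        pieces[-1] = pieces[-1][: -len(END)]
    return (separator + " ").join(pieces)


if __name__ == "__main__":
    codes = learn_bpe(["low", "low", "lower"], num_merges=10)
    for w in ["low", "lower", "lowest"]:
        print(w, "->", apply_bpe(w, codes))
```

With shared codes the merges would be learned once on the concatenated French and English text; with separate codes `learn_bpe` would be run once per language and each side segmented with its own list.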