Step 1 - Tokenization of ‘Source.txt’ and ‘Target.txt’ with the -case_feature and -joiner_annotate options
th tools/tokenize.lua -case_feature -joiner_annotate < srchn10.txt > srchn10.tok
Here, ‘srchn10.txt’ is the source file containing 1.4 million sentences.
Step 2 - Learning the BPE model
th tools/learn_bpe.lua -size 50000 -save_bpe hnfritptro.bpe50000 < srchn10.tok
Here, ‘hnfritptro.bpe50000’ is the learned BPE model.
Step 3 - Retokenizing
th tools/tokenize.lua -case_feature -joiner_annotate -bpe_case_insensitive -bpe_model hnfritptro.bpe50000 < srchn10.tok > srchn.bpe50000
Step 4 - Preprocessing
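For step 4 I used OpenNMT's preprocess.lua. Only the source-side files are named in this post, so the target-side and validation file names below are placeholders for my actual files, which were tokenized with the same BPE model:

```shell
# Build source/target dictionaries and the serialized training data.
# tgthn.bpe50000 and the *-val.bpe50000 files are assumed names for the
# target and validation sets (tokenized the same way as the source).
th preprocess.lua \
  -train_src srchn.bpe50000 -train_tgt tgthn.bpe50000 \
  -valid_src src-val.bpe50000 -valid_tgt tgt-val.bpe50000 \
  -save_data hnfritptro
```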
Step 5 - Training
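For step 5, training was run with train.lua and default model settings; the -data file name follows from the -save_data prefix used during preprocessing (assumed here to be hnfritptro):

```shell
# Train with default model settings; -gpuid 1 assumes a single GPU.
th train.lua -data hnfritptro-train.t7 -save_model hnfritptro-model -gpuid 1
```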
Step 6 - Translation
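For step 6, translation was run with translate.lua on the BPE-tokenized test file, followed by detokenization before scoring BLEU. The checkpoint and test-file names below are placeholders:

```shell
# Translate the (BPE-tokenized) test set with a saved checkpoint.
# Replace the epoch/perplexity suffix with that of your best checkpoint.
th translate.lua -model hnfritptro-model_epoch13_X.XX.t7 \
  -src src-test.bpe50000 -output pred.bpe.txt

# Undo the BPE joiner marks and case features to get plain text output.
th tools/detokenize.lua -case_feature < pred.bpe.txt > pred.txt
```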
LINK TO THE FILES
After preprocessing, the dictionary size is under 1,000. Previously, when I ran the same pipeline without applying BPE, the dictionary size was around 50,000. What is the reason for this drop? Are all the steps above correct? Is a low perplexity score a good sign? And finally, how do I improve my BLEU score to 15% (as of now it is around 6%)?