Issue with dictionary formation and translation processes

(Acharya) #1

Step 1 - Tokenization of ‘Source.txt’ and ‘Target.txt’ with case_feature and joiner_annotate as options

 th tools/tokenize.lua -case_feature -joiner_annotate < srchn10.txt > srchn10.tok

Here, ‘srchn10.txt’ is the source file containing 1.4 million sentences.

Step 2 - Learning the BPE model

 th tools/learn_bpe.lua -size 50000 -save_bpe hnfritptro.bpe50000 < srchn10.tok

Here, ‘hnfritptro.bpe50000’ is the BPE model.

Step 3 - Retokenizing

 th tools/tokenize.lua -case_feature -joiner_annotate -bpe_case_insensitive -bpe_model hnfritptro.bpe50000 < srchn10.tok > srchn.bpe50000

Step 4 - Preprocessing


Step 5 - Training


Step 6 - Translation




After preprocessing, the dictionary size is under 1,000. Previously, when I ran the process without applying BPE, the dictionary size was around 50,000. What is the reason for this? Are all of the steps above correct? Is it good to have a low perplexity score? And lastly, how can I improve my BLEU score to 15% (it is currently around 6%)?
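One way to sanity-check the dictionary size before preprocessing is to count the distinct tokens in the BPE-tokenized file directly. The snippet below is only a rough sketch: the function name `count_vocab` is made up for illustration, and it assumes that features are attached to each token with the "￨" separator (U+FFE8) and that joiners are marked with "￭" (U+FFED).

```python
from collections import Counter

# Assumed markers: "￨" separates a token from its case feature,
# "￭" marks a token that must be glued to its neighbour.
FEATURE_SEP = "\uFFE8"
JOINER = "\uFFED"


def count_vocab(lines):
    """Count distinct word types, ignoring any attached features."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Keep only the word part, drop everything after the separator.
            word = token.split(FEATURE_SEP, 1)[0]
            counts[word] += 1
    return counts


# Tiny hand-made sample standing in for a tokenized file.
sample = [
    f"hel{JOINER}{FEATURE_SEP}C lo{FEATURE_SEP}L world{FEATURE_SEP}L",
    f"hel{JOINER}{FEATURE_SEP}L lo{FEATURE_SEP}L",
]
vocab = count_vocab(sample)
print(len(vocab), vocab.most_common(2))
```

To run it on real data, pass an open file instead of the sample list, for example `count_vocab(open("srchn.bpe50000", encoding="utf-8"))`. If the count is far below the number of BPE merges requested, the BPE model was probably not applied as intended.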

(Guillaume Klein) #2

cc @DYCSystran

(Guillaume Klein) #3

There was an issue when loading a BPE model. Sorry about that.

Could you update to the latest commit and reapply the BPE tokenization on your data (and of course retrain your model)?

(Acharya) #4

When retokenizing, should we apply the process to the raw text files or to the tokenized files?

(Guillaume Klein) #5

Both are supported. Read the documentation to determine which workflow works better for you:

(Acharya) #6

And lastly when should we detokenize?

(Guillaume Klein) #7

Just after the translation?
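To illustrate what detokenizing the translation output involves, here is a minimal sketch and not the toolkit's own detokenizer. It assumes "￭" (U+FFED) as the joiner marker, "￨" (U+FFE8) as the feature separator, and the case values C (capitalized) and U (uppercase); the function names are hypothetical.

```python
JOINER = "\uFFED"       # assumed joiner marker
FEATURE_SEP = "\uFFE8"  # assumed token/feature separator


def apply_case(word, case):
    """Restore casing from the assumed case feature value."""
    if case == "C":
        return word[:1].upper() + word[1:]
    if case == "U":
        return word.upper()
    return word  # any other value: leave the word unchanged


def detokenize(line):
    """Merge tokens back into plain text using joiner marks and case features."""
    out = ""
    glue_next = False
    for token in line.split():
        word, _, case = token.partition(FEATURE_SEP)
        glue_left = word.startswith(JOINER)
        glue_right = word.endswith(JOINER)
        word = apply_case(word.strip(JOINER), case)
        # Insert a space unless a joiner says the tokens belong together.
        if out and not (glue_left or glue_next):
            out += " "
        out += word
        glue_next = glue_right
    return out


translated = (
    f"hello{FEATURE_SEP}C world{FEATURE_SEP}L {JOINER}!{FEATURE_SEP}N"
)
print(detokenize(translated))
```

The translation model is trained on tokenized text and produces tokenized text, so this step runs once, on the model output, just before the result is scored or shown to a reader.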

(Aditya) #8

Sir, I have a problem when using the case_feature option. If I do not use case_feature, the output is detokenized. What is the main problem? I have updated everything.