Issue on Dictionary Formation and translation processes

sai10 · October 23, 2017, 12:13pm

Step 1 - Tokenization of ‘Source.txt’ and ‘Target.txt’ with case_feature and joiner_annotate as OPTIONS

 th tools/tokenize.lua  -case_feature -joiner_annotate < srchn10.txt > srchn10.tok

Here, ‘srchn10.txt’ is the source file, which contains 1.4 million sentences.

Step 2 - Learning the BPE model

 th tools/learn_bpe.lua  -size 50000 -save_bpe  hnfritptro.bpe50000 < srchn10.tok

Here, ‘hnfritptro.bpe50000’ is the BPE model.
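For readers unfamiliar with what the BPE learning step computes, here is a minimal toy sketch of byte pair encoding in Python. It is not the code in tools/learn_bpe.lua; it only illustrates the idea of repeatedly merging the most frequent adjacent symbol pair. The function name `learn_bpe` and the parameter `num_merges` are hypothetical, and the sketch assumes `num_merges` plays the role of the `-size` option (the number of merge operations).

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner: returns the merge rules and the final segmented vocabulary."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken alphabetically so the result is deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing each occurrence of the pair by the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus, not taken from the thread's data.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, 3)
print(merges)
```

With a large number of merges on 1.4 million sentences, the resulting subword inventory would normally be far larger than a few hundred entries, which is why a dictionary under 1000 is suspicious.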

Step 3 - Retokenizing

 th tools/tokenize.lua -case_feature -joiner_annotate  -bpe_case_insensitive  -bpe_model hnfritptro.bpe50000  < srchn10.tok > srchn.bpe50000

LINK TO THE FILES

FILES

After preprocessing, the dictionary size is under 1,000. Previously, when I ran the process without applying BPE, the dictionary size was around 50,000. What is the reason for this? Are all the steps above correct? Is it good to have a low perplexity score? And lastly, how can I improve my BLEU score to 15% (it is currently around 6%)?
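On the perplexity question: perplexity is the exponential of the average negative log-probability the model assigns to the reference tokens, so lower generally means the model is less "surprised" by the data. The sketch below only illustrates that definition; the function name `perplexity` and the probabilities are made up for the example. Note that perplexities measured with different dictionaries are not directly comparable, so a low value obtained with a very small dictionary does not by itself indicate good translations.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)), using natural logarithms."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token probabilities assigned to the reference tokens.
confident = [math.log(0.5)] * 4   # each reference token gets probability 0.5
uncertain = [math.log(0.1)] * 4   # each reference token gets probability 0.1

print(perplexity(confident))
print(perplexity(uncertain))
```

The first model has the lower perplexity because it puts more probability on the correct tokens.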

guillaumekln · October 23, 2017, 12:18pm

cc @DYCSystran

guillaumekln · October 23, 2017, 12:27pm

There was an issue when loading a BPE model. Sorry about that.

Could you update to the latest commit and reapply the BPE tokenization on your data (and of course retrain your model)?
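After reapplying the BPE tokenization, one quick sanity check before preprocessing is to count the distinct tokens in the output file. The following is a minimal sketch, assuming tokens are separated by spaces and that the case feature is attached with the `￨` separator; the helper name `count_vocab` and the sample lines are hypothetical.

```python
from collections import Counter

FEATURE_SEP = "￨"  # assumed separator between a token and its case feature

def count_vocab(lines):
    """Count token frequencies, ignoring any word feature attached to a token."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Keep only the part before the feature separator, if there is one.
            counts[token.split(FEATURE_SEP, 1)[0]] += 1
    return counts

# Hypothetical tokenized lines, not taken from the thread's data.
sample = ["hello￨C world￨L ￭!￨N", "hello￨L again￨L"]
vocab = count_vocab(sample)
print(len(vocab), vocab.most_common(1))
```

To run it on real data, pass an open file instead of the sample list, for example `count_vocab(open("srchn.bpe50000", encoding="utf-8"))`. If the count is still only a few hundred, the BPE model was probably not applied as intended.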

sai10 · October 23, 2017, 12:38pm

When retokenizing, should we apply the process to the raw text files or to the tokenized files?

guillaumekln · October 23, 2017, 12:42pm

Both are supported. Read the documentation to determine which workflow works better for you:

http://opennmt.net/OpenNMT/tools/tokenization/#bpe

sai10 · October 23, 2017, 12:49pm

And lastly, when should we detokenize?

guillaumekln · October 23, 2017, 1:08pm

Just after the translation?
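To make concrete what detokenizing the translation output involves, here is a simplified Python sketch. It is not tools/detokenize.lua; it assumes the joiner marker is `￭`, the case feature is attached with `￨`, and the feature values `C` (capitalized) and `U` (uppercase) need restoring while other values are left unchanged. The function names are hypothetical.

```python
JOINER = "￭"       # assumed joiner marker produced by joiner_annotate
FEATURE_SEP = "￨"  # assumed separator before the case feature

def restore_case(word, feature):
    """Re-apply casing from the case feature (simplified)."""
    if feature == "C":
        return word[:1].upper() + word[1:]
    if feature == "U":
        return word.upper()
    return word  # other or missing features: left unchanged in this sketch

def detokenize(line):
    """Glue tokens back together at joiner markers and restore case."""
    out = ""
    glue_next = False  # True when the previous token ended with a joiner
    for token in line.split():
        word, _, feature = token.partition(FEATURE_SEP)
        glue_left = word.startswith(JOINER)
        if glue_left:
            word = word[len(JOINER):]
        glue_right = word.endswith(JOINER)
        if glue_right:
            word = word[:-len(JOINER)]
        word = restore_case(word, feature)
        # Insert a space unless a joiner says the tokens belong together.
        if out and not (glue_left or glue_next):
            out += " "
        out += word
        glue_next = glue_right
    return out

print(detokenize("hello￨C wor￭￨L ld￨L ￭!￨N"))
```

This step comes last because the model is trained on, and therefore produces, tokenized text with joiners and case features.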

DA1234k · October 23, 2017, 4:54pm

Sir, I have a problem when using the case_feature option. If I do not use case_feature, the output is detokenized. What is the main problem? I have updated all of those things.
