Hello…
I have been trying to use BPE-dropout during training with on-the-fly tokenization. Training looks fine, with the validation perplexity dropping. But at test time (during translate), the generated output is a mix of the source and target languages, with the majority of tokens belonging to the source language rather than the target language.
For example, for an English->Assamese model,
if the source sentence (with BPE applied) is:
" farming is the primary occu@@ pation of man ."
the target output (with BPE applied) should be:
"কৃষি মানুহৰ এক প্ৰধান জীৱ@@ িকা ।"
I cannot figure out the error…
I used the following in the train.yaml file:
# Tokenization options
src_subword_type: bpe
src_subword_model: path to the source language BPE codes file
tgt_subword_type: bpe
tgt_subword_model: path to the target language BPE codes file
#smoothing parameter for sentencepiece regularization / dropout probability for BPE, source side
src_subword_alpha: 0.1
#smoothing parameter for sentencepiece regularization / dropout probability for BPE, target side
tgt_subword_alpha: 0.1
For translation, I used standard BPE (no dropout).
The OpenNMT-py documentation describes the procedure for SentencePiece, but I cannot find any direct tutorial for BPE-dropout.
Please help… I have been trying desperately for the past three days.
This looks like your data is not properly tokenized (tokenized too much actually).
Also, when using onmt_tokenize with spacer_annotate, you should not get @@ joiners but ▁ spacers. So either your data is pre-tokenized with subword-nmt, or maybe you set the bpe transform instead of onmt_tokenize.
Basically, your tokenization options are not right. You might want to look through the Tokenizer docs to find the options you actually need, and then adapt your config accordingly.
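For reference, with the onmt_tokenize transform the pyonmttok-specific options go through the *_onmttok_kwargs entries. A rough sketch of what the tokenization block could look like for BPE with joiner annotation (untested; paths are placeholders, and pick the mode that matches how your BPE codes were learned):

# Tokenization options
src_subword_type: bpe
src_subword_model: path/to/src.bpe.codes
tgt_subword_type: bpe
tgt_subword_model: path/to/tgt.bpe.codes
# dropout probability for BPE, source/target side
src_subword_alpha: 0.1
tgt_subword_alpha: 0.1
# pyonmttok arguments; note joiner_annotate marks subwords with
# pyonmttok's joiner (￭ by default), NOT the @@ that subword-nmt produces
src_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"

Whichever annotation you choose, the key point is that inference-time tokenization must be produced by the exact same tokenizer and options as training.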
Also, a general comment: if you want to maximize your chances of getting help, post your FULL config.
src_vocab: path to source vocab
tgt_vocab: path to target vocab
overwrite: False
src_seq_length: 200
tgt_seq_length: 200
data:
    corpus_1:
        path_src: path to RAW source file
        path_tgt: path to RAW target file
        transforms: [onmt_tokenize, filtertoolong]
    valid:
        path_src: path to RAW source file
        path_tgt: path to RAW target file
        transforms: [onmt_tokenize, filtertoolong]
Inference
Problem: I had pre-tokenized the TEST file with subword-nmt, which produced output with @@ as the joiner → this doesn't seem to work.
Questions:
Is it correct to use spacer_annotate while training with BPE-dropout, or should we go with joiner_annotate?
How should I use onmt-tokenizer during inference? I did try it (for the very first time; I am used to subword-nmt) as follows:
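Something along these lines (a minimal sketch of my attempt; the paths are placeholders, and part of my question is whether the mode and annotation settings here actually match what the onmt_tokenize transform did at training time):

import pyonmttok

# Build a tokenizer with the same BPE codes used for training.
# "aggressive" and joiner_annotate are guesses on my part; no bpe_dropout
# at inference time, so the segmentation is deterministic.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="path/to/src.bpe.codes",  # placeholder
    joiner_annotate=True,
)

# Tokenize the raw test file line by line before running onmt_translate.
with open("test.raw.src") as fin, open("test.tok.src", "w") as fout:
    for line in fin:
        tokens, _ = tokenizer.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")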