I am running some experiments with a BPE model. My training corpus is originally truecased and not tokenized.
To build the translation engine, I first train the BPE model on the original (untokenized) corpus.
Then I tokenize the training, dev and test sets using
The problem is that, because the BPE model is trained on the original corpus, it contains subword units such as “t .”, “s ,”, “you ’”, i.e. units that include punctuation marks.
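To illustrate why this happens, here is a toy sketch (not the actual learn_bpe implementation) of the classic BPE pair-counting step. On an untokenized corpus, punctuation stays glued to the preceding word, so a pair like (“s”, “,”) becomes the most frequent merge candidate; on a pre-tokenized corpus it never can:

```python
from collections import Counter

def best_merge(corpus_words):
    # Count adjacent symbol pairs over a word-frequency dict (classic BPE step).
    pairs = Counter()
    for word, freq in corpus_words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(3)

# Untokenized corpus: the comma is part of the "word",
# so ("s", ",") is a countable adjacent pair.
raw = Counter(["cats,", "dogs,", "cats,", "dogs."])
words_raw = {" ".join(w): f for w, f in raw.items()}
print(best_merge(words_raw))  # top pair is ('s', ',')

# Pre-tokenized corpus: punctuation is its own token,
# so no pair ever spans a word/punctuation boundary.
tok = Counter(["cats", ",", "dogs", ",", "cats", ",", "dogs", "."])
words_tok = {" ".join(w): f for w, f in tok.items()}
print(best_merge(words_tok))
```

This is exactly how units like “s ,” end up in a BPE model learned on raw text.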
After training, I test my model using the following command:
th tools/rest_translation_server.lua -gpuid 1 -model _epoch19_3.45.t7 -case_feature -port 9000 -host 0.0.0.0 -joiner_annotate -bpe_model bpe_config.en -segment_numbers
The tokenization differs between the following two examples:
cats,dogs,frogs -> cat￭￨L s￭￨L ,￭￨N dog￭￨L s￭￨L ,￭￨N frogs￨L
cats, dogs, frogs -> cats￨L ￭,￨N dogs￨L ￭,￨N frogs￨L
The tokenization changes depending on whether there is a space after the comma. It seems that tokenizer.lua first applies the BPE model and only then does the usual whitespace tokenization. Shouldn’t it be the other way around: tokenize first, then apply the BPE model?
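The tokenize-first order I would expect can be sketched like this (a toy illustration, not the actual tokenizer.lua internals; `simple_tokenize` stands in for the real tokenizer’s punctuation splitting):

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for the tokenizer's punctuation splitting:
    # separate punctuation from words, then split on whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

def apply_bpe(token, merges):
    # Greedily apply learned merges inside a single token.
    symbols = list(token)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("c", "a"), ("ca", "t")]  # toy merge list
for text in ["cats,dogs", "cats, dogs"]:
    out = [apply_bpe(t, merges) for t in simple_tokenize(text)]
    print(text, "->", out)
```

With this order, both spellings produce identical subword sequences, because the spacing difference is normalized away by tokenization before BPE ever sees the tokens.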
Or should I train the BPE model on a tokenized, lowercased corpus to avoid units like “t .”?