BLEU score falling after detokenizing, detruecasing and removing subword BPE

Earlier I was calculating my BLEU score on the final output of the model, which is tokenized, truecased and split into BPE subwords (my reference data is also tokenized, truecased and BPE-encoded). In this case I got a BLEU score of about 0.24. But when I calculate BLEU after detokenizing, detruecasing and removing BPE, with the reference as plain text, I get a BLEU score of 0.19. Why does it fall?
I wish to compare the BLEU score of my translation against the Google Translation API. The Google Translation API returns plain text, so I also converted my output to plain text using the post-processing above. But then my score falls and I am not able to understand why. Can anyone guide me?

When computing BLEU on detokenized texts, you should use a tool that includes a standard tokenization such as sacreBLEU or multi-bleu-detok.perl.

Hi guillaumekln, I am using the multi-bleu-detok.perl script to calculate my BLEU score on detokenized data. Is it the BPE decoding that is affecting my BLEU score?

You should simply not compare tokenized and detokenized BLEU scores. They are not comparable.
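
To see why the two numbers live on different scales, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, applied to one made-up sentence pair segmented two ways. The sentences and the `ngram_precision` helper are illustrative assumptions, not taken from any toolkit or from the data in this thread.

```python
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

# Made-up example: the same hypothesis and reference, segmented two ways.
hyp_bpe = "▁the ▁aer opl ane ▁landed".split()
ref_bpe = "▁the ▁aer opl ane ▁crashed".split()
hyp_words = "the aeroplane landed".split()
ref_words = "the aeroplane crashed".split()

bpe_p2 = ngram_precision(hyp_bpe, ref_bpe, 2)       # 3 of 4 bigrams match
word_p2 = ngram_precision(hyp_words, ref_words, 2)  # 1 of 2 bigrams match
print(bpe_p2, word_p2)
```

With subwords, a rare word such as "aeroplane" contributes several matching n-grams instead of one, so scores on subword-level text typically come out higher than on word-level text even though the translation is identical.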

Okay, thanks guillaumekln.
Now I am trying to experiment with SentencePiece. For training the SentencePiece model, do I need to generate two models, one for the source (Hindi in my case) and one for the target (English), and then tokenize the respective data? Or can I just combine my source and target into one text file separated by a tab and generate a single model that I use to tokenize all corpora? In both cases, what would be the optimum vocab size? I have a corpus of around 1M parallel sentences.


Both approaches are viable

You can generate two models and apply them respectively to source and target data, or generate one model and apply it to both. However, for the latter I would not use a tab to separate source and target; just concatenate the two files.
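
A minimal sketch of the single-model option, assuming one sentence per line in each file: simply write one corpus after the other into a single training file. The file names and the `concatenate_corpora` helper are hypothetical, and the sentences are made up.

```python
import os
import tempfile

def concatenate_corpora(paths, out_path):
    """Write the lines of each file in `paths` one after another into `out_path`."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # Guarantee a trailing newline so two files never run together.
                    out.write(line.rstrip("\n") + "\n")

# Tiny demo with made-up sentences in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "train.hi")      # hypothetical file names
    tgt = os.path.join(tmp, "train.en")
    joint = os.path.join(tmp, "train.joint")
    with open(src, "w", encoding="utf-8") as f:
        f.write("नमस्ते दुनिया\nयह एक परीक्षण है")  # no final newline on purpose
    with open(tgt, "w", encoding="utf-8") as f:
        f.write("hello world\nthis is a test\n")
    concatenate_corpora([src, tgt], joint)
    with open(joint, encoding="utf-8") as f:
        joint_lines = f.read().splitlines()

print(len(joint_lines))
```

As far as I know, the SentencePiece trainer also accepts a comma-separated list of input files, so the explicit concatenation step may not even be needed.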

Also, I would recommend using SentencePiece on already tokenized data.

I use the Moses tokenizer, then SP (BPE algorithm) with the first approach (one model per language), and it works very well.

Thanks valentin for your response.
So basically my flow is going to be like this:
I tokenize the source (Hindi) using Indic NLP and the target (English) using Moses. Then I train two individual SentencePiece models and apply them to my source and target respectively. I feed those to my model for training.
Then I translate my test sentences (Hindi), which are tokenized and encoded using the SentencePiece English model. Then I decode my final output? And do I need to truecase my English sentences before encoding with SentencePiece? PS: I am keeping my vocab size at 10k, as it gave me good results with Subword NMT.

Then I translate my test sentences (Hindi), which are tokenized and encoded using the SentencePiece English model

Aren’t they encoded using the SP Hindi model instead of the English one?

Then I decode my final output?

Yes. If spm decode doesn’t work well, you can remove the whitespaces between the pieces and then replace each ▁ marker with a whitespace.

I personally do not alter the case at any time

Pardon, it is the Hindi model only.
Thanks valentin, I am putting my model into training. Will share my results once done.

Hi valentin,
This is the kind of output I am getting after using SentencePiece and then translating.
[‘▁in’, ‘▁Canada’, ‘▁&’, ‘apos’, ‘;’, ‘▁s’, ‘▁aer’, ‘opl’, ‘ane’, ‘▁and’, ‘▁Tra’, ‘ine’, ‘e’, ‘▁,’, ‘▁B’, ‘om’, ‘ard’, ‘ian’, ‘▁Inc’, ‘▁reported’, ‘▁a’, ‘▁decline’, ‘▁of’, ‘▁15’, ‘▁per’, ‘▁cent’, ‘▁in’, ‘▁the’, ‘▁net’, ‘▁profit’, ‘▁to’, ‘▁Gur’, ‘day’, ‘▁,’, ‘▁in’, ‘▁its’, ‘▁railway’, ‘▁unit’, ‘▁,’, ‘▁in’, ‘▁the’, ‘▁third’, ‘▁quarter’, ‘▁and’, ‘▁lower’, ‘▁air’, ‘▁orders’, ‘▁on’, ‘▁contract’, ‘▁issues’, ‘▁.’]

The problem is that in some outputs I do not have both [ ] square brackets, and I don’t know why. Because of this I am not able to decode; it gives an error.
‘▁decline’, ‘▁in’, ‘▁the’, ‘▁profits’, ‘▁of’, ‘▁the’, ‘▁B’, ‘om’, ‘inder’, ‘▁as’, ‘▁the’, ‘▁delivery’, ‘▁of’, ‘▁the’, ‘▁air’, ‘port’, ‘▁,’, ‘▁loss’, ‘▁of’, ‘▁orders’, ‘▁.’]

One more problem is that I am getting a [ bracket at random places in the output, which breaks my sp.DecodePieces call.

Is this coming from OpenNMT-py? Look for a pred.txt file.

Yes, I am using SentencePiece for encoding and decoding.
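
One possible cause, and this is only a guess from the sample above: if the encoded training files were written by printing the Python list returned by the encoder, the brackets, quotes and commas become part of the tokens the model learns, and it can then emit them in odd places. The sketch below uses made-up pieces to contrast the two ways of writing a line, plus a hypothetical `salvage_pieces` helper for recovering pieces from a line that has lost a bracket.

```python
import re

pieces = ["▁decline", "▁in", "▁the", "▁profits", "▁."]

# Writing the list object itself puts brackets, quotes and commas into the file.
as_repr = str(pieces)
# Writing space-separated pieces gives exactly one clean token per piece.
as_line = " ".join(pieces)

# What a whitespace-splitting toolkit would see in each case.
repr_tokens = as_repr.split()
line_tokens = as_line.split()

def salvage_pieces(line):
    """Pull quoted pieces out of a list-like line, even if a bracket is missing.

    Accepts straight or curly single quotes around each piece.
    """
    return re.findall(r"['‘]([^'‘’]+)['’]", line)

recovered = salvage_pieces("'▁decline', '▁in', ['▁the', '▁profits', '▁.']")
print(repr_tokens[0], line_tokens[0], recovered)
```

The salvage regex will break on a piece that is itself a quote character, so re-encoding the data as space-separated pieces and retraining is probably the cleaner route.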

Hi Valentin,
In my case Subword NMT and SentencePiece are giving approximately the same BLEU scores, and by visual inspection the outputs are also more or less similar. Some outputs are better with Subword NMT, some are better with SentencePiece. It is quite difficult to choose one.


Thanks for sharing. Could you tell us what parameters you’ve used with Subword NMT and SentencePiece?

For Subword NMT I used 10k merge operations and for SentencePiece I used a 10k vocab size, which gave me a vocab size of around 32k for my NMT model. I have a parallel corpus of around 1M sentences; is this the optimum vocab size? Also, shouldn’t SentencePiece outperform Subword NMT?

Glad I found this discussion. With my Turkish-English Transformer model I just got a BLEU of 65.96 on my SentencePiece encoded BPE predictions (10000 sentences) and a drop to 59.11 after decoding the text and reference.