BLEU score falling after detokenizing, detruecasing and removing subword BPE

Earlier I was calculating my BLEU score on the final output of the model, which is tokenized, truecased and BPE-segmented (my reference data is also tokenized, truecased and BPE-segmented). In this case I got a BLEU score of about 0.24. But when I calculate the BLEU score after detokenizing, detruecasing and removing the BPE segmentation, with the reference as plain text, I get a BLEU score of 0.19. Why this fall?
I wish to compare the BLEU score of my translations against the Google Translation API. The API returns plain text, so I also converted my output into plain text using the post-processing above. But then my score falls and I am not able to understand why. Can anyone guide me?

When computing BLEU on detokenized texts, you should use a tool that includes a standard tokenization such as sacreBLEU or multi-bleu-detok.perl.
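
For example, with the sacreBLEU Python API the detokenized output and reference can be scored directly, since sacreBLEU applies its own tokenization internally. A minimal sketch, with placeholder file names:

    # Score plain, detokenized hypotheses against a plain-text reference.
    # File names below are placeholders.
    import sacrebleu

    with open("pred.detok.txt", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open("ref.detok.txt", encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # sacreBLEU tokenizes internally, so the inputs stay plain text.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(bleu.score)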

Hi guillaumekln, I am using the multi-bleu-detok.perl script to calculate my BLEU score on detokenized data. Is it the BPE decoding that is affecting my BLEU score?

You should simply not compare tokenized and detokenized BLEU scores. They are not comparable.

Okay, thanks guillaumekln.
Now I am trying to experiment with SentencePiece. Can you tell me: for training the SentencePiece model, do I need to generate two models, one for the source (in my case Hindi) and one for the target (English), and then tokenize the respective data? Or can I just combine my source and target into one text file, separated by a tab, and generate one single model which I will use to tokenize all corpora? In both cases, what should be the optimum vocab size? I have a corpus of around 1M parallel sentences.

Hi,

Both approaches are viable.

You can generate two models and apply them respectively to the source and target data, or generate one model and apply it to both. However, for the latter I would not use a tab to separate the source and target; just concatenate the two files.

Also, I would recommend using SentencePiece on already tokenized data.

I use the Moses tokenizer, then SentencePiece (BPE algorithm) with the first approach (one model per language), and it works very well.
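
A rough sketch of that setup with the SentencePiece Python API, training one BPE model per language on the already tokenized data (file names and the 10k vocabulary size are only illustrative):

    # One SentencePiece BPE model per language, trained on tokenized data.
    # File names and vocab_size are placeholders.
    import sentencepiece as spm

    for lang, corpus in [("hi", "train.tok.hi"), ("en", "train.tok.en")]:
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=f"spm_{lang}",  # writes spm_hi.model / spm_en.model
            vocab_size=10000,
            model_type="bpe",
            character_coverage=1.0,
        )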

Thanks Valentin for your response.
So basically my flow is going to be like this:
I tokenize the source (Hindi) using Indic NLP and the target (English) using Moses. Then I generate two individual SentencePiece models and apply them to my source and target respectively. I feed those to my model for training.
Then I translate my test sentences (Hindi), which are tokenized and encoded using the SentencePiece English model. Then I decode my final output? And do I need to truecase my English sentences before encoding with SentencePiece? PS: I am keeping my vocab size at 10k as it gave me good results with subword-nmt.

Then i translate my test sentences(hindi) which are tokenized and encoded using sentencepiece english model

Aren’t they encoded using the SP Hindi model instead of the English one?

Then I decode my final output?

Yes. If spm decode doesn’t work well, you can remove the whitespaces between the pieces and replace ▁ with a whitespace.
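
As a sketch, that fallback could look like this, given a list of pieces:

    # Manual fallback when spm decode is not usable: join the pieces
    # without spaces, then turn the "▁" marker back into a real space.
    def pieces_to_text(pieces):
        return "".join(pieces).replace("\u2581", " ").strip()

    print(pieces_to_text(["\u2581in", "\u2581Canada", "\u2581&", "apos", ";"]))
    # -> "in Canada &apos;"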

I personally do not alter the case at any time.

Pardon, it’s the Hindi model only.
Thanks Valentin, I am putting my model up for training. Will share my results once done.
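So my encoding step for the test data will be roughly something like this (paths here are just placeholders):

    # Encode the tokenized Hindi test set with the source-side model.
    # Paths are placeholders.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm_hi.model")

    with open("test.tok.hi", encoding="utf-8") as fin, \
            open("test.sp.hi", "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = sp.encode(line.strip(), out_type=str)
            fout.write(" ".join(pieces) + "\n")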

Hi Valentin,
This is the kind of output I am getting after using SentencePiece and then translating:
[‘▁in’, ‘▁Canada’, ‘▁&’, ‘apos’, ‘;’, ‘▁s’, ‘▁aer’, ‘opl’, ‘ane’, ‘▁and’, ‘▁Tra’, ‘ine’, ‘e’, ‘▁,’, ‘▁B’, ‘om’, ‘ard’, ‘ian’, ‘▁Inc’, ‘▁reported’, ‘▁a’, ‘▁decline’, ‘▁of’, ‘▁15’, ‘▁per’, ‘▁cent’, ‘▁in’, ‘▁the’, ‘▁net’, ‘▁profit’, ‘▁to’, ‘▁Gur’, ‘day’, ‘▁,’, ‘▁in’, ‘▁its’, ‘▁railway’, ‘▁unit’, ‘▁,’, ‘▁in’, ‘▁the’, ‘▁third’, ‘▁quarter’, ‘▁and’, ‘▁lower’, ‘▁air’, ‘▁orders’, ‘▁on’, ‘▁contract’, ‘▁issues’, ‘▁.’]

The problem is, in some outputs I am not getting the [ ] square brackets, I don’t know why. And because of this I am not able to decode; it gives an error.
‘▁decline’, ‘▁in’, ‘▁the’, ‘▁profits’, ‘▁of’, ‘▁the’, ‘▁B’, ‘om’, ‘inder’, ‘▁as’, ‘▁the’, ‘▁delivery’, ‘▁of’, ‘▁the’, ‘▁air’, ‘port’, ‘▁,’, ‘▁loss’, ‘▁of’, ‘▁orders’, ‘▁.’]

And one more problem is that I am getting a [ bracket at a random place in the output, which makes my sp.DecodePieces call fail.

Is this coming from OpenNMT-py? Look for a pred.txt file.
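
If it is, pred.txt normally contains the predicted pieces as plain space-separated tokens (no brackets), so something along these lines should decode it (file names are placeholders):

    # Decode a prediction file whose lines are space-separated
    # SentencePiece pieces. File names are placeholders.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm_en.model")

    with open("pred.txt", encoding="utf-8") as fin, \
            open("pred.detok.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(sp.decode_pieces(line.split()) + "\n")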

Yes, I am using SentencePiece for encoding and decoding.

Hi Valentin,
In my case subword-nmt and SentencePiece are giving approximately the same BLEU scores. By visual inspection they are also more or less similar: some outputs are better with subword-nmt, some are better with SentencePiece. It is quite difficult to choose one.

Hi,

Thanks for sharing. Could you tell us what parameters you used with subword-nmt and SentencePiece?

For subword-nmt I used 10k merge operations, and for SentencePiece I used a 10k vocab size, which gave me around a 32k vocab for my NMT model. I have a parallel corpus of around 1M sentences; is this the optimum vocab size? Also, shouldn’t SentencePiece outperform subword-nmt?

Glad I found this discussion. With my Turkish-English Transformer model I just got a BLEU of 65.96 on my SentencePiece-encoded BPE predictions (10,000 sentences) and a drop to 59.11 after decoding the text and the reference.

Can anyone help me with decoding output that is encoded using BPE? I cannot find a method to decode the output text in order to compute the BLEU score on the decoded text.
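
If the output was segmented with subword-nmt (BPE with "@@ " joiners), one option is to strip the joiners with a simple replacement before scoring; a minimal sketch with placeholder file names (for SentencePiece output, see the decoding sketches above):

    # Undo subword-nmt style BPE, which marks split points with "@@ ".
    # File names are placeholders.
    import re

    def remove_bpe(line):
        return re.sub(r"(@@ )|(@@ ?$)", "", line)

    with open("pred.bpe.txt", encoding="utf-8") as fin, \
            open("pred.txt", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(remove_bpe(line.rstrip("\n")) + "\n")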