Low BLEU score with SentencePiece compared to other tokenizers

I used OpenNMT-tf with SentencePiece to train an English-to-Chinese Transformer. I also used the tensor2tensor tokenizer to tokenize the same data and OpenNMT to train the same model.
It’s weird that when OpenNMT-tf automatically evaluated BLEU on the eval dataset, the model with the SentencePiece tokenizer got a BLEU score of 13, while the model with the t2t tokenizer got 20. The two scores are very different.
So I exported both models and evaluated them myself: the model with the SentencePiece tokenizer got a BLEU score of 24, and the model with the t2t tokenizer got 26. These scores look normal.
This is my data.yml when using SentencePiece. I don’t know what is making the SentencePiece scores so much lower than its real scores.

model_dir: ./check_point/

data:
  train_features_file: ./bpe_train.en
  train_labels_file: ./bpe_train.zh
  eval_features_file: ./bpe_test.en
  eval_labels_file: ./bpe_test.zh
  source_vocabulary: ./spm_en.vocab
  target_vocabulary: ./spm_zh.vocab
  export_vocabulary_assets: True
train:
  save_checkpoints_steps: 2000
  max_step: 2000000
  train_steps: 2000000
  batch_size: 8192
eval:
  eval_delay: 3600  # Every 1 hour
  external_evaluators: BLEU
  export_on_best: bleu
  export_format: ctranslate2
  max_exports_to_keep: 5
infer:
  batch_size: 32
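For context, files like bpe_train.en and spm_en.vocab referenced above can be produced with the SentencePiece Python API along these lines (the vocab size and file names here are only illustrative, not necessarily what was used):

    import sentencepiece as spm

    # Train a SentencePiece model for the source language; this writes
    # spm_en.model and spm_en.vocab (vocab_size is illustrative).
    spm.SentencePieceTrainer.Train(
        "--input=train.en --model_prefix=spm_en --vocab_size=32000")

    # Encode the raw text into space-separated subword pieces, the format
    # OpenNMT-tf expects for pre-tokenized training files.
    sp = spm.SentencePieceProcessor()
    sp.Load("spm_en.model")
    with open("train.en", encoding="utf-8") as fin, \
         open("bpe_train.en", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")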

Correct me if I’m wrong,

But my understanding is that you are using two different tokenizers to produce two different models. In that case you are bound to get completely different results from the BLEU score calculated inside OpenNMT-tf, because that score is computed on the tokens themselves. You can see drastic differences between the two results (just like you are right now) because the tokens are completely different. You simply can’t compare those results… it’s like comparing apples and bananas, or, more exactly, two different kinds of apples.

The only way to compare the performance of both tokenizers with the BLEU metric is:
1 - train a model for each tokenizer
2 - translate your test set with both models and use the appropriate tokenizer to detokenize the results
3 - calculate the BLEU score from both results

Done this way, you are comparing sentence structures in exactly the same way, not through different forms of tokenization.
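A minimal sketch of steps 2 and 3 using the sentencepiece and sacrebleu Python packages (all file names are placeholders, and the t2t output is assumed to have already been detokenized with its own vocabulary):

    import sacrebleu
    import sentencepiece as spm

    # Step 2: detokenize the SentencePiece system output back to plain text.
    sp = spm.SentencePieceProcessor()
    sp.Load("spm_zh.model")
    with open("spm_output.tok.zh", encoding="utf-8") as f:
        spm_hyps = [sp.DecodePieces(line.split()) for line in f]

    # The t2t output is assumed to be detokenized with the t2t encoder already.
    with open("t2t_output.detok.zh", encoding="utf-8") as f:
        t2t_hyps = [line.strip() for line in f]

    # Untokenized reference translations.
    with open("test.zh", encoding="utf-8") as f:
        refs = [line.strip() for line in f]

    # Step 3: BLEU on detokenized text, so both systems are scored identically.
    print("spm model:", sacrebleu.corpus_bleu(spm_hyps, [refs], tokenize="zh").score)
    print("t2t model:", sacrebleu.corpus_bleu(t2t_hyps, [refs], tokenize="zh").score)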

If it’s still not clear, here is a quick example:

The segment: Hello

Tokenizer 1: H e l l o (each letter is considered 1 token)
Tokenizer 2: Hello (the full word is one token)

If we calculate the BLEU score:
With tokenizer 1, if the model generates (H e l a) you will still get a pretty decent BLEU score.
With tokenizer 2, if the model outputs the wrong word you get 0, but if it’s the correct one you get 100.
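A rough way to see this numerically with sacrebleu, scoring the same mistake on coarse (word) tokens and on fine-grained (character) tokens. The sentences are toy examples, purely illustrative:

    import sacrebleu

    ref = "the cat sat on the mat"
    hyp = "the cat sat on a mat"  # one wrong word

    # Coarse tokens (whole words), like tokenizer 2 above.
    word_bleu = sacrebleu.sentence_bleu(hyp, [ref]).score

    # Fine-grained tokens (single characters), like tokenizer 1 above.
    ref_chars = " ".join(ref.replace(" ", ""))
    hyp_chars = " ".join(hyp.replace(" ", ""))
    char_bleu = sacrebleu.sentence_bleu(hyp_chars, [ref_chars]).score

    # The fine-grained score tends to be higher for the same error,
    # because many short n-grams still match.
    print(word_bleu, char_bleu)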

Hope I understood your question correctly,
Samuel


Thank you so much for your reply!
I calculated the BLEU score after detokenizing the output of the two models. The BLEU score of the model with the spm tokenizer is similar to that of the model with the t2t tokenizer.
Then I counted the number of subtokens for the same sentences, once with the spm tokenizer and once with the t2t tokenizer. The result matches your answer exactly: for the target language, Chinese, the t2t tokenizer splits sentences into smaller pieces than the spm tokenizer. I guess this is why the model with the t2t tokenizer gets much higher BLEU scores than the model with the spm tokenizer, especially when OpenNMT automatically calculates BLEU on subtokens.
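The subtoken counting itself is straightforward on the pre-tokenized target files, since both are space-separated pieces (t2t_test.zh here is a placeholder name for the tensor2tensor-tokenized test set):

    # Compare average subtokens per sentence for the same target-side test set,
    # once tokenized with SentencePiece and once with the t2t subword tokenizer.
    def avg_tokens_per_line(path):
        with open(path, encoding="utf-8") as f:
            lines = [line.split() for line in f if line.strip()]
        return sum(len(tokens) for tokens in lines) / len(lines)

    print("spm:", avg_tokens_per_line("bpe_test.zh"))
    print("t2t:", avg_tokens_per_line("t2t_test.zh"))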