score_batch vs. the normalized score from translate_batch


I would like to get the same score that “translate_batch” returns with the normalize option enabled, but using “score_batch”. I’m not sure where or how it is computed. “score_batch” seems to return only the score of each individual token.

I use a splitter to combine a few sentences and improve the translation context, but this means the predict score returned covers all the sentences bundled together. I want to use “score_batch” to get the predict score of each individual sentence.

Best regards,


You just need to average the token level scores. For example:

>>> import ctranslate2
>>> translator = ctranslate2.Translator("ende_transformer")
>>> source_tokens = ["▁H", "ello", "▁world", "!"]
>>> results = translator.translate_batch([source_tokens], normalize_scores=True, return_scores=True)
>>> results[0].scores[0]
>>> target_tokens = results[0].hypotheses[0]
>>> scores = translator.score_batch([source_tokens], [target_tokens])
>>> sum(scores[0]) / len(scores[0])

The score is not exactly equal but should be close enough. Make sure to use the latest version of CTranslate2 as there were some fixes related to the score correctness in some recent versions.
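To make the averaging step concrete, here is a minimal sketch that needs no model. The helper name normalized_score and the numbers are made up for illustration; the input is assumed to have the list-of-lists shape that score_batch returns in the session above.

```python
def normalized_score(token_scores):
    """Average token-level log probabilities into one sentence-level score."""
    if not token_scores:
        raise ValueError("expected at least one token score")
    return sum(token_scores) / len(token_scores)


# Hypothetical token-level scores for two sentences (not real model output).
example_scores = [
    [-0.5, -0.25, -0.75],
    [-1.0, -2.0],
]

sentence_scores = [normalized_score(scores) for scores in example_scores]
print(sentence_scores)
```

Applied to the output of score_batch, this should give one score per sentence, which is what the original question asks for.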


Hello Guillaume,

I did a test, and it seems to be working fine. I do sometimes see a bigger gap between the scores. I made a small table from my sample, for which I first call the translation function and then the scoring function. The last column is the absolute difference. At a quick glance, do these differences look as expected from your experience (especially the 0.1 difference)?

translate   score    abs diff
-0.318      -0.248   0.07
-0.388      -0.313   0.075
-0.296      -0.199   0.097
-0.089      -0.089   0
-0.249      -0.161   0.088
-0.403      -0.392   0.011
-0.393      -0.39    0.003
-0.253      -0.255   0.002
-0.272      -0.19    0.082
-0.282      -0.179   0.103
-0.398      -0.41    0.012
-0.223      -0.195   0.028
-0.393      -0.39    0.003
-0.645      -0.606   0.039
-0.398      -0.41    0.012
-0.36       -0.376   0.016
-0.393      -0.39    0.003

The results are good enough for my needs, but I want to make sure I don’t have a small glitch somewhere…

Best regards,

Hello Guillaume,

I have one more question. I noticed that I always get one more predict score than my number of target tokens. So if my target string has 20 tokens, the scoring function outputs 21 predict scores.

I thought it was supposed to be 1 for 1?

Could it be related to the SOS/EOS?

Best regards,

The returned value of score_batch is described in the documentation: Translator — CTranslate2 2.18.0 documentation. The extra token corresponds to the end of sentence token </s>.

Note that the score returned by translate_batch also includes the score of this end token.
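As a small illustration of that layout, the sketch below pairs each score with its token. The helper name, the tokens, and the score values are hypothetical, and it assumes the structure described here for the 2.18.0 documentation (one score per target token, followed by one for </s>); other versions may return a different structure.

```python
def pair_tokens_with_scores(target_tokens, token_scores, eos_token="</s>"):
    """Pair each score with its token, adding the end of sentence token last."""
    if len(token_scores) != len(target_tokens) + 1:
        raise ValueError("expected one score per target token plus one for the end token")
    return list(zip(target_tokens + [eos_token], token_scores))


# Hypothetical values for illustration only (not real model output).
target_tokens = ["▁Hallo", "▁Welt", "!"]
token_scores = [-0.1, -0.2, -0.3, -0.4]

pairs = pair_tokens_with_scores(target_tokens, token_scores)
for token, score in pairs:
    print(token, score)
```

Since translate_batch also counts the end token, the average should be taken over all of the returned scores, including the last one.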

We will check that. Are you translating/scoring a single example at a time or a batch of examples?

Thank you!

I passed a batch.

I used this small code to check the score difference, and it reports an average difference of 1.5e-7 which is quite small:

import ctranslate2

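def batch_iter(path, batch_size=32):
    # Yield batches of tokenized lines. This assumes each line of the file
    # is already tokenized, with tokens separated by spaces.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            if len(batch) == batch_size:
                yield batch
                batch = []
            batch.append(line.strip().split())

        if batch:
            yield batch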

translator = ctranslate2.Translator("ende_transformer")
score_batch_scores = []
trans_batch_scores = []

for source in batch_iter("input.txt"):
    results = translator.translate_batch(source, return_scores=True, normalize_scores=True)
    target = []
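    for result in results:
        # Keep the best hypothesis and its normalized score for comparison.
        target.append(result.hypotheses[0])
        trans_batch_scores.append(result.scores[0])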

    scores = translator.score_batch(source, target)
    for tokens_score in scores:
        score_batch_scores.append(sum(tokens_score) / len(tokens_score))

abs_diff = [abs(a - b) for a, b in zip(trans_batch_scores, score_batch_scores)]
print(sum(abs_diff) / len(abs_diff))

# output: 1.5065470300664154e-07

I’m using dataframes, so my code looks something like this:

df.loc[df.index, ['predictScore']] = evaluationDF['PredictScoreList'].apply(lambda x: sum(x) / len(x)).to_numpy()

Maybe it’s the “/” that rounds differently since it’s inside a lambda function?

But I agree with you that the issue seems to be on my side…!
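On the rounding question: as far as I know, Python performs the same float division whether the expression sits in a lambda or not, so the lambda alone should not explain a gap. A quick check, using made-up scores and assuming the list elements are plain Python floats:

```python
# Hypothetical token scores for illustration only.
token_scores = [-0.318, -0.248, -0.07]

# Division written directly.
direct = sum(token_scores) / len(token_scores)

# The same expression wrapped in a lambda, as in the dataframe code.
mean = lambda x: sum(x) / len(x)
via_lambda = mean(token_scores)

# Both paths run the same operations on the same values.
print(direct == via_lambda)
```

If the two values match on your data as well, the difference most likely comes from somewhere else, such as the inputs or the options passed to translate_batch.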

Thank you Guillaume,

I used your function and adapted it to use it with my dataframe instead of batch_iter.

I’m getting really good results, with a difference of about 1e-4 (not as good as yours, but still good).

After some testing, I figured out that it’s the parameters I used with translate_batch that cause the difference:
normalize_scores=True, num_hypotheses=1, beam_size=25, repetition_penalty=1.2

These have a significant impact on the score from translate_batch.

Best regards,

repetition_penalty (and other penalties) will change the translation score. This is the expected behavior.
