I would like to compute the same score that “translate_batch” returns with the normalize option enabled, but using “score_batch” instead. I’m not sure where or how it’s being computed: “score_batch” seems to return only the score of each individual token.
I’m combining a few sentences with a splitter to improve the translation context, but using a splitter means the returned prediction score represents all the sentences bulked together. I want to use “score_batch” to get the prediction score of each individual sentence.
You just need to average the token level scores. For example:
>>> import ctranslate2
>>> translator = ctranslate2.Translator("ende_transformer")
>>> source_tokens = ["▁H", "ello", "▁world", "!"]
>>> results = translator.translate_batch([source_tokens], normalize_scores=True, return_scores=True)
>>> target_tokens = results[0].hypotheses[0]
>>> scores = translator.score_batch([source_tokens], [target_tokens])[0]
>>> sum(scores) / len(scores)
The score is not exactly equal but should be close enough. Make sure to use the latest version of CTranslate2, as there were some fixes related to score correctness in recent versions.
I did a test and it seems to be working fine, though I do sometimes see a bigger gap in the scoring. I made a little table out of my sample, in which I first call the translation function and then the scoring function. The last column is the absolute difference. At a quick glance, do these differences seem expected from your experience? (Especially the 0.1 diff.)
The results are fair for my needs, but I want to make sure I don’t have a small glitch somewhere…
I have one more question. I noticed that I always get one more prediction score than my number of target tokens. So if my target string has 20 tokens, the scoring function outputs 21 prediction scores.
I thought it was supposed to be 1 for 1?
Could it be related to the SOS/EOS?
The returned value of score_batch is described in the documentation: Translator — CTranslate2 2.18.0 documentation. The extra token corresponds to the end-of-sentence token.
Note that the score returned by translate_batch also includes the score of this end token.
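To illustrate with made-up numbers: the list returned for each example has one entry per target token plus one entry for the end-of-sentence token, and the normalized score from translate_batch divides by that same length.

```python
# Hypothetical token-level scores for a 4-token target; the extra
# (last) entry is the score of the end-of-sentence token.
target_tokens = ["▁Hallo", "▁Welt", "▁", "!"]
token_scores = [-0.2, -0.1, -0.3, -0.15, -0.05]  # made-up values

# One more score than target tokens, because of </s>:
assert len(token_scores) == len(target_tokens) + 1

# The normalized translate_batch score also includes the end token,
# so the matching average divides by len(target_tokens) + 1:
average = sum(token_scores) / len(token_scores)
```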
We will check that. Are you translating/scoring a single example at a time or a batch of examples?
I used this small code to check the score difference, and it reports an average difference of 1.5e-7 which is quite small:
import ctranslate2

def batch_iter(path, batch_size=32):
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line.strip().split())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

translator = ctranslate2.Translator("ende_transformer")

score_batch_scores = []
trans_batch_scores = []

for source in batch_iter("input.txt"):
    results = translator.translate_batch(source, return_scores=True, normalize_scores=True)
    target = [result.hypotheses[0] for result in results]
    for result in results:
        trans_batch_scores.append(result.scores[0])
    scores = translator.score_batch(source, target)
    for tokens_score in scores:
        score_batch_scores.append(sum(tokens_score) / len(tokens_score))

abs_diff = [abs(a - b) for a, b in zip(trans_batch_scores, score_batch_scores)]
print(sum(abs_diff) / len(abs_diff))
# output: 1.5065470300664154e-07
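For what it’s worth, a residual difference of that order is consistent with the model computing in float32: even averaging the same values in float32 versus float64 leaves a tiny gap (a sketch with arbitrary values, not tied to any particular model).

```python
import numpy as np

# Averaging identical values in float32 vs float64 typically differs
# by a very small amount, on the same order as the gap observed
# between translate_batch and score_batch.
values = np.linspace(-1.0, -0.1, 100, dtype=np.float64)
avg64 = float(values.mean())
avg32 = float(values.astype(np.float32).mean())
diff = abs(avg64 - avg32)
```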
I’m using dataframes, so my code looks something like this:
df.loc[df.index, ['predictScore']] = evaluationDF['PredictScoreList'].apply(lambda x: sum(x) / len(x)).to_numpy()
Maybe it’s the “/” that has a different rounding since it’s in a lambda function?
But I agree with you that the issue seems to be on my side…!
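To rule out the lambda: division inside a lambda is the same floating-point operation as division in a plain expression, so it cannot introduce a different rounding. A trivial sanity check (no dataframe needed):

```python
scores = [-0.25, -0.5, -0.125, -0.3]  # arbitrary example values
plain = sum(scores) / len(scores)
in_lambda = (lambda x: sum(x) / len(x))(scores)
assert plain == in_lambda  # bit-for-bit identical
```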
Thank you Guillaume,
I used your function and adapted it to work with my dataframe instead of batch_iter.
I’m getting really good results, around 1e-4 (not as good as yours, but still good).
After some testing, I figured out that it’s the parameters I used with translate_batch that cause the difference:
normalize_scores=True, num_hypotheses=1, beam_size=25, repetition_penalty=1.2
These have a significant impact on the score from translate_batch.
repetition_penalty (and other penalties) will change the translation score. This is the expected behavior.
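For context: the repetition penalty follows the CTRL-style approach (Keskar et al., 2019), which, if I read it correctly, divides positive scores of already-generated tokens by the penalty and multiplies negative ones, so a repeated token always becomes less likely. Since score_batch applies no such penalty, the two scores diverge. A minimal sketch of the idea:

```python
def apply_repetition_penalty(score, penalty):
    # CTRL-style penalty: positive scores are divided by the penalty,
    # negative ones multiplied, pushing repeated tokens toward lower
    # probability in both cases.
    return score / penalty if score > 0 else score * penalty

# Both positive and negative scores move down:
assert apply_repetition_penalty(2.0, 1.2) < 2.0
assert apply_repetition_penalty(-2.0, 1.2) < -2.0
```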