Can we export each token's score from CTranslate2 instead of only the cumulative log probability? Based on some experiments with other Transformer models, higher precision can be achieved by using the mean log probability and tuning a threshold on it.
I’m not sure I understand. If you need the mean log probability, couldn't you just divide the cumulative log probability by the number of tokens?
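As a rough sketch of that division, assuming you already have the cumulative score and the generated tokens for a hypothesis (the numbers below are made up, and `mean_log_prob` is a hypothetical helper, not a CTranslate2 function):

```python
def mean_log_prob(cumulative_log_prob: float, num_tokens: int) -> float:
    """Length-normalize a cumulative log probability."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return cumulative_log_prob / num_tokens


# Illustrative per-token log probabilities (not real model output).
token_log_probs = [-0.1, -0.4, -0.2, -0.3]

# The cumulative score is the sum of the per-token log probabilities.
cumulative = sum(token_log_probs)

# Dividing by the token count recovers the mean.
mean = mean_log_prob(cumulative, len(token_log_probs))
print(f"cumulative={cumulative:.2f} mean={mean:.2f}")
```

Note that the token count has to match what was summed: if the cumulative score includes the EOS token, the divisor should count it too.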
Ah. That’s right. Another question is whether we have included the eos token probability score in the final cumulative probability score.
Good question. I found that we are a bit inconsistent here because the EOS score is included when running beam search (beam_size > 1), but not when running greedy search (beam_size = 1).
Probably we should always include the EOS score since it is actually part of the generated sequence. Any thoughts?
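To make the difference concrete, here is a small sketch of how including or excluding the EOS score changes the result. The values are invented and `sequence_score` is a hypothetical helper for illustration only; it does not reflect how CTranslate2 computes scores internally.

```python
from typing import Sequence


def sequence_score(
    token_log_probs: Sequence[float],
    eos_log_prob: float,
    include_eos: bool = True,
    normalize: bool = False,
) -> float:
    """Aggregate per-token log probabilities, optionally counting EOS."""
    scores = list(token_log_probs)
    if include_eos:
        scores.append(eos_log_prob)
    if not scores:
        raise ValueError("no scores to aggregate")
    total = sum(scores)
    # When normalizing, divide by the number of scores actually summed.
    return total / len(scores) if normalize else total


# Illustrative values (not real model output).
tokens = [-0.1, -0.4, -0.2, -0.3]
eos = -0.5

with_eos = sequence_score(tokens, eos, include_eos=True)
without_eos = sequence_score(tokens, eos, include_eos=False)
mean_with_eos = sequence_score(tokens, eos, include_eos=True, normalize=True)
print(with_eos, without_eos, mean_with_eos)
```

With these numbers the two conventions differ by exactly the EOS log probability, so scores from the two decoding modes are not comparable unless the same convention is applied to both.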
Different text generation applications might behave differently depending on that choice. Would it be easier for users if there were an option to return all token scores, so that they can decide how to compute the final score themselves?
We can possibly return the token scores, but I’m not sure they are frequently used. In the meantime, I made sure that the returned score is consistent between the beam and greedy decodings:
I think token scores would be useful to squeeze out the best performance when working with a score threshold. It turns out there is a big difference between using the mean and the cumulative score. Currently both fairseq and Hugging Face can export each token score along with one aggregated score.
What exactly do you mean by score threshold? Can you refer to a paper?
In general, the higher the log probability score, the higher the precision the model can achieve. If you keep only the outputs whose score is above a threshold, you trade recall for precision, just like the precision-recall plot in binary classification.
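A minimal sketch of that kind of thresholding, assuming each output has a score and a correctness label from some evaluation set. The data is made up and `precision_recall_at` is a hypothetical helper; treating precision as 1.0 when nothing is kept is just one possible convention.

```python
from typing import List, Tuple


def precision_recall_at(
    threshold: float, scored: List[Tuple[float, bool]]
) -> Tuple[float, float]:
    """Keep outputs with score >= threshold and measure precision/recall."""
    kept = [ok for score, ok in scored if score >= threshold]
    total_correct = sum(ok for _, ok in scored)
    # Assumption: precision is defined as 1.0 when no output is kept.
    precision = sum(kept) / len(kept) if kept else 1.0
    recall = sum(kept) / total_correct if total_correct else 0.0
    return precision, recall


# (mean log probability, output judged correct?) with illustrative values.
scored = [(-0.1, True), (-0.3, True), (-0.5, False), (-0.9, True), (-1.4, False)]

strict = precision_recall_at(-0.4, scored)
loose = precision_recall_at(-1.0, scored)
print("strict:", strict)
print("loose:", loose)
```

Sweeping the threshold over the observed scores gives the precision-recall curve, and the same sweep can be repeated with mean and cumulative scores to compare them.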
I am curious about the same thing. Is there any way we could get scores for every token from CTranslate2? That would really help.