Different BLEU Score Results

My model was trained for 95,000 steps. I evaluated it on my test data and measured the BLEU score with this command:

perl tools/multi-bleu.perl data/test.tgt.txt < pred.txt

I get this result: BLEU = 12.31, 40.2/21.6/16.1/14.1 (BP=0.584, ratio=0.650, hyp_len=206799, ref_len=318165)
i.e. BLEU-4 score = 14.1

But when I calculate the BLEU-4 score with the NLTK module, using this code:

from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction

smoothie = SmoothingFunction().method4
f1 = open('pred.txt')
lines = f1.readlines()
f2 = open('data/test.tgt.txt')
lines2 = f2.readlines()
s = 0
for i in range(len(lines)):
    s += bleu([lines[i]], lines2[i], smoothing_function=smoothie)
print(s / len(lines))

I get a BLEU-4 score of 0.32130239102279534!
What is the reason for this difference, and which score is correct?

BLEU is a corpus-level metric. You can't average it over the number of sentences.
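To see the difference, NLTK itself exposes a corpus-level corpus_bleu, which pools n-gram counts over all sentences instead of averaging per-sentence scores. A toy sketch (the example sentences are made up for illustration):

```python
# Contrast per-sentence averaging with corpus-level BLEU in NLTK.
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

smoothie = SmoothingFunction().method4
refs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "is", "in", "the", "garden"]]
hyps = [["the", "cat", "sat", "on", "the", "mat"],
        ["there", "is", "a", "dog", "in", "the", "garden"]]

# Averaging smoothed sentence-level scores -- not how BLEU is defined.
avg = sum(sentence_bleu([r], h, smoothing_function=smoothie)
          for r, h in zip(refs, hyps)) / len(hyps)

# Corpus-level: n-gram statistics are pooled over the whole test set
# in a single call (each hypothesis gets a list of references).
corpus = corpus_bleu([[r] for r in refs], hyps)

print(avg, corpus)  # the two numbers generally differ
```

Note also that corpus_bleu/sentence_bleu expect tokenized sentences (lists of tokens), whereas the code in the question passes raw strings.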

If you want to compute the BLEU score in Python, I suggest using sacreBLEU on detokenized outputs: