Good PRED AVG SCORE and PPL, but poor BLEU score

I trained my modified Transformer model (modified only in the attention layer) on the EN-DE translation task.
During evaluation, I ran "python -gpu 0 -model mymodel/ -src data/wmt_ende_sp/test.en -tgt data/wmt_ende_sp/ -replace_unk -verbose -output test_pred".
The reported scores, “PRED AVG SCORE: -1.3670, PRED PPL: 3.9237; GOLD AVG SCORE: -1.7637, GOLD PPL: 5.8337”, seemed pretty good.

But when I computed the BLEU score for the same architecture by running “perl tools/multi-bleu.perl data/wmt_ende_sp/ < test_pred”,
I got “BLEU = 0.24, 7.5/0.6/0.1/0.0 (BP=1.000, ratio=1.694, hyp_len=141880, ref_len=83752)”. This seems pretty strange, since the evaluation results look reasonable.
Has anybody ever encountered the same problem?
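
The length fields in that BLEU line can be checked by hand. Below is a minimal sketch, assuming the usual BLEU definitions (ratio = hyp_len / ref_len, and a brevity penalty that only applies when the hypothesis is shorter than the reference). The function name length_stats is made up for illustration and is not part of multi-bleu.perl.

```python
import math

def length_stats(hyp_len, ref_len):
    """Return (length ratio, brevity penalty) under the standard BLEU definition."""
    ratio = hyp_len / ref_len
    # The penalty only kicks in when the hypothesis is shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return ratio, bp

# Token counts copied from the BLEU line in the post above.
ratio, bp = length_stats(141880, 83752)
print(f"ratio={ratio:.3f} BP={bp:.3f}")
```

A ratio of about 1.69 means the prediction file has roughly 69% more tokens than the reference. That could point to a tokenization mismatch between the two files, or to files that do not line up sentence by sentence, rather than to a bad model. This is only a guess, since the thread does not confirm the cause.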

Maybe you made a mistake when preparing the training data? Hard to tell without more information.

Thanks… I trained another model and it works fine… weird.

What does PRED AVG SCORE mean?
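
Judging from the numbers in the first post, PRED AVG SCORE looks like the average log-probability per predicted token (natural log), and PRED PPL is the exponential of its negation. GOLD AVG SCORE would be the same quantity computed on the reference translation. Here is a minimal sketch to check that reading against the reported values. The helper name ppl_from_avg_score is hypothetical and not taken from the toolkit.

```python
import math

def ppl_from_avg_score(avg_score):
    """Perplexity implied by an average per-token log-probability (natural log)."""
    return math.exp(-avg_score)

# Average scores copied from the first post.
pred_ppl = ppl_from_avg_score(-1.3670)
gold_ppl = ppl_from_avg_score(-1.7637)
print(f"PRED PPL ~ {pred_ppl:.4f}, GOLD PPL ~ {gold_ppl:.4f}")
```

Both values land very close to the reported 3.9237 and 5.8337. The small differences are consistent with the average scores being rounded to four decimals in the log. Note that these scores only measure how confident the model is in its own output, so they can look fine even when BLEU is poor.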