I trained a Transformer model, modified only in the attention layer, on the EN-DE translation task.
For evaluation, I ran "python translate.py -gpu 0 -model mymodel/my_transformer_step_200000.pt -src data/wmt_ende_sp/test.en -tgt data/wmt_ende_sp/test.de -replace_unk -verbose -output test_pred".
The reported scores, "PRED AVG SCORE: -1.3670, PRED PPL: 3.9237; GOLD AVG SCORE: -1.7637, GOLD PPL: 5.8337", seemed pretty good.
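As a sanity check on those numbers, here is a minimal sketch of how the two quantities seem to relate. It assumes the average score is a per-token log-probability in natural log, so that perplexity is exp(-score). The function name ppl_from_avg_score is my own and not part of translate.py.

```python
import math


def ppl_from_avg_score(avg_score: float) -> float:
    """Perplexity implied by an average per-token log-probability (natural log)."""
    return math.exp(-avg_score)


# Scores copied from the translate.py output quoted above.
pred_ppl = ppl_from_avg_score(-1.3670)
gold_ppl = ppl_from_avg_score(-1.7637)

print(f"PRED PPL ~ {pred_ppl:.4f}")
print(f"GOLD PPL ~ {gold_ppl:.4f}")
```

Under that assumption the results land very close to the reported 3.9237 and 5.8337, so the perplexities and average scores are at least consistent with each other. Both are computed from model probabilities only, not from the decoded text.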
But when I computed the BLEU score for the same model by running "perl tools/multi-bleu.perl data/wmt_ende_sp/test.de < test_pred",
I got "BLEU = 0.24, 7.5/0.6/0.1/0.0 (BP=1.000, ratio=1.694, hyp_len=141880, ref_len=83752)". This seems pretty strange, since the evaluation scores above look reasonable.
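To read the numbers in parentheses, here is a small sketch of the length ratio and brevity penalty using the standard BLEU definition (BP is 1 when the hypothesis is longer than the reference, otherwise exp(1 - ref_len / hyp_len)). The helper name brevity_penalty is my own. This is not the code from multi-bleu.perl, only a re-computation from the lengths it printed.

```python
import math


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """Standard BLEU brevity penalty: only short hypotheses are penalized."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Token counts copied from the multi-bleu.perl output quoted above.
hyp_len, ref_len = 141880, 83752

ratio = hyp_len / ref_len
bp = brevity_penalty(hyp_len, ref_len)

print(f"ratio = {ratio:.3f}, BP = {bp:.3f}")
```

The hypotheses contain roughly 1.7 times as many tokens as the references, so no brevity penalty applies and the low score comes entirely from the n-gram precisions (7.5/0.6/0.1/0.0).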
Has anybody ever encountered the same problem?