About summarization ROUGE score

DaisyTung · June 11, 2021, 6:37pm

Hi, I am using OpenNMT-py for a summarization task on a Chinese dataset and an English dataset.

I use rouge (A full Python library for the ROUGE metric) to evaluate my results.
The results on the Chinese dataset look normal, close to what others report.
However, the results on the English Gigaword dataset are poor, much worse than what others report. (I have tried both BiGRU and Transformer; both get about 30-31 Rouge-1-F, while others get about 33-35 Rouge-1-F.)
I also tried evaluating with pyrouge, but I still get similarly poor scores.

So I downloaded the pretrained model from the OpenNMT docs (the Summarization page of the OpenNMT-py documentation) to check whether it produces the expected Rouge score. (I used the Gigaword model.)
But I still get a poor score, about 32 Rouge-1-F, while the docs say this model reaches 35.51 Rouge-1-F.

Can someone tell me why my evaluation scores are always worse than others'? Is there some step I need to do before evaluating English summarization that I don't know about (such as stemming)?
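To show the kind of preprocessing difference I mean, here is a minimal sketch in plain Python. It is not the code from rouge or pyrouge; `rouge1_f` and `normalize` are hypothetical helpers I wrote for illustration, and the two sentences are made-up examples, not Gigaword data. It only shows that Rouge-1-F for the same pair of sentences can change depending on casing and punctuation handling.

```python
import re
from collections import Counter


def rouge1_f(hypothesis_tokens, reference_tokens):
    """Rouge-1 F1: harmonic mean of unigram precision and recall."""
    if not hypothesis_tokens or not reference_tokens:
        return 0.0
    hyp_counts = Counter(hypothesis_tokens)
    ref_counts = Counter(reference_tokens)
    # Clipped unigram overlap: each token counts at most as often
    # as it appears in both sequences.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def normalize(text):
    # Hypothetical normalization (an assumption, not what any specific
    # tool does): lowercase and keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up example pair that differs only in casing and a final period.
reference = "Police arrest two suspects in London bank robbery ."
hypothesis = "police arrest two suspects in london bank robbery"

# Scoring the raw whitespace-split tokens penalizes the casing mismatch.
raw_score = rouge1_f(hypothesis.split(), reference.split())

# Scoring after normalization treats the two sentences as identical.
normalized_score = rouge1_f(normalize(hypothesis), normalize(reference))

print(f"raw: {raw_score:.4f}  normalized: {normalized_score:.4f}")
```

If the reported scores were computed with a different normalization (or with stemming) than mine, a gap of a few points might come from that rather than from the model, but I am not sure this is the cause.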

I am not good at English, so if my description is not clear, please let me know.
Thank you for your time.