Text Summarization on Gigaword and ROUGE Scoring

I had the same problem in OpenNMT-py. If you have a GPU, you could try my trained model; it’s here.

I tried to release the model so it could be used on both GPU and CPU systems, but couldn’t find the release_model code.


I downloaded your model and copy-pasted the sentences you wrote. I’m not sure whether you are actually running translation on those exact sentences, but you shouldn’t: the text must be tokenized first, e.g. lowercased (see “Tanya” in the first sentence) and with punctuation separated (“russia . giunter” instead of “russia. giunter”).

You may think this is just a detail, but in fact, simply by editing the input I got:

SENT 1: ('british', 'intelligence', 'sources', 'report', 'that', 'the', 'group', 'of', 'approximately', 'five', 'somali', 'pirates', 'who', 'have', 'captured', 'the', 'mv', 'tanya', 'off', 'the', 'somalian', 'coast', 'call', 'themselves', 'the', 'waterways', 'protection', 'regional', 'guard', 'sources', 'confirmed', 'that', 'diamonds', 'were', 'shipped', 'from', 'yemen', 'to', 'moscow', 'by', 'georgiy', 'giunter', 'on', 'december')
PRED 1: somali pirates send diamonds to moscow
PRED SCORE: -6.8853
SENT 2: ('giunter', 'is', 'a', 'dealer', 'in', 'jewelry', 'and', 'precious', 'stones', 'who', 'does', 'business', 'in', 'the', 'middle', 'east', 'and', 'russia', '.', 'giunter', 'is', 'a', 'money', 'launderer', 'in', 'addition', 'to', 'his', 'legitimate', 'gemstone', 'work', '.')
PRED 2: the <unk> of <unk>

We can see that the second prediction is pretty bad. My guess is that your input here contains two sentences. That case does not occur in the Gigaword training inputs, so it makes sense that the model struggles, doesn’t it?
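For reference, here is a rough sketch of the preprocessing I mean, just lowercasing and putting spaces around punctuation (the file names are placeholders; a real tokenizer, e.g. tools/tokenize.lua from the Lua repo, does this more carefully):

tr '[:upper:]' '[:lower:]' < input_raw.txt | sed -e 's/\([.,!?()]\)/ \1 /g' -e 's/  */ /g' > input_tok.txt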


Is there such an option in the Torch version to force a minimum output length?
Even in the provided giga_0_pred.txt there are some short or empty sentences. I tested on the DUC2004 task 1 dataset and the results are also bad. I also found that the larger the beam size, the shorter the average length of the output, which hurts the ROUGE score (a recall-oriented metric).

It has been implemented in OpenNMT-py since #496.

I’m not sure about the Lua version; I haven’t found anything like this in the translate.lua options, so probably not.
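With a recent OpenNMT-py checkout it should be something along these lines (paths and values here are only illustrative):

python translate.py -model textsum_model.pt -src input_tok.txt -output pred.txt -beam_size 5 -min_length 8 -replace_unk -verbose

where -min_length forces at least that many tokens in each generated summary.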

Yeah, I saw that thread as well… but I guess the model needs to be trained again using the Python script? Or can translate.py take the current model as an argument and make predictions?

You don’t have to train again specifically for this feature.


There may be some incompatibility between old models and the current state (depending on how old your model is), but not from this change.

Actually, I was asking whether the translate.py script can take a model produced by train.lua, since I haven’t used the Python version.

I just tested and it certainly doesn’t work.

Oh, no. Transferring a model from Lua Torch to PyTorch isn’t possible.

@SinaMohseni thanks, man! Your model works. Is there any way to expand the result? For example:

input:
the sense of smell , as marcel proust and his madeleine made clear , is intimately tied to feeling and memory , so it is perhaps not surprising that in schizophrenia , an illness that plays havoc with the emotional capacities of those who suffer from it , the sense of smell is impaired .

result:
the sense of smell

I think the result would be better if I could make it longer.

Thanks for taking a look! You are right, I didn’t tokenize it before.

I should say you are trying a super long and complicated sentence; maybe start with something shorter?

As @pltrdy replied to me, make sure to tokenize the input. Also, your input should be in a single-sentence format. Take a look at this issue for more information.
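If your source text is a paragraph, a crude way to get one sentence per line before tokenizing is something like the following (with GNU sed; the file names are placeholders, and a proper sentence splitter will handle abbreviations better):

sed 's/\. /.\n/g' article_raw.txt > article_sents.txt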


I am running Windows 10 and have implemented all the steps, but while executing the first command:

python preprocess.py -train_src …/data/train/train.article.txt -train_tgt …/data/train/train.title.txt -valid_src …/data/train/valid.article.filter.txt -valid_tgt …/data/train/valid.title.filter.txt -save_data …/data/train/textsum

I am getting an error at init dtype. What’s wrong? Can anyone suggest a better way?

Hi, I am new to OpenNMT. I am using the Gigaword dataset which you recommended. When I look at the data file input.txt I am confused: there are many tokens and I don’t understand how they got there. The data seems to have been preprocessed; if so, could you tell me how the tokens are generated during data processing?