Text Summarization on Gigaword and ROUGE Scoring

I had the same problem in OpenNMT-py. If you have a GPU, you could try my trained model; it’s here.

I tried to release the model so it could be used on both GPU and CPU systems, but couldn’t find the release_model code.


I downloaded your model and copy-pasted the sentences you wrote. I’m not sure whether you are actually running translation on those exact sentences, but you shouldn’t: the text must be tokenized first, e.g. lowercased (see “Tanya” in the first sentence) and with punctuation separated (“russia . giunter” instead of “russia. giunter”).

You may think this is just a detail, but in fact, simply by editing the input I got:

SENT 1: ('british', 'intelligence', 'sources', 'report', 'that', 'the', 'group', 'of', 'approximately', 'five', 'somali', 'pirates', 'who', 'have', 'captured', 'the', 'mv', 'tanya', 'off', 'the', 'somalian', 'coast', 'call', 'themselves', 'the', 'waterways', 'protection', 'regional', 'guard', 'sources', 'confirmed', 'that', 'diamonds', 'were', 'shipped', 'from', 'yemen', 'to', 'moscow', 'by', 'georgiy', 'giunter', 'on', 'december')
PRED 1: somali pirates send diamonds to moscow
PRED SCORE: -6.8853
SENT 2: ('giunter', 'is', 'a', 'dealer', 'in', 'jewelry', 'and', 'precious', 'stones', 'who', 'does', 'business', 'in', 'the', 'middle', 'east', 'and', 'russia', '.', 'giunter', 'is', 'a', 'money', 'launderer', 'in', 'addition', 'to', 'his', 'legitimate', 'gemstone', 'work', '.')
PRED 2: the <unk> of <unk>

We can see that the second prediction is pretty bad. My guess is that your input here contains two sentences. That case does not occur in the Gigaword training inputs, so it makes sense that the model struggles, doesn’t it?
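For reference, here is a rough sketch of the preprocessing I mean, just lowercasing and putting spaces around punctuation (the file names are placeholders; a real tokenizer, e.g. tools/tokenize.lua from the Lua repo, does this more carefully):

tr '[:upper:]' '[:lower:]' < input_raw.txt | sed -e 's/\([.,!?()]\)/ \1 /g' -e 's/  */ /g' > input_tok.txt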


Is there such an option in the Torch version to force a minimum output length?
Even in the provided giga_0_pred.txt there are some short or empty sentences. I tested on the DUC2004 task 1 dataset and the results are also bad. I also found that the larger the beam size, the shorter the average length of the output, which hurts the ROUGE score (a recall-oriented metric).

It has been implemented in OpenNMT-py since #496.

I’m not sure about the Lua version; I haven’t found anything like this in the translate.lua options, so probably not.
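With a recent OpenNMT-py checkout it should be something along these lines (paths and values here are only illustrative):

python translate.py -model textsum_model.pt -src input_tok.txt -output pred.txt -beam_size 5 -min_length 8 -replace_unk -verbose

where -min_length forces at least that many tokens in each generated summary.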

Yeah, I saw that thread as well… but I guess the model needs to be trained again using the Python script? Or can translate.py take the current model as an argument and make predictions?

You don’t have to train again specifically for this feature.


There may be some incompatibility between old models and the current state (depending on how old your model is), but not from this change.

Actually, I was asking whether the translate.py script can take a model produced by train.lua, since I haven’t used the Python version.

I just tested and it certainly doesn’t work.

Oh, no. Transferring a model from Lua Torch to PyTorch isn’t possible.

@SinaMohseni thanks, man! Your model works. Is there any way to expand the result? For example:

input:
the sense of smell , as marcel proust and his madeleine made clear , is intimately tied to feeling and memory , so it is perhaps not surprising that in schizophrenia , an illness that plays havoc with the emotional capacities of those who suffer from it , the sense of smell is impaired .

result:
the sense of smell

I think the result would be better if I could make it longer.

Thanks for taking a look! You are right, I didn’t tokenize it before.

I should say you are trying a super long and complicated sentence; maybe start with something shorter?

As @pltrdy replied to me, make sure to tokenize the input. Also, your input should be in a single-sentence format. Take a look at this issue for more information.
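If your source text is a paragraph, a crude way to get one sentence per line before tokenizing is something like the following (with GNU sed; the file names are placeholders, and a proper sentence splitter will handle abbreviations better):

sed 's/\. /.\n/g' article_raw.txt > article_sents.txt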


I am running Windows 10 and have implemented all the steps, but while executing the first command:

python preprocess.py -train_src …/data/train/train.article.txt -train_tgt …/data/train/train.title.txt -valid_src …/data/train/valid.article.filter.txt -valid_tgt …/data/train/valid.title.filter.txt -save_data …/data/train/textsum

I am getting an error at init dtype. What’s wrong? Can anyone suggest a better way?

Hi, I am new to OpenNMT. I am using the Gigaword dataset which you recommended. When I look at the data file input.txt I am confused: there are many tokens and I don’t understand how they got there. The data seems to have been preprocessed; if so, could you tell me how the tokens are generated during data processing?