Text Summarization on Gigaword and ROUGE Scoring

pltrdy · August 30, 2017, 8:46am

Indeed. I just pushed new commits in both pythonrouge and files2rouge.
You must then pull & run setup.py again for both.

files2rouge now works for me on Python 2.7.

vikash · August 31, 2017, 9:34am

I am using the pre-trained model provided by twang, and I am having trouble running it.

My Ubuntu 16.04 server doesn’t have a GPU.

Issue:

$ python translate.py -model textsum_acc_51.38_ppl_12.59_e13.pt -src …/sumdata/Giga/input.txt

Traceback (most recent call last):

    ImportError: No module named Dict

I installed a library called “dict” (I couldn’t find any library named “Dict”), but that didn’t solve the problem.

Python version: 2.7.12

torch versions:
torch==0.2.0.post3
torchtext==0.2.0a0

I am new to Python and not able to crack this. Any leads?

pltrdy · September 1, 2017, 3:27pm

I must say I have never used OpenNMT-py with Python 2.7.

I would first recommend trying Python 3.x.

loretoparisi · December 1, 2017, 8:52am

There is now a TensorFlow version: OpenNMT-tf: a new alternative
It would be interesting to provide a tutorial using this alternative version.

SinaMohseni · December 19, 2017, 8:33pm

Hi all,

Thanks for the helpful instructions. I trained the model using a GPU, and it works fine on the sample data in the Giga folder. However, for any other articles that I try, it either generates no output or just a word or two. Any suggestions?

I’m trying these two short articles:

british intelligence sources report that the group of approximately five somali pirates who have captured the mv Tanya off the somalian coast call themselves the waterways protection regional guard
sources confirmed that diamonds were shipped from yemen to moscow by georgiy giunter on december.

Output: protection is regional guard protection regional guard

giunter is a dealer in jewelry and precious stones who does business in the middle east and russia. giunter is a money launderer in addition to his legitimate gemstone work.

Output: sent from yemen to moscow

Ifad · December 31, 2017, 10:38am

I tried to generate a translation with OpenNMT-py using @twang’s pre-trained model:

python3 translate.py -model textsum_acc_51.38_ppl_12.59_e13.pt -src ../data/bitcoin-tosum.txt

However, I got this result:

Traceback (most recent call last):
  File "translate.py", line 116, in <module>
    main()
  File "translate.py", line 39, in main
    onmt.ModelConstructor.load_test_model(opt, dummy_opt.__dict__)
  File "/Users/ifadardin/Documents/Python/OpenNMT-py-master/onmt/ModelConstructor.py", line 114, in load_test_model
    map_location=lambda storage, loc: storage)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/torch/serialization.py", line 261, in load
    return _load(f, map_location, pickle_module)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/torch/serialization.py", line 409, in _load
    result = unpickler.load()
ImportError: No module named 'onmt.Dict'

I checked on GitHub, and it might be because onmt.Dict was removed in last summer’s update. Is there any workaround?

pltrdy · January 2, 2018, 9:09am

@SinaMohseni It’s not always easy to debug a model’s behavior. Your case may be related to https://github.com/OpenNMT/OpenNMT-py/issues/457 i.e. we sometimes need to force a minimum output size, otherwise decoding stops too early. If that does not help, I would suggest opening an issue.

@Ifad As you said, it is occurring because the model was trained with another OpenNMT-py version. It makes sense to open an issue for this.
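
To illustrate why a checkpoint saved with an older code base can fail this way: pickle stores objects by module path and class name, so loading fails when that module no longer exists. Below is a minimal sketch using only the standard library. The module name legacy_vocab and its Dict class are hypothetical stand-ins, not the real onmt.Dict, and whether re-registering a stub module is enough to load a real checkpoint is untested here.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a module that existed when the model was saved.
legacy = types.ModuleType("legacy_vocab")


class Dict(object):
    def __init__(self, words):
        self.words = words


# Make the class look as if it had been defined inside legacy_vocab.
Dict.__module__ = "legacy_vocab"
Dict.__qualname__ = "Dict"
legacy.Dict = Dict
sys.modules["legacy_vocab"] = legacy

# The pickle records only "legacy_vocab.Dict" plus the instance attributes.
payload = pickle.dumps(Dict(["a", "b"]))

# Simulate upgrading to a version where the module was removed.
del sys.modules["legacy_vocab"]

load_error = None
try:
    pickle.loads(payload)
except ImportError as exc:
    # Same family of error as "No module named 'onmt.Dict'".
    load_error = exc

# Making a module with the old name importable again lets the load succeed.
sys.modules["legacy_vocab"] = legacy
restored = pickle.loads(payload)
print(type(load_error).__name__, restored.words)
```

The practical options are therefore either to use the code version the model was trained with, or to retrain with the current one.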

SinaMohseni · January 7, 2018, 7:27pm

I had the same problem in OpenNMT-py. If you have a GPU, you could try my trained model. It’s here.

I tried to release the model so that it can be used on both GPU and CPU systems, but I couldn’t find the release_model code.

pltrdy · January 8, 2018, 9:54am

I downloaded your model and copy-pasted the sentences you wrote. I’m not sure whether you are actually running the translation on those exact sentences, but you shouldn’t. The input must be tokenized first, e.g. lowercased (see “Tanya” in the first sentence) and with punctuation separated (see “russia. giunter” instead of “russia . giunter”).

You may think that it’s just a detail, but in fact, just by editing the input I got:

SENT 1: ('british', 'intelligence', 'sources', 'report', 'that', 'the', 'group', 'of', 'approximately', 'five', 'somali', 'pirates', 'who', 'have', 'captured', 'the', 'mv', 'tanya', 'off', 'the', 'somalian', 'coast', 'call', 'themselves', 'the', 'waterways', 'protection', 'regional', 'guard', 'sources', 'confirmed', 'that', 'diamonds', 'were', 'shipped', 'from', 'yemen', 'to', 'moscow', 'by', 'georgiy', 'giunter', 'on', 'december')
PRED 1: somali pirates send diamonds to moscow
PRED SCORE: -6.8853
SENT 2: ('giunter', 'is', 'a', 'dealer', 'in', 'jewelry', 'and', 'precious', 'stones', 'who', 'does', 'business', 'in', 'the', 'middle', 'east', 'and', 'russia', '.', 'giunter', 'is', 'a', 'money', 'launderer', 'in', 'addition', 'to', 'his', 'legitimate', 'gemstone', 'work', '.')
PRED 2: the <unk> of <unk>

We can see that the second prediction is pretty bad. My guess is that this is because your input here is two sentences. That case does not occur in the Gigaword training inputs, so it makes sense that the model struggles, doesn’t it?
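
The input clean-up described above (lowercasing and separating punctuation) can be sketched as follows. This is only a rough approximation, not the script that was used to build the Gigaword data; the function name simple_tokenize is made up for illustration, and the regular expression will also split things like decimal numbers, which a real tokenizer would handle more carefully.

```python
import re


def simple_tokenize(line):
    """Lowercase a line and put spaces around common punctuation marks."""
    lowered = line.lower()
    # Surround each punctuation mark with spaces so it becomes its own token.
    spaced = re.sub(r"([.,!?;:()\"])", r" \1 ", lowered)
    # Collapse the repeated whitespace introduced by the substitution.
    return " ".join(spaced.split())


print(simple_tokenize("business in the middle east and Russia. Giunter is"))
# prints: business in the middle east and russia . giunter is
```

Each line of the source file should then contain one tokenized sentence.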

yuxinz · January 9, 2018, 10:08am

Is there an option in the Torch version to force a minimum output length?
Even in the provided giga_0_pred.txt there are some short or empty sentences. I tested on the dataset for DUC2004 task 1 and the results are also bad. I also found that the larger the beam size, the shorter the average output length, which limits the ROUGE score (the recall metric).
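
To make the link between output length and recall concrete, here is a simplified unigram-recall computation in the spirit of ROUGE-1. It is a sketch only: it is not the official ROUGE script or files2rouge, it does no stemming or stopword handling, and the function name rouge1_recall is made up for illustration.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference token is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total


reference = "somali pirates send diamonds to moscow"
print(rouge1_recall("somali pirates", reference))
print(rouge1_recall(reference, reference))
```

A two-word output can match at most two of the six reference tokens, so its recall is capped at one third no matter how accurate those two words are.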

pltrdy · January 9, 2018, 10:26am

It is now implemented in OpenNMT-py since #496.
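
The general idea behind forcing a minimum length is to make the end-of-sentence token unselectable until enough tokens have been produced. The toy greedy decoder below sketches that idea. It is not the OpenNMT-py implementation (which works on beam search), and the function name, the "</s>" token and the score tables are all made up for illustration.

```python
def greedy_decode(step_scores, eos="</s>", min_length=0):
    """Pick the best token at each step, blocking eos before min_length.

    step_scores is a list of {token: score} dicts, a toy stand-in for
    the per-step output of a real model.
    """
    output = []
    for scores in step_scores:
        scores = dict(scores)  # copy so the caller's data is not modified
        if len(output) < min_length:
            # Too short so far: eos can never win this step.
            scores[eos] = float("-inf")
        best = max(scores, key=scores.get)
        if best == eos:
            break
        output.append(best)
    return output


steps = [
    {"the": 0.5, "</s>": 0.1},
    {"sense": 0.2, "</s>": 0.6},
    {"of": 0.3, "</s>": 0.4},
    {"smell": 0.1, "</s>": 0.9},
]
print(greedy_decode(steps))                # ['the']
print(greedy_decode(steps, min_length=3))  # ['the', 'sense', 'of']
```

Because this only changes decoding, it needs no retraining of the model.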

I’m not sure about the Lua version. I haven’t found anything like this in the translate.lua options, so probably not.

yuxinz · January 9, 2018, 10:29am

Yeah, I saw that thread as well, but I guess the model needs to be trained again using the Python script? Or can translate.py take the current model as an argument and make predictions?

pltrdy · January 9, 2018, 10:31am

You don’t have to train again specifically for this feature.

There may be some incompatibility between old models and the current code (depending on how old your model is), but not because of this change.

yuxinz · January 9, 2018, 10:40am

Actually I was asking whether the translate.py script can take the model produced by train.lua since I haven’t used the Python version.

I just tested and it certainly doesn’t work.

pltrdy · January 9, 2018, 10:49am

Oh, no. Transferring a model from Lua Torch to PyTorch isn’t possible.

Ifad · January 13, 2018, 11:09am

@SinaMohseni Thanks man! Your model works. Is there any way to make the result longer? For example,

input:
the sense of smell , as marcel proust and his madeleine made clear , is intimately tied to feeling and memory , so it is perhaps not surprising that in schizophrenia , an illness that plays havoc with the emotional capacities of those who suffer from it , the sense of smell is impaired .

result:
the sense of smell

I think the result would be better if I could make it longer.

SinaMohseni · January 15, 2018, 12:18am

Thanks for taking a look! You are right, I didn’t tokenize it beforehand.

SinaMohseni · January 15, 2018, 12:27am

I should say you are trying a very long and complicated sentence. Maybe start with something shorter?

As @pltrdy replied to me, make sure to tokenize the input. Also, your input should be in a single-sentence format. Take a look at this issue for more information.

vjstha20 · August 12, 2018, 8:14am

I am running Windows 10 and have followed all the steps, but when executing the first command:

python preprocess.py -train_src …/data/train/train.article.txt -train_tgt …/data/train/train.title.txt -valid_src …/data/train/valid.article.filter.txt -valid_tgt …/data/train/valid.title.filter.txt -save_data …/data/train/textsum

I am getting an error at init dtype. What’s wrong? Can anyone suggest a better way?

Huijun-Cui · February 7, 2019, 2:51pm

Hi, I am new to OpenNMT. I am using the Gigaword dataset that you recommended. When I look at the data file input.txt, I am confused because there are many tokens, and I don’t understand how this happened. It seems to have been preprocessed. If so, could you tell me how the tokens are generated during data processing?