Need a proper understanding of the preprocess step

JafferWilson · June 10, 2017, 2:43am

I am trying to build my own dataset for training, specifically for summarization. As far as I know, OpenNMT has three steps: preprocess, train, and translate.
I have a question regarding the preprocess step.
I have gathered articles and their summaries.
I am trying to run the preprocess step as:
python preprocess.py -train_src article.txt -train_tgt summary -valid_src ../data/train/valid.article.filter.txt -valid_tgt ../data/train/valid.title.filter.txt -save_data ../data/train/textsum
But I do not understand what I need to give for the -valid_src and -valid_tgt options. Do I need to replace the sources already used there? If so, what could the replacements be? I have used an article from the NYT for testing purposes.
Kindly help me.

jean.senellart · June 10, 2017, 6:47am

Hi @JafferWilson, see:

Your validation data is a smaller set containing source/target sentences, ideally not part of the training data, so that you can monitor progress.
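As a rough illustration of "not part of the training data", here is a minimal Python sketch that lists validation pairs which also occur in the training pairs. The function name and the toy sentences are made up for the example; in practice the lists would be read from your own line-aligned source and target files.

```python
def overlapping_pairs(train_src, train_tgt, valid_src, valid_tgt):
    """Return the validation (source, target) pairs that also occur in training."""
    train_pairs = set(zip(train_src, train_tgt))
    return [pair for pair in zip(valid_src, valid_tgt) if pair in train_pairs]


# Toy data (hypothetical): line i of a source list pairs with line i of its target list.
train_src = ["article one text", "article two text", "article three text"]
train_tgt = ["summary one", "summary two", "summary three"]
valid_src = ["article two text", "article four text"]
valid_tgt = ["summary two", "summary four"]

leaked = overlapping_pairs(train_src, train_tgt, valid_src, valid_tgt)
print(len(leaked), "validation pair(s) also appear in the training data")
```

If the list comes back non-empty, the validation score will look better than it should, because the model has already seen those examples.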

JafferWilson · June 10, 2017, 7:12am

My apologies, but I could not understand what you mean. Going by your reply, do you mean that I can put my own source and target data in as the validation src and tgt? Please guide me.

jean.senellart · June 10, 2017, 12:58pm

If I understand correctly, you want to use a single article for the training? It won’t work: your training corpus should contain a large number of article/summary pairs (as in this corpus: https://github.com/harvardnlp/sent-summary), and out of this set you keep a small subset for validation.
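Keeping a small subset aside can be sketched in Python as below. This is only an illustration under assumptions: the function name, the seed, and the toy data are hypothetical, and it assumes the article and summary files are line-aligned (line i of one pairs with line i of the other).

```python
import random


def split_parallel(src_lines, tgt_lines, valid_size, seed=0):
    """Split line-aligned source/target lists into training and validation pairs."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if not 0 < valid_size < len(src_lines):
        raise ValueError("valid_size must be between 1 and len(src_lines) - 1")

    # Shuffle the indices once, so each source line stays with its target line.
    indices = list(range(len(src_lines)))
    random.Random(seed).shuffle(indices)
    valid_idx = set(indices[:valid_size])

    train, valid = [], []
    for i, pair in enumerate(zip(src_lines, tgt_lines)):
        (valid if i in valid_idx else train).append(pair)
    return train, valid


# Toy corpus (hypothetical): ten article/summary pairs.
articles = [f"article {i}" for i in range(10)]
summaries = [f"summary {i}" for i in range(10)]

train_pairs, valid_pairs = split_parallel(articles, summaries, valid_size=2)
print(len(train_pairs), "training pairs,", len(valid_pairs), "validation pairs")
```

The two lists of pairs can then be written out as four text files, one example per line, and passed to preprocess.py as -train_src, -train_tgt, -valid_src and -valid_tgt.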

JafferWilson · June 12, 2017, 2:00am

Very true. I have gone through the training set that your reference link points to. But I am trying to replace the training article and training summary with my own article and summary.
What I have done is copy the complete article from the website and summarize it. Then I created two files: one containing the complete article and a second containing the summary.
I tried to use my article as train_src and my summary as train_tgt, but I kept valid_src and valid_tgt the same as the ones used in the example. And that is the problem: I am getting errors. See the topic I have mentioned: Having issue while preprocessing the data
Hence, I was trying to find out what I need to put in the files for valid_src and valid_tgt.
So, kindly let me know what I need to do. Thank you.