What does preprocess.py do with our dataset?

kyquang97 · March 9, 2019, 8:24am

Hi everyone, I’m kind of new to the OpenNMT-py toolkit. I’m following the quickstart steps in the OpenNMT-py documentation. When I ran preprocess.py, it generated 3 files: *.vocab.pt, *.train.0.pt and *.valid.0.pt. I still don’t know what happened to my dataset when I ran preprocess.py. Is there an explanation of this anywhere? Please share it with me. Thanks for your support.

guillaumekln · March 11, 2019, 9:09am

Hi,

The preprocessing does not do much. It builds the vocabularies from the most frequent tokens, filters out sentences that are too long, and assigns an index to each token.
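
To make those three steps concrete, here is a rough sketch in plain Python. It is not OpenNMT-py's actual code: the function names, the `<unk>`/`<pad>` tokens, the order of the steps, and the toy data are all assumptions made up for illustration.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # hypothetical special tokens for this sketch


def build_vocab(sentences, max_size):
    """Map the max_size most frequent tokens to integer indices."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}


def filter_too_long(pairs, max_len):
    """Keep only pairs where both source and target have at most max_len tokens."""
    return [(s, t) for s, t in pairs if len(s) <= max_len and len(t) <= max_len]


def numericalize(sentence, vocab):
    """Replace each token with its index; out-of-vocabulary tokens map to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence]


# Toy parallel data (made up): (source tokens, target tokens)
pairs = [
    ("the cat sat".split(), "le chat".split()),
    ("the dog sat on the mat today".split(), "le chien".split()),
    ("the cat".split(), "le chat".split()),
]

kept = filter_too_long(pairs, max_len=5)           # drops the 7-token source
src_vocab = build_vocab([s for s, _ in kept], 2)   # keeps only the 2 most frequent tokens
encoded = [numericalize(s, src_vocab) for s, _ in kept]
```

Here "sat" is not among the two most frequent source tokens, so it is encoded with the `<unk>` index. The real script also serializes the result into the .pt files, which this sketch leaves out.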

Look at the code for more details.