What does preprocess.py do with our dataset?

Hi everyone, I’m fairly new to the OpenNMT-py toolkit and I’m following the quickstart steps in the OpenNMT-py documentation. When I ran preprocess.py, it generated three files: *.vocab.pt, *.train.0.pt and *.valid.0.pt. I still don’t know what happened to my dataset when I ran preprocess.py. Is there an explanation of this somewhere? Please share it with me. Thanks for your support :)


Hi,

The preprocessing does not do much. It builds the vocabularies from the most frequent tokens, filters out sentences that are too long, and assigns an index to each token.

Look at the code for more details.
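
For intuition, here is a minimal Python sketch of those three steps. It is not the actual preprocess.py implementation: the function names, the special tokens, the default limits, and the order of the steps are all assumptions made for illustration, and it handles only one side (source or target) with plain whitespace tokenization.

```python
from collections import Counter

# Hypothetical special tokens, chosen for this sketch only.
PAD, UNK = "<pad>", "<unk>"


def build_vocab(sentences, max_size):
    """Map the max_size most frequent tokens to integer indices."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = [PAD, UNK] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}


def preprocess(lines, max_len=50, vocab_size=50000):
    """Tokenize, drop empty or overlong sentences, and index the tokens."""
    tokenized = [line.split() for line in lines]
    # Keep only non-empty sentences with at most max_len tokens.
    kept = [sent for sent in tokenized if 0 < len(sent) <= max_len]
    vocab = build_vocab(kept, vocab_size)
    unk = vocab[UNK]
    # Tokens outside the vocabulary fall back to the <unk> index.
    indexed = [[vocab.get(tok, unk) for tok in sent] for sent in kept]
    return vocab, indexed


# Example call on a tiny in-memory corpus (made-up data).
vocab, indexed = preprocess(["a b a", "c d e f g"], max_len=3, vocab_size=1)
```

In this toy call the second sentence is dropped because it has more than three tokens, and "b" is mapped to the unknown-token index because only the single most frequent token is kept in the vocabulary.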
