In OpenNMT-py 2.0.0, I used the preprocess.py script to preprocess data.
In OpenNMT-py 2.0.1, is there a similar script? How can we preprocess data with OpenNMT-py 2.0.1?
Thank you
Hello @tuankstn,
Before the OpenNMT-py 2.0 release, we used preprocess.py to iterate over the data in order to:
- Convert raw data into torchtext examples and store them as torchtext datasets, saved as *.train_{shard_id}.pt
- Build vocabularies by counting tokens in the data, in order to construct the torchtext fields, saved as *.vocab.pt
Since OpenNMT-py 2.0, we no longer need to preprocess the data into the torchtext binary format in advance. This is now done during training, i.e. "on-the-fly":
- Each line pair from the parallel corpus is handled exactly as before, but the dataset is created on-the-fly
- The vocabularies required by the model must be provided: you can use build_vocab.py to build them, which works similarly to the previous preprocess.py
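To give a rough idea of what "counting tokens to build a vocabulary" means, here is a minimal standalone sketch. It is not the code of build_vocab.py or preprocess.py: the function names (count_tokens, to_vocab_entries), the whitespace tokenization, and the min_frequency parameter are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens over an iterable of text lines.

    Hypothetical helper: real pipelines may apply other tokenization first.
    """
    counter: Counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


def to_vocab_entries(counter: Counter, min_frequency: int = 1) -> List[Tuple[str, int]]:
    """Return (token, count) pairs, most frequent first, ties broken alphabetically.

    Tokens seen fewer than min_frequency times are dropped.
    """
    entries = [(tok, n) for tok, n in counter.items() if n >= min_frequency]
    return sorted(entries, key=lambda e: (-e[1], e[0]))


if __name__ == "__main__":
    # Toy "source side" of a parallel corpus, one sentence per line.
    sample = ["the cat sat", "the dog sat down"]
    for token, count in to_vocab_entries(count_tokens(sample)):
        print(f"{token}\t{count}")
```

In practice you would run this kind of count separately on the source and target sides of the corpus, since each side needs its own vocabulary.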