How to preprocess data in OpenNMT-py 2.0.1

tuankstn · May 15, 2021, 8:11am

In OpenNMT-py 2.0.0, I used the preprocess.py script to preprocess data.
In OpenNMT-py 2.0.1, is there a similar script? How can I preprocess data with OpenNMT-py 2.0.1?
Thank you

Zenglinxiao · May 18, 2021, 2:01pm

Hello @tuankstn,
Before the OpenNMT-py 2.0 release, we used preprocess.py to iterate over the data in order to:

Convert the raw data into torchtext examples and store them as torchtext datasets: these are saved as *.train_{shard_id}.pt
Build the vocabularies by counting tokens in the data, in order to create the torchtext fields: these are saved as *.vocab.pt

Since OpenNMT-py 2.0, we no longer need to preprocess the data into the torchtext binary format in advance. This is now done during training, i.e. “on-the-fly”:

Each line pair from the parallel corpus is handled exactly as before, but the dataset is created on-the-fly
The vocabularies required by the model must still be provided: you can use build_vocab.py to generate them, which works similarly to the vocabulary step of the previous preprocess.py
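
To make the two steps above concrete, here is a minimal conceptual sketch in plain Python. It is not OpenNMT-py's actual code: the function names `count_tokens` and `iter_examples`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration. It only shows the idea of counting tokens once to get a vocabulary, then producing examples lazily during training instead of saving them to disk first.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens (hypothetical stand-in for a vocabulary pass)."""
    counter: Counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


def iter_examples(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[List[str], List[str]]]:
    """Lazily yield tokenized (source, target) pairs instead of writing shards to disk."""
    for src, tgt in zip(src_lines, tgt_lines):
        yield src.split(), tgt.split()


# Hypothetical toy corpus standing in for the parallel text files.
src_corpus = ["hello world", "hello there"]
tgt_corpus = ["hallo welt", "hallo du"]

# Step 1: the vocabulary is built once, ahead of training.
src_vocab = count_tokens(src_corpus)
tgt_vocab = count_tokens(tgt_corpus)

# Step 2: examples are produced on demand while iterating.
examples = list(iter_examples(src_corpus, tgt_corpus))
```

In a real setup the two iterables would be open file handles over the source and target text files, so nothing beyond the vocabulary needs to be stored in advance.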

tuankstn · December 25, 2021, 3:40pm

@Zenglinxiao Thank you so much.