How to preprocess data in OpenNMT 2.0.1

In OpenNMT 2.0.0, I used the preprocess.py script to preprocess data.
In OpenNMT 2.0.1, is there a similar script? How can we preprocess data with OpenNMT 2.0.1?
Thank you

Hello @tuankstn,
Before the OpenNMT-py 2.0 release, we used preprocess.py to iterate over the data in order to:

  • Convert raw data into torchtext examples and store them as torchtext datasets: this will be saved as *.train_{shard_id}.pt
  • Find vocabularies by counting tokens in the data in order to build torchtext fields: this will be saved as *.vocab.pt
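
As a rough illustration of the second step, counting tokens to find a vocabulary can be sketched in plain Python. This is not the code from preprocess.py: the helper name `count_tokens`, the whitespace tokenization and the `min_freq` parameter are assumptions made for the example.

```python
from collections import Counter


def count_tokens(lines, min_freq=1):
    """Count whitespace-separated tokens and return (token, count) pairs.

    Hypothetical helper for illustration; not part of OpenNMT-py.
    """
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    # Most frequent first; ties are broken alphabetically for a stable order.
    pairs = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop tokens that occur fewer than min_freq times.
    return [(tok, n) for tok, n in pairs if n >= min_freq]


# Tiny made-up corpus, one sentence per line.
corpus = ["the cat sat", "the dog sat down"]
vocab = count_tokens(corpus)
```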

Since OpenNMT-py 2.0, we no longer need to preprocess the data into the torchtext binary format in advance. This is now done during training, i.e. “on-the-fly”:

  • Each line pair from the parallel corpus is handled exactly as before, but the dataset is created on-the-fly
  • The vocabularies required by the model should be provided: you can use build_vocab.py to build them, which works similarly to the previous preprocess.py
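
The on-the-fly idea in the first point can be sketched with a generator that reads both sides of a parallel corpus lazily, so nothing has to be serialized beforehand. This is only a conceptual sketch, not how OpenNMT-py implements it: `iter_parallel`, the whitespace tokenization and the in-memory sample files are all assumptions for illustration.

```python
import io


def iter_parallel(src_file, tgt_file):
    """Lazily yield (src_tokens, tgt_tokens) for each aligned line pair.

    Hypothetical helper for illustration; not part of OpenNMT-py.
    """
    # zip reads one line from each file at a time and stops at the shorter one.
    for src_line, tgt_line in zip(src_file, tgt_file):
        yield src_line.split(), tgt_line.split()


# In-memory stand-ins for the source and target text files.
src = io.StringIO("hello world\ngood morning\n")
tgt = io.StringIO("hallo welt\nguten morgen\n")

pairs = list(iter_parallel(src, tgt))
```

In a real training loop the pairs would be consumed one by one rather than collected into a list; the list here only makes the result easy to inspect.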

@Zenglinxiao Thank you so much.