OpenNMT Forum

Preprocess on the fly with big datasets

Is it possible to do an on the fly preprocessing to avoid generating huge files when the dataset contains millions of sentences (>50M)?

Hey there,
A PR adding such a feature has been open for a while:
It's too big a change to the codebase to merge as is, though, so @Zenglinxiao is working on a "rationalized" implementation that should be ready soon.

Any guess as to when it will be available? 50M sentences take a lot of disk space after preprocessing.

This problem is solved with OpenNMT-py v2.0: OpenNMT-py 2.0 release
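For reference, OpenNMT-py 2.0 drops the separate preprocessing step: corpora and transforms are declared in a YAML config, and transforms are applied on the fly during training, so no large preprocessed binaries are written to disk. A minimal sketch (all paths and corpus names below are placeholders):

```yaml
# Minimal OpenNMT-py 2.0 style config.
# Data is read and transformed on the fly at training time,
# so no large preprocessed files are generated.
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
  corpus_1:
    path_src: data/train.src   # raw source text, one sentence per line
    path_tgt: data/train.tgt   # raw target text, aligned line by line
  valid:
    path_src: data/valid.src
    path_tgt: data/valid.tgt

# Transforms applied on the fly to each batch as it is read.
transforms: [filtertoolong]
src_seq_length: 200
tgt_seq_length: 200
```

With a config like this, the vocabulary is built once with `onmt_build_vocab -config config.yaml -n_sample -1`, and training then streams the raw corpora directly via `onmt_train -config config.yaml`.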