OpenNMT Forum

Preprocess on the fly with big datasets

Hi,
Is it possible to do on-the-fly preprocessing to avoid generating huge files when the dataset contains millions of sentences (>50M)?
Thanks

Hey there,
A PR with such a feature has been open for a while: https://github.com/OpenNMT/OpenNMT-py/pull/1779
It’s too big a change to the codebase to merge as is, though, so @Zenglinxiao is working on a “rationalized” implementation that should be ready soon.
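
In the meantime, here is a minimal sketch of the general idea (this is not OpenNMT-py’s actual API, just an illustration): read and tokenize parallel sentences lazily as a stream, so nothing preprocessed is ever written to disk.

```python
# Illustrative only, not OpenNMT-py code: on-the-fly preprocessing means
# tokenizing examples lazily as they are read, instead of materializing
# preprocessed shards on disk beforehand.
from itertools import islice


def stream_examples(src_path, tgt_path):
    """Yield (src_tokens, tgt_tokens) pairs one at a time, so the full
    corpus never lives in memory or on disk in preprocessed form."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            # Hypothetical whitespace tokenization; a real pipeline would
            # plug in its own tokenizer / subword model here.
            yield src_line.strip().split(), tgt_line.strip().split()


def batches(example_iter, batch_size=64):
    """Group the streamed examples into mini-batches for training."""
    while True:
        batch = list(islice(example_iter, batch_size))
        if not batch:
            break
        yield batch
```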

Any guess as to when it will be available? 50M sentences take up so much disk space after preprocessing.