Preprocess on the fly with big datasets

Is it possible to do an on the fly preprocessing to avoid generating huge files when the dataset contains millions of sentences (>50M)?

Hey there,
A PR with such a feature has been open for a while.
It's too big a change to the codebase to merge as is, though, so @Zenglinxiao is working on a "rationalized" implementation that should be ready soon.

Any guess as to when it will be available? 50M sentences take up a lot of disk space after preprocessing.

This problem is solved with OpenNMT-py v2.0: OpenNMT-py 2.0 release
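For anyone curious about the general idea rather than the OpenNMT-py 2.0 specifics: on-the-fly preprocessing means tokenizing each line lazily as it is read, instead of writing a fully preprocessed copy of the corpus to disk. A minimal sketch (generic Python, not OpenNMT-py's actual API; `stream_tokenized` and whitespace tokenization are illustrative assumptions):

```python
import os
import tempfile


def stream_tokenized(path):
    # Generator: lazily read and tokenize one sentence per line,
    # so the preprocessed corpus is never materialized on disk
    # or held in memory all at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Whitespace split stands in for a real tokenizer here.
            yield line.strip().split()


# Tiny demo corpus; a real setup would stream millions of lines.
with tempfile.NamedTemporaryFile(
    "w", delete=False, suffix=".txt", encoding="utf-8"
) as tmp:
    tmp.write("hello world\non the fly preprocessing avoids huge files\n")
    path = tmp.name

for tokens in stream_tokenized(path):
    print(tokens)

os.remove(path)
```

Because the generator yields one example at a time, disk usage stays at the size of the raw text, at the cost of redoing tokenization on every epoch (or caching it in a bounded buffer).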