Is it possible to do on-the-fly preprocessing to avoid generating huge files when the dataset contains millions of sentences (>50M)?
A PR with such a feature has been open for a while: https://github.com/OpenNMT/OpenNMT-py/pull/1779
It’s too big a change to the codebase to merge as is, though, so @Zenglinxiao is working on a “rationalized” implementation that should be ready soon.
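For anyone wondering what on-the-fly preprocessing means in general, here is a minimal sketch of the idea in plain Python. It is not the code from the PR and not OpenNMT-py's API: the names `stream_examples` and `batches` are hypothetical, and whitespace splitting stands in for whatever real tokenization you use. The point is only that examples are read and tokenized lazily, so nothing preprocessed is written to disk.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Example = Tuple[List[str], List[str]]


def stream_examples(src_lines: Iterable[str],
                    tgt_lines: Iterable[str]) -> Iterator[Example]:
    """Lazily pair and tokenize source/target lines.

    Hypothetical helper: whitespace split is a placeholder tokenizer.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok, tgt_tok = src.split(), tgt.split()
        if src_tok and tgt_tok:  # skip pairs where either side is empty
            yield src_tok, tgt_tok


def batches(examples: Iterable[Example],
            batch_size: int) -> Iterator[List[Example]]:
    """Group a stream of examples into lists of at most batch_size items."""
    it = iter(examples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Passing two open file handles (one for the source side, one for the target side) as `src_lines` and `tgt_lines` streams the corpus line by line, so memory and disk use stay roughly constant regardless of how many sentences the dataset has.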
Any guess when it will be available? 50M sentences take up so much disk space after preprocessing.