Preprocess on the fly with big datasets

anderleich · July 7, 2020, 11:07am

Hi,
Is it possible to preprocess on the fly, to avoid generating huge files when the dataset contains millions of sentences (>50M)?
Thanks

francoishernandez · July 7, 2020, 12:35pm

Hey there,
A PR with such a feature has been open for a while: https://github.com/OpenNMT/OpenNMT-py/pull/1779
It’s too big a change to the codebase to merge as is, though, so @Zenglinxiao is working on a “rationalized” implementation that should be ready soon.

anderleich · July 9, 2020, 2:51pm

Any guess when it will be available? 50M sentences take up so much disk space after preprocessing.

anderleich · October 1, 2020, 2:30pm

This problem is solved with OpenNMT-py v2.0: OpenNMT-py 2.0 release
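
For anyone wondering what "on the fly" means in practice, here is a minimal sketch of the general idea. It is not OpenNMT-py's actual implementation or API: the function names `stream_examples` and `batches` are made up for illustration, and whitespace splitting stands in for whatever tokenization or filtering you would really apply. The point is that the raw text files are read lazily and transformed line by line, so nothing preprocessed is ever written to disk.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple

# One training example: (source tokens, target tokens).
Example = Tuple[List[str], List[str]]


def stream_examples(src_path: str, tgt_path: str) -> Iterator[Example]:
    """Lazily yield tokenized (source, target) pairs, one line pair at a time.

    Hypothetical helper for illustration; only the current pair of lines
    is held in memory, regardless of corpus size.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            # Placeholder transform: plain whitespace tokenization.
            yield src_line.split(), tgt_line.split()


def batches(examples: Iterable[Example], batch_size: int) -> Iterator[List[Example]]:
    """Group a stream of examples into lists of at most batch_size items."""
    iterator = iter(examples)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Usage would look like `for batch in batches(stream_examples("train.src", "train.tgt"), 64): ...`, where the file names are placeholders. The trade-off is that the transform is recomputed on every pass over the data, in exchange for not storing a preprocessed copy of a 50M-sentence corpus.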