OpenNMT Forum

File size increased 2.5 times after preprocessing speech-to-text data

I have 41 GB of data to preprocess. After preprocessing, it grew to approximately 100 GB.
What should I do to reduce the size of the preprocessed files?

The order of magnitude seems on par with the demo dataset. This is most likely because preprocessing dumps the additional computed features to the shards. I am not sure we can do anything apart from changing parameters like frame duration/stride or feature size, but this may hurt your model's performance.
You could also try to compute the features on the fly if you feel comfortable diving into the code.
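
To get a feel for why stored features can outweigh the raw audio, here is a rough back-of-the-envelope sketch. All values are assumptions chosen for illustration and are not confirmed anywhere in this thread: 16 kHz 16-bit mono PCM input, a 20 ms window, a 10 ms stride, and one float32 value per frequency bin. The function names are hypothetical, not part of OpenNMT.

```python
def raw_bytes_per_second(sample_rate=16000, bytes_per_sample=2):
    """Size of one second of uncompressed mono PCM audio (assumed 16-bit)."""
    return sample_rate * bytes_per_sample


def feature_bytes_per_second(sample_rate=16000, window_size=0.02,
                             window_stride=0.01, bytes_per_value=4):
    """Size of one second of spectrogram features stored as float32.

    Assumes one frame per stride and n_fft // 2 + 1 frequency bins per frame.
    """
    n_fft = round(sample_rate * window_size)   # samples per analysis window
    feature_dim = n_fft // 2 + 1               # frequency bins per frame
    frames_per_second = 1.0 / window_stride
    return feature_dim * frames_per_second * bytes_per_value


raw = raw_bytes_per_second()
feat = feature_bytes_per_second()
ratio = feat / raw
print(f"raw: {raw} B/s, features: {feat:.0f} B/s, ratio: {ratio:.2f}x")

# Doubling the stride halves the number of frames, and so the feature size.
feat_wide_stride = feature_bytes_per_second(window_stride=0.02)
print(f"with 20 ms stride: {feat_wide_stride / raw:.2f}x")
```

With these assumed settings the features take roughly twice the space of the raw audio, which is in the same ballpark as the growth reported above. If the source audio is compressed or uses a different sample format, the ratio will differ. A larger stride or a smaller feature size shrinks the output, at the possible cost of accuracy.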

The only way is to shard.

Sharding did not help; I tried it.

Not helpful in what sense?
It produces much smaller files that can be handled easily, no?

You are right about the handling part, but the total size of the preprocessed files did not decrease.
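
The point made in the last two replies can be illustrated with a tiny sketch: splitting the output into shards changes how large each file is, but not the total. The sizes below are made-up numbers and `shard` is a hypothetical helper, not an OpenNMT function.

```python
def shard(items, shard_size):
    """Split a list into consecutive chunks of at most shard_size items."""
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]


# Made-up per-utterance feature sizes in MB, for illustration only.
example_sizes = [120, 80, 200, 50, 150]

shards = shard(example_sizes, 2)
shard_totals = [sum(s) for s in shards]

print("size of each shard (MB):", shard_totals)
print("total before sharding (MB):", sum(example_sizes))
print("total after sharding (MB):", sum(shard_totals))
```

Each shard is easier to load and move around, but the sum over all shards equals the original total, so sharding alone does not save disk space.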