OpenNMT Forum

Randomly shuffling examples (OpenNMT-py v2.0)

Hi,

I’m carrying out some experiments using the brand new OpenNMT-py v2 release. I was wondering whether any shuffling is done prior to training, or whether we still need to randomly shuffle the data ourselves before building the vocabulary, as we did before the preprocessing step in the previous version. If shuffling is performed, do we need to set a flag?

Thanks

Shuffling is not performed when reading the data.
The closest thing to it is the ‘pooling’ mechanism (the pool_factor option), which loads the equivalent of pool_factor batches of examples, sorts these examples by length (for efficiency, since batches of similar-length examples need less padding), and shuffles the resulting batches among themselves before yielding them.
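To make the pooling idea concrete, here is a minimal sketch in plain Python. It is not the OpenNMT-py implementation: the function name pooled_batches is made up for illustration, and it assumes batch_size counts sentences rather than tokens.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def pooled_batches(
    examples: Iterable[Sequence[str]],
    batch_size: int,
    pool_factor: int,
    seed: int = 0,
) -> Iterator[List[Sequence[str]]]:
    """Yield length-sorted batches, shuffled only within each pool.

    Hypothetical helper: assumes batch_size is a number of examples.
    """
    rng = random.Random(seed)
    pool_size = batch_size * pool_factor  # examples held in memory at once
    pool: List[Sequence[str]] = []

    def flush(items: List[Sequence[str]]) -> List[List[Sequence[str]]]:
        # Sort the pool by length so each batch holds similar-length examples.
        items.sort(key=len)
        batches = [
            items[i:i + batch_size] for i in range(0, len(items), batch_size)
        ]
        # Shuffle the batches among themselves, not the individual examples.
        rng.shuffle(batches)
        return batches

    for example in examples:
        pool.append(example)
        if len(pool) == pool_size:
            yield from flush(pool)
            pool = []
    if pool:  # last, possibly smaller, pool
        yield from flush(pool)
```

Note that an example can never leave its pool: the first pool_factor batches always come from the first batch_size * pool_factor examples of the corpus. If the corpus is ordered (by domain or by source, for instance), that ordering survives at the pool level.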
Shuffling the whole corpus on the fly is not trivial, at least not without loading all the data into memory (which we often can’t / don’t want to do).
Also, it’s not that much of a burden to shuffle datasets prior to using OpenNMT.
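For reference, shuffling a parallel corpus beforehand can be sketched as below. This is only an illustration, not part of OpenNMT-py: the function name shuffle_parallel and the file names in the usage note are hypothetical, and the sketch assumes both files fit in memory and are line-aligned.

```python
import random
from typing import List


def _read_lines(path: str) -> List[str]:
    # One sentence per line; strip only the trailing newline.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def shuffle_parallel(
    src_path: str,
    tgt_path: str,
    out_src_path: str,
    out_tgt_path: str,
    seed: int = 1234,
) -> int:
    """Shuffle source and target files with the same permutation.

    Returns the number of sentence pairs written.
    """
    src_lines = _read_lines(src_path)
    tgt_lines = _read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files have different line counts")

    # A single permutation applied to both sides keeps the pairs aligned.
    order = list(range(len(src_lines)))
    random.Random(seed).shuffle(order)

    with open(out_src_path, "w", encoding="utf-8") as out_src, \
            open(out_tgt_path, "w", encoding="utf-8") as out_tgt:
        for index in order:
            out_src.write(src_lines[index] + "\n")
            out_tgt.write(tgt_lines[index] + "\n")
    return len(order)
```

A call such as shuffle_parallel("train.src", "train.tgt", "train.shuf.src", "train.shuf.tgt") would then produce the files to point the training configuration at. Fixing the seed makes the shuffle reproducible across runs.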

Thanks @francoishernandez!
Is pool_factor applied by default?

Yes, with a default value of 8192: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L469

You probably want to increase the bucket_size as well: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L594-L595