Randomly shuffling examples (OpenNMT-py v2.0)

Hi,

I’m carrying out some experiments with the brand new OpenNMT-py v2 release. I was wondering whether some shuffling is done prior to training, or whether we still need to randomly shuffle the data ourselves, as we did with the previous version before the preprocessing step that built the vocabulary. If shuffling is performed, do we need to set some flag?

Thanks

Shuffling is not performed when reading the data.
The only thing approaching this is the ‘pooling’ mechanism (the pool_factor option), which loads the equivalent of pool_factor batches of examples, sorts these examples by length (for optimization purposes), builds the batches, and shuffles these batches between themselves before yielding them.
Performing a corpus-wide shuffle of the data on the fly is not trivial, at least not without loading the whole dataset into memory (which we often can’t, or don’t want to, do).
Also, it’s not that much of a burden to shuffle datasets prior to using OpenNMT.
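
To make the pooling idea concrete, here is a rough sketch in Python of how such a mechanism can work. This is not the actual OpenNMT-py code, just an illustration; it assumes each example is a list of tokens, so len() gives its length.

import random

def pooled_batches(examples, batch_size, pool_factor):
    # Accumulate roughly pool_factor batches worth of examples, then
    # sort, batch and shuffle them before yielding.
    pool = []
    for ex in examples:
        pool.append(ex)
        if len(pool) == batch_size * pool_factor:
            yield from _batches_from_pool(pool, batch_size)
            pool = []
    if pool:
        yield from _batches_from_pool(pool, batch_size)

def _batches_from_pool(pool, batch_size):
    pool.sort(key=len)  # similar lengths end up in the same batch (less padding)
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    random.shuffle(batches)  # shuffle the batches between themselves
    yield from batches

So examples of similar length are grouped, but the order in which batches are seen is randomized within each pool.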

Thanks @francoishernandez!
Is pool_factor applied by default?

Yes, with a default value of 8192: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L469

You probably want to increase the bucket_size as well: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L594-L595
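
Both can be set directly in the training YAML config; for example (the values below are purely illustrative, not recommendations):

pool_factor: 8192
bucket_size: 32768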

When using different corpora, are the examples taken randomly from each set to form a batch? When concatenating different corpora into a single file, I’ve observed that, if no shuffling is performed, some checkpoints have learned one specific corpus while other checkpoints have learned another. That’s why I’m asking: can we be sure that all batches will contain examples from all the corpora? This would avoid having to concatenate all the corpora and shuffle them before training.

Thanks

As stated in the docs:

Each entry of the data configuration will have its own weight. When building batches, we’ll sequentially take weight examples from each corpus.
Note: don’t worry about batch homogeneity/heterogeneity, the pooling mechanism is here for that reason. Instead of building batches one at a time, we will load pool_factor batches worth of examples, sort them by length, build batches and then yield them in a random order.
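
A simplified sketch of that weighted interleaving (again, not the actual implementation, just an illustration):

def weighted_interleave(corpora):
    # corpora: list of (example_iterator, weight) pairs; each iterator is
    # assumed to be endless (e.g. wrapped in itertools.cycle), since the
    # batch building process loops over the corpora indefinitely.
    while True:
        for examples, weight in corpora:
            for _ in range(weight):
                yield next(examples)

The resulting stream then goes through the pooling step, so the strict per-corpus pattern is blurred within each pool before the batches are built.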

Understood. However, I don’t really understand how the weights work. Say:

data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/wmt/europarl-v7.de-en.en
        path_tgt: data/wmt/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/wmt/news-commentary-v11.de-en.en
        path_tgt: data/wmt/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3

I guess it takes 23 examples from commoncrawl, 19 examples from europarl and 3 from news_commentary. If the batch size is 128, is this repeated multiple times? What if it is not a multiple?
How can I give equal weight to all corpora? And what if some corpus is small? Will it be seen more than once in training before the biggest corpus has been seen completely?

I guess it takes 23 examples from commoncrawl, 19 examples from europarl and 3 from news_commentary.

Yes.

If the batch size is 128, is this repeated multiple times?

Yes. The batch building process is basically an infinite loop.

What if it is not a multiple?

Batch building and weighting are not strict. See the part about pool_factor above.

How can I give equal weight to all corpora?

In the sense that every corpus will be seen equally frequently? Just set all weights to 1. It will then sample one example per corpus iteratively.
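
For instance, keeping the paths and transforms from your config above, that would just be:

data:
    commoncrawl:
        weight: 1    # path_src / path_tgt / transforms as before
    europarl:
        weight: 1
    news_commentary:
        weight: 1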

What if some corpus is small? Will it be seen more than once in training before the biggest corpus has been seen completely?

Yes. If you want to replicate the ‘concatenation’ behaviour, you can give approximate weights based on the sizes of your datasets. For example, if dataset A is 10 times bigger than dataset B, set weight A = 10 and weight B = 1.
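
A hypothetical example (the corpus names and paths are made up): if big_corpus has roughly ten times as many sentence pairs as small_corpus:

data:
    big_corpus:
        path_src: data/big_corpus.src
        path_tgt: data/big_corpus.tgt
        weight: 10
    small_corpus:
        path_src: data/small_corpus.src
        path_tgt: data/small_corpus.tgt
        weight: 1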

Perfect! That solves all my doubts! Thanks for your time

Is it possible to have more than one validation dataset?

No, that’s not handled for now.