Randomly shuffling examples (OpenNMT-py v2.0)

Hi,

I’m carrying out some experiments with the brand new OpenNMT-py v2 release. I was wondering whether some shuffling is done prior to training, or whether we still need to randomly shuffle the data ourselves, as we did with the previous version before the preprocessing step that built the vocabulary. If shuffling is performed, do we need to set some flag?

Thanks

Shuffling is not performed when reading the data.
The only thing approaching this is the ‘pooling’ mechanism (the pool_factor option), which loads the equivalent of pool_factor batches of examples, sorts these examples by length (for optimization purposes), builds the batches, and shuffles these batches between themselves before yielding them.
Performing a corpus-wide shuffle of the data on the fly is not trivial, at least not without loading the whole dataset into memory (which we often can’t, or don’t want to, do).
Also, it’s not that much of a burden to shuffle datasets prior to using OpenNMT.
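
To make the pooling idea concrete, here is a rough sketch in Python of how such a mechanism can work. This is not the actual OpenNMT-py code, just an illustration; it assumes each example is a list of tokens, so len() gives its length.

import random

def pooled_batches(examples, batch_size, pool_factor):
    # Accumulate roughly pool_factor batches worth of examples, then
    # sort, batch and shuffle them before yielding.
    pool = []
    for ex in examples:
        pool.append(ex)
        if len(pool) == batch_size * pool_factor:
            yield from _batches_from_pool(pool, batch_size)
            pool = []
    if pool:
        yield from _batches_from_pool(pool, batch_size)

def _batches_from_pool(pool, batch_size):
    pool.sort(key=len)  # similar lengths end up in the same batch (less padding)
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    random.shuffle(batches)  # shuffle the batches between themselves
    yield from batches

So examples of similar length are grouped, but the order in which batches are seen is randomized within each pool.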

Thanks @francoishernandez!
Is pool_factor applied by default?

Yes, with a default value of 8192: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L469

You probably want to increase the bucket_size as well: https://github.com/OpenNMT/OpenNMT-py/blob/8b073fb2a047509ff590839b1194a155ec1a50bf/onmt/opts.py#L594-L595
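
Both can be set directly in the training YAML config; for example (the values below are purely illustrative, not recommendations):

pool_factor: 8192
bucket_size: 32768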

When using different corpora, are the examples taken randomly from each set to form a batch? When concatenating different corpora into a single file, I’ve observed that, if no shuffling is performed, some checkpoints have learned one specific corpus while other checkpoints have learned another. That’s why I’m asking: can we be sure that all batches will contain examples from all the corpora? This would avoid having to concatenate all the corpora and shuffle them before training.

Thanks

As stated in the docs:

Each entry of the data configuration will have its own weight. When building batches, we’ll sequentially take weight examples from each corpus.
Note: don’t worry about batch homogeneity/heterogeneity, the pooling mechanism is here for that reason. Instead of building batches one at a time, we will load pool_factor batches worth of examples, sort them by length, build batches and then yield them in a random order.
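
A simplified sketch of that weighted interleaving (again, not the actual implementation, just an illustration):

def weighted_interleave(corpora):
    # corpora: list of (example_iterator, weight) pairs; each iterator is
    # assumed to be endless (e.g. wrapped in itertools.cycle), since the
    # batch building process loops over the corpora indefinitely.
    while True:
        for examples, weight in corpora:
            for _ in range(weight):
                yield next(examples)

The resulting stream then goes through the pooling step, so the strict per-corpus pattern is blurred within each pool before the batches are built.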

Understood. However, I don’t really understand how the weights work. Say:

data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/wmt/europarl-v7.de-en.en
        path_tgt: data/wmt/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/wmt/news-commentary-v11.de-en.en
        path_tgt: data/wmt/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3

I guess it takes 23 examples from commoncrawl, 19 examples from europarl and 3 from news_commentary. If the batch size is 128, is this repeated multiple times? What if it is not a multiple?
How can I give equal weight to all corpora? And what if some corpus is small? Will it be seen more than once in training before the biggest corpus has been seen completely?

I guess it takes 23 examples from commoncrawl, 19 examples from europarl and 3 from news_commentary.

Yes.

If the batch size is 128, is this repeated multiple times?

Yes. The batch building process is basically an infinite loop.

What if it is not a multiple?

Batch building and weighting are not strict. See the part about pool_factor above.

How can I give equal weight to all corpora?

In the sense that every corpus will be seen equally frequently? Just set all weights to 1. It will then sample one example per corpus iteratively.
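
For instance, keeping the paths and transforms from your config above, that would just be:

data:
    commoncrawl:
        weight: 1    # path_src / path_tgt / transforms as before
    europarl:
        weight: 1
    news_commentary:
        weight: 1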

What if some corpus is small? Will it be seen more than once in training before the biggest corpus has been seen completely?

Yes. If you want to replicate the ‘concatenation’ behaviour, you can give approximate weights based on the sizes of your datasets. For example, if dataset A is 10 times bigger than dataset B, set weight A = 10 and weight B = 1.
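
A hypothetical example (the corpus names and paths are made up): if big_corpus has roughly ten times as many sentence pairs as small_corpus:

data:
    big_corpus:
        path_src: data/big_corpus.src
        path_tgt: data/big_corpus.tgt
        weight: 10
    small_corpus:
        path_src: data/small_corpus.src
        path_tgt: data/small_corpus.tgt
        weight: 1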

Perfect! That solves all my doubts! Thanks for your time

Is it possible to have more than one validation dataset?

No, that’s not handled for now.