Shuffle samples during training

Sasanita · February 25, 2020, 6:09pm

Hi all,
I did not find any information in the forum about how sentences are shuffled during training.
I found this parameter, but I am not sure how it works:

  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 500000

Could someone explain the behavior of this parameter?
Is the shuffling done over all sentences or at the batch level?
Regards, and thanks in advance.

guillaumekln · February 26, 2020, 9:06am

Hi,

The shuffling is applied at the sentence level. In your example, the training will keep a buffer of size 500000 with shuffled sentences. The next batch will pick N sentences from this buffer.

The buffer is regularly refilled with sentences loaded from the file.
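
As a rough illustration of the buffer described above, here is a minimal pure-Python sketch. It is not the OpenNMT-tf implementation: the function name `buffered_shuffle` and its arguments are hypothetical, and it assumes the buffer hands out one random sentence at a time and refills the freed slot with the next sentence from the file.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: Optional[int], seed: int = 0
) -> Iterator[T]:
    """Yield items in a locally shuffled order using a bounded buffer.

    Hypothetical helper mirroring the config comment:
    0 or None disables shuffling, -1 buffers the whole input.
    """
    rng = random.Random(seed)

    if not buffer_size:  # 0 or None: shuffling disabled, keep file order
        yield from items
        return

    if buffer_size < 0:  # -1: the buffer holds every example
        everything = list(items)
        rng.shuffle(everything)
        yield from everything
        return

    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Initial fill: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and refill its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item

    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    sentences = [f"sentence {i}" for i in range(20)]
    for sentence in buffered_shuffle(sentences, buffer_size=5):
        print(sentence)
```

With a small buffer, a sentence can only be emitted early by a limited amount relative to its position in the file, which is why a buffer much smaller than the dataset gives only a local shuffle.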

Sasanita · February 26, 2020, 11:36am

Thanks a lot for the reply.
I don’t understand when the shuffle is done. If I have 10M sentences, are the first 500k sentences taken and shuffled, with the batches drawn from those 500k? And later, are another 500k sentences taken and shuffled?

So, if I have 10M sentences and the parameter is set to 10M, is the shuffle done after each epoch?

Thanks in advance.

guillaumekln · February 26, 2020, 12:05pm

Correct.

In practice it’s not exactly the first 500k: when the shuffle buffer size is smaller than the dataset size, the training splits the dataset into 10M/500k = 20 shards and visits them in a random order.

Yes. If you have enough memory, it’s best to set the buffer to the size of the training dataset so that you get a uniform shuffle.
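
To make the shard arithmetic concrete, here is a small sketch of the 10M/500k example. It assumes the shards are contiguous slices of the file, each the size of the buffer, which this thread does not confirm; `shard_visit_order` is a hypothetical name used for illustration, not an OpenNMT-tf function.

```python
import math
import random
from typing import List, Tuple


def shard_visit_order(
    dataset_size: int, buffer_size: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) line ranges of the shards in a random visit order.

    Assumption: each shard is a contiguous slice of at most buffer_size lines.
    """
    num_shards = math.ceil(dataset_size / buffer_size)
    starts = [i * buffer_size for i in range(num_shards)]
    random.Random(seed).shuffle(starts)  # visit the shards in a random order
    return [(start, min(start + buffer_size, dataset_size)) for start in starts]


if __name__ == "__main__":
    order = shard_visit_order(dataset_size=10_000_000, buffer_size=500_000)
    print(len(order))  # 20 shards, as in the example above
    print(order[:3])   # the first three shards that would be visited
```

Under this assumption, sentences from different shards are never mixed within the same buffer, whereas a buffer as large as the dataset lets any sentence land anywhere in the epoch.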

tel34 · March 12, 2020, 11:41am

Hi Sasanita, I’m considering this issue now with a fairly large dataset. What shuffle buffer size did you decide on in the end?

Sasanita · March 12, 2020, 4:23pm

Hi @tel34, I just used the default value, sample_buffer_size: 500000.
Regards!

tel34 · March 16, 2020, 2:44pm

Hi @Sasanita. Just reporting that I also went with 500,000 and that there was a significant improvement in BLEU compared with setting sample_buffer_size to null.
Regards,
Terence

Sasanita · March 16, 2020, 4:17pm

Thanks for sharing that, @tel34.
Regards