Shuffle samples during training

Hi all,
I could not find any information in the forum about how sentences are shuffled during training.
I found this parameter, but I am not sure how it works:

  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 500000

Could someone explain the behavior of this parameter?
Is the shuffling done over all sentences or at the batch level?
Regards, and thanks in advance.

Hi,

The shuffling is applied at the sentence level. In your example, the training will keep a buffer of size 500000 with shuffled sentences. The next batch will pick N sentences from this buffer.

The buffer is regularly refilled with sentences loaded from the file.
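To make the buffer behavior concrete, here is a minimal sketch in plain Python of a fixed-size shuffle buffer that is refilled as elements are drawn. It is only an illustration under my own assumptions: the function name `buffered_shuffle` and the `seed` argument are hypothetical, and the actual training code may differ in its details.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items in a shuffled order using a fixed-size buffer (illustrative only)."""
    if buffer_size <= 0:
        # Mirrors "set 0 to disable shuffling": pass items through unchanged.
        yield from items
        return
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # The buffer is still being filled, nothing is emitted yet.
            buffer.append(item)
            continue
        # The buffer is full: emit a random element and refill its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


sentences = [f"sentence {i}" for i in range(10)]
shuffled = list(buffered_shuffle(sentences, buffer_size=4))
print(shuffled)
```

With a buffer of 4, the first sentence emitted can only be one of the first 4 read from the file, which is why a small buffer gives a more local shuffle than a buffer covering the whole dataset.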

Thanks a lot for the reply.
I don’t understand when the shuffling is done. If I have 10M sentences, are the first 500k sentences taken and shuffled, and the batches drawn from those 500k? Then, later, another 500k sentences are taken and shuffled?

So, if I have 10M sentences and the parameter is set to 10M, the shuffling is done after each epoch?

Thanks in advance.

Correct.

In practice it’s not exactly the first 500k: when the shuffle buffer size is smaller than the dataset size, the training splits the dataset into 10M/500k = 20 shards and visits them in a random order.

Yes. If you have enough memory, it’s best to set the buffer to the size of the training dataset (sample_buffer_size: -1) so that you get a uniform shuffle.
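The shard arithmetic above can be sketched as follows. This is a rough illustration, assuming contiguous shards of `buffer_size` sentences each; the helper name `shard_visit_order` is hypothetical, and the real sharding logic may differ.

```python
import math
import random
from typing import List, Tuple


def shard_visit_order(dataset_size: int, buffer_size: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) sentence ranges of the shards in a random visit order."""
    num_shards = math.ceil(dataset_size / buffer_size)
    order = list(range(num_shards))
    random.Random(seed).shuffle(order)  # shards are visited in a random order
    return [
        (shard * buffer_size, min((shard + 1) * buffer_size, dataset_size))
        for shard in order
    ]


ranges = shard_visit_order(dataset_size=10_000_000, buffer_size=500_000)
print(len(ranges))  # 20
```

When the buffer is as large as the dataset there is a single shard, so every sentence can land anywhere in the epoch.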

Hi Sasanita, I’m considering this issue now with a fairly large dataset. What shuffle buffer size did you decide on in the end?

Hi @tel34, I just use the default value, sample_buffer_size: 500000.
Regards!

Hi @Sasanita. Just reporting that I also used 500,000 and that there was a significant improvement in BLEU compared with setting sample_buffer_size to null.
Regards,
Terence

Thanks for sharing that, @tel34.
Regards