Sorted vs. unsorted corpus for training a model

Hi there! Does it matter whether I train a model on a corpus sorted by length or on an unsorted one? For example, if I train an English-Russian model on data sorted by length and alphabetically, will I get worse results than when training on shuffled data? Thanks in advance!

Hi,

If you are training a Transformer model with --auto_config, the full training data is loaded in memory and shuffled. So it does not matter whether the corpus is sorted or not.

In other cases, it can make a difference because the corpus may be loaded in multiple shards, in which case the shuffling no longer spans the whole corpus.
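
To illustrate why sharding matters for a sorted corpus, here is a minimal Python sketch. It is not the toolkit's actual loading code: `shard_shuffle`, `full_shuffle`, and the shard size of 100 are hypothetical names and values chosen for the example. It shuffles a length-sorted toy corpus only within fixed-size shards and compares that with a shuffle over the full data.

```python
import random

def shard_shuffle(examples, shard_size, seed=0):
    """Shuffle only within consecutive shards of `shard_size` examples (hypothetical helper)."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(examples), shard_size):
        shard = examples[start:start + shard_size]  # slicing copies the shard
        rng.shuffle(shard)
        out.extend(shard)
    return out

def full_shuffle(examples, seed=0):
    """Shuffle the whole corpus at once (hypothetical helper)."""
    rng = random.Random(seed)
    out = list(examples)
    rng.shuffle(out)
    return out

# Toy corpus sorted by length: one "sentence" for each length from 1 to 1000.
sorted_lengths = list(range(1, 1001))

sharded = shard_shuffle(sorted_lengths, shard_size=100)
shuffled = full_shuffle(sorted_lengths)

# With per-shard shuffling, the first 100 examples are still the 100 shortest ones.
print(max(sharded[:100]))   # 100
# With a full shuffle, long sentences show up early as well.
print(max(shuffled[:100]))
```

With a sorted input, the sharded version still feeds the model short sentences first and long ones last, which is the situation a full shuffle avoids.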

In general, you should not sort the training corpus by length. If you want to do that to improve training speed, note that training already does some length bucketing while still ensuring proper shuffling.
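
The idea of combining shuffling with length bucketing can be sketched as follows. This is a simplified illustration and not the toolkit's implementation: `bucket_batches`, `bucket_width`, and `batch_size` are hypothetical names. Examples are visited in shuffled order and grouped by length range, so each batch holds sentences of similar length while the order of examples stays random.

```python
import random
from collections import defaultdict

def bucket_batches(lengths, bucket_width, batch_size, seed=0):
    """Group shuffled example indices into batches of similar length (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(len(lengths)))
    rng.shuffle(order)  # shuffle first, so examples reach the buckets in random order

    buckets = defaultdict(list)
    batches = []
    for idx in order:
        # Lengths 1..bucket_width go to bucket 0, the next range to bucket 1, and so on.
        key = (lengths[idx] - 1) // bucket_width
        buckets[key].append(idx)
        if len(buckets[key]) == batch_size:
            batches.append(buckets.pop(key))  # emit a full batch and reset the bucket
    # Emit the partially filled buckets that remain at the end.
    batches.extend(b for b in buckets.values() if b)
    return batches

# Toy corpus: 1000 sentences with random lengths between 1 and 50.
length_rng = random.Random(1)
lengths = [length_rng.randint(1, 50) for _ in range(1000)]

batches = bucket_batches(lengths, bucket_width=5, batch_size=32)

# Largest length difference inside any single batch; it stays below bucket_width.
spread = max(max(lengths[i] for i in b) - min(lengths[i] for i in b) for b in batches)
print(spread)
```

Because padding is bounded by the bucket width, this gives most of the speed benefit of a sorted corpus without giving up shuffling.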

Many thanks for answering! I’ll use sample_buffer_size = -1.
