Sorted vs. unsorted corpus for training a model

Dmitry · May 29, 2023, 10:36am

Hi there! Does it matter whether a model is trained on a corpus sorted by length or on an unsorted one? For example, if I train an English-Russian model on data sorted by length and alphabetically, will I get worse results than if I train it on shuffled data? Thanks in advance!

guillaumekln · May 31, 2023, 11:08am

Hi,

If you are training a Transformer model with --auto_config, the full training data is loaded in memory and shuffled. So it does not matter whether the corpus is sorted or not.

In other cases, it can make a difference because the corpus may be loaded in multiple shards.

In general, you should not sort the training corpus by length. If the goal is to improve training speed, note that training already does some length bucketing while still ensuring proper shuffling.
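To illustrate how length bucketing and shuffling can coexist, here is a minimal pure-Python sketch. It is not the actual data pipeline of the training code. The function name `shuffled_length_buckets` and the parameters `bucket_width` and `batch_size` are made up for this example, and sentence length is simply counted in whitespace-separated tokens.

```python
import random
from collections import defaultdict


def shuffled_length_buckets(sentences, bucket_width=5, batch_size=4, seed=0):
    """Shuffle the corpus, then batch together sentences of similar length.

    Illustrative only: names and defaults are assumptions for this sketch.
    """
    rng = random.Random(seed)
    examples = list(sentences)
    # Global shuffle first, so the original order of the corpus is discarded.
    rng.shuffle(examples)

    buckets = defaultdict(list)
    batches = []
    for sentence in examples:
        length = len(sentence.split())
        # Sentences whose lengths fall in the same range share a bucket.
        bucket = buckets[length // bucket_width]
        bucket.append(sentence)
        # Emit a batch as soon as a bucket is full; because the stream is
        # shuffled, batches of different lengths come out interleaved.
        if len(bucket) == batch_size:
            batches.append(list(bucket))
            bucket.clear()

    # Flush the incomplete buckets left at the end of the stream.
    for bucket_id in sorted(buckets):
        if buckets[bucket_id]:
            batches.append(list(buckets[bucket_id]))
    return batches


if __name__ == "__main__":
    # A toy corpus sorted by length: 1 token, 2 tokens, ..., 40 tokens.
    corpus = [" ".join(["w"] * n) for n in range(1, 41)]
    for batch in shuffled_length_buckets(corpus):
        print([len(s.split()) for s in batch])
```

Each batch only holds sentences of similar length, which keeps padding low, yet the order in which sentences are seen comes from the shuffle and not from the corpus file. Sorting the corpus beforehand therefore adds nothing in this setup.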

Dmitry · May 31, 2023, 1:48pm

Many thanks for answering! I’ll use sample_buffer_size = -1.