Preprocess: shuffling and sorting

What is ONMT really doing with the training data? The log seems to report sorting by size. Are the sentences always trained on in order of size?

This process comes from 2 constraints at the batch level:

  • shuffling: sentences within a batch should come from different parts of the corpus
  • sorting: sentences within a batch should have the same source length (so no padding is needed on the source side)

Then, during training, the batches are visited in random order, so the whole corpus is still seen in random order. It just happens that the sentences within each batch share the same source length.
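
To make the idea concrete, here is a rough Python sketch of the scheme described above. It is not ONMT's actual code: the function names (`make_batches`, `epoch_order`), the `batch_size` and `seed` parameters, and the data layout (a list of `(source_tokens, target_tokens)` pairs) are all assumptions made for illustration.

```python
import random


def make_batches(pairs, batch_size, seed=0):
    """Group (source_tokens, target_tokens) pairs into same-source-length batches.

    Hypothetical helper for illustration only, not an ONMT function.
    """
    rng = random.Random(seed)

    # 1. Shuffle, so sentences that were neighbours in the corpus are spread apart.
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # 2. Sort by source length. Python's sort is stable, so sentences of equal
    #    length keep their shuffled relative order.
    shuffled.sort(key=lambda pair: len(pair[0]))

    # 3. Cut into batches, starting a new one when the current batch is full
    #    or when the source length changes (so no source padding is needed).
    batches = []
    current = []
    for pair in shuffled:
        full = len(current) == batch_size
        length_changed = bool(current) and len(pair[0]) != len(current[0][0])
        if current and (full or length_changed):
            batches.append(current)
            current = []
        current.append(pair)
    if current:
        batches.append(current)
    return batches


def epoch_order(batches, seed=0):
    """Return the batches in a random order for one training epoch."""
    rng = random.Random(seed)
    order = list(batches)
    rng.shuffle(order)
    return order
```

With something like `epoch_order(make_batches(corpus, 64))`, a short-sentence batch can be followed by a long-sentence batch and so on, so training is not ordered by size even though each individual batch is uniform in source length.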
