Preprocess: shuffling and sorting

Etienne38 · February 7, 2017, 1:36pm

What does ONMT really do with the training data? The log seems to report a sorting by size. Are the sentences always trained in order of size?

guillaumekln · February 7, 2017, 1:56pm

This process comes from 2 constraints at the batch level:

shuffling: sentences within a batch should come from different parts of the corpus
sorting: sentences within a batch should have the same source length (i.e. no padding is needed on the source side)

Then, during training, the batches are visited in a random order, so the whole corpus is seen in a random order. It just happens that all sentences within a given batch have the same source length.
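
To make the two constraints concrete, here is a minimal Python sketch of the idea. It is not the actual ONMT preprocessing code: the function names make_batches and epoch_order, the seed parameter, and the representation of the corpus as a list of (source_tokens, target_tokens) pairs are all assumptions made for illustration.

```python
import random
from itertools import groupby


def make_batches(corpus, batch_size, seed=0):
    """Group (source_tokens, target_tokens) pairs into same-length batches.

    Hypothetical helper for illustration, not an ONMT function.
    """
    rng = random.Random(seed)
    examples = list(corpus)

    # Shuffling: mix the corpus so that sentences ending up in the same
    # batch come from different parts of the original data.
    rng.shuffle(examples)

    # Sorting: Python's sort is stable, so sentences with equal source
    # length keep the shuffled order they had before the sort.
    examples.sort(key=lambda pair: len(pair[0]))

    batches = []
    for _, group in groupby(examples, key=lambda pair: len(pair[0])):
        group = list(group)
        # Cut each length group into batches; a batch never mixes
        # source lengths, so no source-side padding is needed.
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


def epoch_order(batches, rng):
    """Return the batches in a fresh random order for one epoch."""
    order = list(range(len(batches)))
    rng.shuffle(order)
    return [batches[i] for i in order]
```

Calling make_batches once and then epoch_order at the start of each epoch gives the behaviour described above: the sort only decides which sentences share a batch, not the order in which the model sees them.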