Where to modify the padding for the encoder decoder input in Transformer

haoran · September 26, 2019, 2:14pm

I’m working on an NMT task and trying to make the source and target language inputs to the encoder and decoder have the same length (usually they do not). Does anybody know where to modify the source code so that source and target sentences have the same length after padding?

francoishernandez · September 27, 2019, 7:11am

Hi,
Padding is handled via torchtext. There are also some mechanisms to reduce padding (and improve training performance), like pool, which retrieves N x batch_size examples, sorts them by length, and builds batches accordingly.
To override this, you may want to have a look in onmt.inputters.inputter, where we actually build and yield torchtext.data.Batch objects: somewhere like OrderedIterator.__iter__ or MultipleDatasetIterator.__iter__ (depending on whether you’re using a single dataset or multiple datasets).
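
For illustration only, here is a minimal sketch of what "same length after padding" could look like on plain lists of token ids. It is independent of OpenNMT-py and torchtext: the helper name pad_to_common_length and the padding index PAD_ID = 1 are assumptions made up for this example, so substitute the padding index from your own vocabulary.

```python
from typing import List, Tuple

PAD_ID = 1  # assumed padding index; use the one from your own vocabulary


def pad_to_common_length(
    src_batch: List[List[int]],
    tgt_batch: List[List[int]],
    pad_id: int = PAD_ID,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every source and target sequence to one shared length."""
    # The shared length is the longest sequence found on either side.
    max_len = max(len(seq) for seq in src_batch + tgt_batch)

    def pad(seq: List[int]) -> List[int]:
        # Append pad_id until the sequence reaches max_len.
        return seq + [pad_id] * (max_len - len(seq))

    return [pad(s) for s in src_batch], [pad(t) for t in tgt_batch]


# Hypothetical token ids for a batch of two sentence pairs.
src = [[5, 6, 7], [8, 9]]
tgt = [[10, 11, 12, 13, 14], [15]]
src_padded, tgt_padded = pad_to_common_length(src, tgt)
```

The same idea would have to be applied at the point where the batch is built, before the source and target tensors are created. Note that padding both sides to a shared length adds padded positions to the shorter side, which still need to be masked as usual.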

haoran · October 5, 2019, 1:40pm

Thanks. Really confused. I gave up…