I’m working on an NMT task and trying to make the source and target language inputs to the encoder and decoder have the same length (usually they do not). Does anybody know where to modify the source code so that source and target sentences end up with the same length after padding?
Hi,
Padding is handled via torchtext. There are also some mechanisms to reduce padding (and improve training performance), like pool, which retrieves N x batch_size examples, sorts them by length, and builds batches accordingly.
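To make the pooling idea concrete, here is a minimal sketch of that mechanism, not OpenNMT's actual code: the function name `pool_batches`, the `pool_factor` parameter, and the assumption that each example is a dict with a `"src"` token list are all illustrative.

```python
import random

def pool_batches(examples, batch_size, pool_factor=100):
    """Yield batches that are locally sorted by source length to reduce padding.

    `examples` is assumed to be a list of dicts with a "src" token list
    (an illustrative assumption, not the real OpenNMT data structure).
    """
    pool_size = batch_size * pool_factor
    for i in range(0, len(examples), pool_size):
        # Grab a large pool, sort it by length so similar lengths end up together.
        pool = sorted(examples[i:i + pool_size], key=lambda ex: len(ex["src"]))
        # Cut the sorted pool into batches, then shuffle the batches themselves
        # so training still sees them in a roughly random order.
        batches = [pool[j:j + batch_size] for j in range(0, len(pool), batch_size)]
        random.shuffle(batches)
        for batch in batches:
            yield batch
```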
To override this, you may want to have a look at onmt.inputters.inputter, where we actually build and yield torchtext.data.Batch objects: somewhere like OrderedIterator.__iter__ or MultipleDatasetIterator.__iter__ (depending on whether you’re using a single dataset or multiple datasets).
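If the goal is just to force source and target to the same padded length inside each batch, something along these lines could be applied where the batch tensors are built. This is a hedged sketch, not OpenNMT internals: it assumes you already have lists of token-id tensors per side and know the pad index; the helper name `pad_to_same_length` is made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_to_same_length(src_seqs, tgt_seqs, pad_idx):
    """Pad both sides of a batch to the same length (the longer of the two).

    src_seqs / tgt_seqs: lists of 1-D LongTensors of token ids (assumed input format).
    """
    # Pad each side to its own batch maximum first.
    src = pad_sequence(src_seqs, batch_first=True, padding_value=pad_idx)
    tgt = pad_sequence(tgt_seqs, batch_first=True, padding_value=pad_idx)
    # Then extend the shorter side with extra pad tokens on the right.
    max_len = max(src.size(1), tgt.size(1))
    src = torch.nn.functional.pad(src, (0, max_len - src.size(1)), value=pad_idx)
    tgt = torch.nn.functional.pad(tgt, (0, max_len - tgt.size(1)), value=pad_idx)
    return src, tgt
```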
Thanks. I’m really confused. I gave up…