I’m working on an NMT task and trying to make the source and target language inputs to the encoder and decoder have the same length (usually they do not). Does anybody know where to modify the source code so that source and target sentences have the same length after padding?
Padding is handled via `torchtext`. There are also some mechanisms to reduce padding (and improve training performance), like `pool`, which retrieves N x batch_size examples, sorts them by length, and builds batches accordingly.
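To illustrate the idea (this is a standalone sketch, not the actual OpenNMT-py/torchtext implementation; the function name and `pool_factor` parameter are made up for the example):

```python
import random

def pool_batches(examples, batch_size, pool_factor=100):
    """Sketch of the 'pool' idea: load pool_factor * batch_size examples,
    sort them by length, then slice off batches of similar-length
    sentences so each batch needs minimal padding."""
    for i in range(0, len(examples), pool_factor * batch_size):
        # Sort one pool of examples by sentence length
        pool = sorted(examples[i:i + pool_factor * batch_size], key=len)
        # Consecutive slices now contain sentences of similar length
        batches = [pool[j:j + batch_size]
                   for j in range(0, len(pool), batch_size)]
        # Shuffle so training doesn't always see shortest batches first
        random.shuffle(batches)
        yield from batches

# Sentences of varying length end up grouped with similar lengths:
sents = [["w"] * n for n in [5, 2, 9, 3, 8, 2, 7, 4]]
for batch in pool_batches(sents, batch_size=2, pool_factor=4):
    print([len(s) for s in batch])
```

Within each batch, the length spread (and hence the padding) is much smaller than if batches were drawn in corpus order.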
To override this, you may want to have a look at `onmt.inputters.inputter`, where we actually build and yield `torchtext.data.Batch` objects: somewhere like `MultipleDatasetIterator.__iter__` (depending on whether you’re using single or multiple datasets).
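If the goal is simply to pad each batch’s source and target sentences to one shared length, here is a minimal standalone sketch of that logic (not OpenNMT-py code; `<blank>` is assumed as the padding token, which you should match to your vocabulary’s actual pad symbol):

```python
def pad_to_same_length(src_batch, tgt_batch, pad_token="<blank>"):
    """Pad every source and target sentence in a batch to one shared
    maximum length, so encoder and decoder inputs line up."""
    # Shared maximum over BOTH sides, not per-side maxima
    max_len = max(len(s) for s in src_batch + tgt_batch)

    def pad(sent):
        return sent + [pad_token] * (max_len - len(sent))

    return [pad(s) for s in src_batch], [pad(t) for t in tgt_batch]

src = [["a", "b", "c"], ["d"]]
tgt = [["x", "y"], ["u", "v", "w", "z"]]
src_p, tgt_p = pad_to_same_length(src, tgt)
# every sentence in src_p and tgt_p now has length 4
```

You would apply something equivalent at the point where batches are built, rather than letting each side be padded to its own maximum.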
Thanks. I’m still really confused, though. I gave up…