As of today, batches are defined as a number of segments.
Incidentally, this limits the sequence length to the number of timesteps that fits in memory.
It really becomes an issue when dealing with BPE/subwords or character-level modeling.
Some other toolkits define batches as a number of tokens instead.
That way, a batch can hold a small number of very long sequences or a large number of very short ones.
It would not only make it possible to take much longer sequences into account but also to optimize memory usage.
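A minimal sketch of the idea, assuming a plain Python helper (hypothetical, not an OpenNMT-tf API): each batch is filled up to a fixed token budget, so long sequences end up in small batches and short ones in large batches.

```python
def token_batches(sequences, max_tokens=4096):
    """Yield batches whose total token count stays under max_tokens."""
    batch, batch_tokens = [], 0
    for seq in sorted(sequences, key=len):  # group by length to limit padding
        if batch and batch_tokens + len(seq) > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(seq)
        batch_tokens += len(seq)
    if batch:
        yield batch
```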
For OpenNMT-tf, I don't know if it is a good solution, but how about letting users decide whether to use swap_memory, which is a parameter of tf.contrib.seq2seq.dynamic_decode, tf.nn.dynamic_rnn, etc.?
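If it were exposed as an option, here is a minimal sketch of what enabling it could look like with tf.nn.dynamic_rnn in TF 1.x (the placeholder shapes and cell size below are just examples):

```python
import tensorflow as tf  # TF 1.x, same generation as the tf.contrib APIs above

inputs = tf.placeholder(tf.float32, shape=[None, None, 512])  # [batch, time, depth]
lengths = tf.placeholder(tf.int32, shape=[None])
cell = tf.nn.rnn_cell.LSTMCell(512)

# swap_memory=True lets the internal while_loop swap activations to host memory,
# trading some speed for the ability to unroll over much longer sequences.
outputs, state = tf.nn.dynamic_rnn(
    cell, inputs,
    sequence_length=lengths,
    dtype=tf.float32,
    swap_memory=True)
```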
When doing character-level training, I could increase the batch size a little (10→20), but the host memory occupation was a disaster: about 90GB VSZ and 40GB RSS.