As of today, batches are defined as a fixed number of segments.
Incidentally, this limits the sequence length to whatever number of timesteps fits in memory.
This really becomes an issue when dealing with BPE/subword or character-level modeling, where sequences get much longer.
Some other toolkits instead define batches as a number of tokens.
That way, a batch can hold a few very long sequences or many short ones.
This would not only make it possible to handle much longer sequences, but also optimize memory usage.
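As a rough illustration of the idea (not tied to any particular toolkit's API), token-based batching can be sketched like this: sort sequences by length, then greedily fill each batch while keeping the padded size (`batch_size * longest_sequence`) under a token budget. The function name and logic here are just an assumption of how it could look:

```python
def batch_by_tokens(sequences, max_tokens):
    """Group sequences into batches whose padded size stays under max_tokens.

    The memory cost of a padded batch is batch_size * longest_sequence,
    so sequences are sorted by length first to keep similar lengths together.
    """
    batches = []
    current, max_len = [], 0
    for seq in sorted(sequences, key=len):
        new_max = max(max_len, len(seq))
        # Adding this sequence would exceed the token budget: flush the batch.
        if current and (len(current) + 1) * new_max > max_tokens:
            batches.append(current)
            current = []
            new_max = len(seq)
        current.append(seq)
        max_len = new_max
    if current:
        batches.append(current)
    return batches
```

With a budget of, say, 10 tokens, this yields large batches of short sequences and small batches of long ones, which is exactly the behavior described above.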
PRs welcome! I know…