Need support for continuous batching of incoming requests

karthiktheking · September 8, 2023, 11:01am

We need support for continuous batching, which promises to improve the performance of the decoder in encoder-decoder models and of decoder-only models.

guillaumekln · September 19, 2023, 8:09am

If you are interested in this feature for CTranslate2, this is an open issue:

github.com/OpenNMT/CTranslate2

Continuous batching

opened 11:58PM - 06 Jul 23 UTC

andreapiso

enhancement

Recently, a lot of benchmarks point to the fact that if you want to serve your models behind an API, continuous batching grants higher throughput and lower latency compared to static batching.

Some examples of systems that implement continuous batching:

- text-generation-inference from Hugging Face: https://github.com/huggingface/text-generation-inference
- vLLM (which also includes an inference engine): https://github.com/vllm-project/vllm
- Ray, from the upcoming 2.6 version

In order to enable continuous batching, it is necessary to be able to:

1) add requests to an existing running batch, if there are enough resources to take them (compared to static batching, where requests need to be submitted all together)
2) remove a request early from the batch when it reaches the stop token (as opposed to returning all requests at the same time).

Is this concept compatible with the CTranslate2 architecture? I am keen to build an inference engine on top of CTranslate2 and would love to hear some thoughts on this before I dive deep into it.
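
For readers new to the idea, the two requirements above can be sketched as a toy scheduling loop. This is only an illustration in plain Python and does not use or reflect the CTranslate2 API: `Request`, `toy_decode_step`, `continuous_batching`, and the `EOS` value are all hypothetical names, and the "model" simply emits the stop token after a fixed number of steps per request.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List

EOS = -1  # hypothetical stop-token id used by this toy example


@dataclass
class Request:
    req_id: str
    remaining: int  # toy stand-in: decoding steps left until EOS is emitted
    tokens: List[int] = field(default_factory=list)


def toy_decode_step(batch: List[Request]) -> List[int]:
    """Hypothetical decoding step: returns one token per active request."""
    out = []
    for req in batch:
        req.remaining -= 1
        # Emit EOS once the request has used up its steps, else a dummy token.
        out.append(EOS if req.remaining <= 0 else len(req.tokens))
    return out


def continuous_batching(
    incoming: List[Request],
    max_batch_size: int,
    decode_step: Callable[[List[Request]], List[int]] = toy_decode_step,
) -> Dict[str, int]:
    """Run all requests and return the step at which each one finished."""
    queue = deque(incoming)
    active: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while queue or active:
        # 1) Admit waiting requests into the running batch while there is room.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        step += 1
        next_tokens = decode_step(active)
        # 2) Remove a request from the batch as soon as it hits the stop token.
        still_running = []
        for req, token in zip(active, next_tokens):
            if token == EOS:
                finished_at[req.req_id] = step
            else:
                req.tokens.append(token)
                still_running.append(req)
        active = still_running
    return finished_at


requests = [Request("a", 2), Request("b", 5), Request("c", 3)]
print(continuous_batching(requests, max_batch_size=2))
```

With a batch size of 2, request `c` takes over the slot freed by `a` at step 3 and finishes at step 5. Under static batching it would have to wait for `b` to finish first and would only complete at step 8. A real engine also has to manage per-request state such as the key/value cache, which this sketch ignores.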