Deploying CTranslate2 in production

kvasilopoulos · April 18, 2022, 8:23am

Hi Everyone,

I was wondering if there are any resources that explain how ctranslate2 can be deployed for production use and scaled accordingly. I am most interested in microbatching and autoscaling in a k8s environment.

It is not among the ML frameworks most commonly supported by model-serving tools like bentoml/kserve/etc., so I was wondering whether it is a better idea to use ray-serve or something similar that is framework agnostic.

Any insights on that topic?

guillaumekln · April 25, 2022, 8:27am

Hello,

I’m not aware of such resources for CTranslate2 specifically. The API surface of CTranslate2 is relatively simple so it could probably fit in many serving frameworks. Let us know if you encounter any issues during this integration.

Here are some possibly useful features available in CTranslate2:

- A single translation instance can run multiple translations in parallel, on either multiple GPUs or multiple CPU cores (see this documentation).
- Microbatching is usually implemented by the serving framework, but it can also be done with the CTranslate2 C++ API: see buffered_translation_wrapper.h in the OpenNMT/CTranslate2 repository on GitHub (master branch).
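
To make the microbatching idea concrete, here is a minimal, framework-agnostic Python sketch. It is not how CTranslate2 or any serving framework implements it; `MicroBatcher`, `max_wait_s`, and the stand-in `fake_translate_batch` are hypothetical names for illustration. The batcher only assumes a callable that takes a list of inputs and returns a list of results in the same order, which is the shape of `Translator.translate_batch` in the CTranslate2 Python API.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MicroBatcher:
    """Groups single requests into small batches before calling `batch_fn`.

    `batch_fn` must accept a list of inputs and return one result per
    input, in the same order.
    """

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self._batch_fn = batch_fn
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Enqueue one request and return a Future for its result."""
        future = Future()
        self._queue.put((item, future))
        return future

    def close(self):
        """Stop the worker after it has processed the queued requests."""
        self._queue.put(None)
        self._worker.join()

    def _run(self):
        while True:
            first = self._queue.get()
            if first is None:
                return
            batch = [first]
            stop = False
            # Wait at most max_wait_s for more requests to fill the batch.
            deadline = time.monotonic() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    entry = self._queue.get(timeout=remaining)
                except queue.Empty:
                    break
                if entry is None:
                    stop = True
                    break
                batch.append(entry)
            items = [item for item, _ in batch]
            futures = [future for _, future in batch]
            try:
                results = self._batch_fn(items)
                for future, result in zip(futures, results):
                    future.set_result(result)
            except Exception as exc:
                # Propagate a batch failure to every waiting caller.
                for future in futures:
                    future.set_exception(exc)
            if stop:
                return


def fake_translate_batch(batch):
    # Stand-in for a real model call: upper-cases every token.
    return [[token.upper() for token in tokens] for tokens in batch]


if __name__ == "__main__":
    batcher = MicroBatcher(fake_translate_batch, max_batch_size=4, max_wait_s=0.05)
    futures = [batcher.submit(["hello", "world"]) for _ in range(6)]
    print([f.result() for f in futures])
    batcher.close()
```

With a real model, the stand-in could be swapped for something like `lambda batch: translator.translate_batch(batch)`, where `translator = ctranslate2.Translator("model_dir/", inter_threads=4)` and `"model_dir/"` is a placeholder path. The `inter_threads` argument is the setting related to running several translations in parallel on CPU; the values shown here are assumptions, not recommendations.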