CTranslate2 currently supports 8-bit integer quantization (INT8) and 16-bit floating point (FP16/BF16), but does not yet provide native support for 4-bit quantization. Do the CTranslate2 developers plan to add lower-bit quantization operators in the future? I would also like to contribute to developing these operators.
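For reference, here is a minimal sketch of how a compute type is selected in the current Python API (the model path is a placeholder). The supported `compute_type` values today include `"int8"`, `"int8_float16"`, `"int8_bfloat16"`, `"float16"`, and `"bfloat16"`; there is no 4-bit option:

```python
import ctranslate2

# Load a converted model with 8-bit integer quantization.
# "path/to/ct2_model" is a hypothetical model directory produced
# by the ct2 converters.
translator = ctranslate2.Translator(
    "path/to/ct2_model",
    device="cuda",
    compute_type="int8",
)
```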
You need 4-bit quantization to make LLMs run faster on CTranslate2, right?
Instead of that, we build small language models (~110 MB) for every language pair (for example English → German, French → English) and run them with 8-bit quantization. This approach translates about 30,000 characters/second on an RTX 3090 GPU and lets us deploy 40 languages at once.
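A rough sketch of what serving one such per-pair model looks like, assuming a SentencePiece tokenizer shipped alongside the converted model (the `en_de/` paths are hypothetical):

```python
import ctranslate2
import sentencepiece as spm

# One small English -> German model, loaded with INT8 quantization.
sp = spm.SentencePieceProcessor("en_de/sentencepiece.model")
translator = ctranslate2.Translator(
    "en_de/ct2_model",
    device="cuda",
    compute_type="int8",
)

# Tokenize, translate a batch, and detokenize the best hypothesis.
tokens = sp.encode("Hello world!", out_type=str)
results = translator.translate_batch([tokens])
print(sp.decode(results[0].hypotheses[0]))
```

Each language pair gets its own translator instance like this; since each model is only ~110 MB at INT8, dozens of them fit on a single GPU.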