https://twitter.com/abacaj/status/1667679842416881664
I feel like not enough people mention this, but CTranslate2 (ct2) can quantize and run flan-t5, Falcon, and MPT models. Using half the memory at int8, it is more than 2x faster than Hugging Face Transformers at fp16, and significantly faster than Transformers with load_in_8bit.
CTranslate2 flan-t5-xxl (int8):
- 13.68 ms/token
- 12 GB memory

Hugging Face flan-t5-xxl (fp16):
- 30.5 ms/token
- 22.5 GB memory
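
For anyone who wants to reproduce the setup, here is a minimal sketch of the convert-then-run flow with CTranslate2's T5 support. The output directory name and the prompt are illustrative; int8 matches the quantization the numbers above refer to:

```python
# One-time conversion of the Hugging Face checkpoint to CTranslate2
# format with int8 weights (CLI ships with the ctranslate2 package):
#   ct2-transformers-converter --model google/flan-t5-xxl \
#       --quantization int8 --output_dir flan-t5-xxl-ct2

import ctranslate2
import transformers

# flan-t5 is an encoder-decoder model, so CTranslate2 exposes it
# through the Translator class (decoder-only models use Generator).
translator = ctranslate2.Translator("flan-t5-xxl-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xxl")

prompt = "Translate to German: The house is wonderful."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# translate_batch runs the quantized model; hypotheses come back
# as token strings, so decode them through the tokenizer again.
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```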