https://twitter.com/abacaj/status/1667679842416881664
I feel like not enough people mention this, but CTranslate2 (ct2) can quantize and run flan-t5, Falcon, and MPT models. Using half the memory at int8, it is more than 2x faster than Hugging Face Transformers at fp16, and significantly faster than Transformers with load_in_8bit.
CTranslate2 flan-t5-xxl (int8):
- 13.68 ms/token
- 12 GB memory

Hugging Face flan-t5-xxl (fp16):
- 30.5 ms/token
- 22.5 GB memory
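
For anyone who wants to reproduce the setup, here is a minimal sketch of the convert-then-run flow with CTranslate2's T5 support. The output directory name and the prompt are illustrative; int8 matches the quantization the numbers above refer to:

```python
# One-time conversion of the Hugging Face checkpoint to CTranslate2
# format with int8 weights (CLI ships with the ctranslate2 package):
#   ct2-transformers-converter --model google/flan-t5-xxl \
#       --quantization int8 --output_dir flan-t5-xxl-ct2

import ctranslate2
import transformers

# flan-t5 is an encoder-decoder model, so CTranslate2 exposes it
# through the Translator class (decoder-only models use Generator).
translator = ctranslate2.Translator("flan-t5-xxl-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xxl")

prompt = "Translate to German: The house is wonderful."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# translate_batch runs the quantized model; hypotheses come back
# as token strings, so decode them through the tokenizer again.
results = translator.translate_batch([tokens])
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```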