Independent CTranslate2 benchmarking

I feel like not enough people mention this, but CTranslate2 (ct2) can quantize and run flan-t5, Falcon, and MPT models. At int8 it uses half the memory and is more than 2x faster than HF Transformers at fp16, and significantly faster than HF Transformers with load_in_8bit.

CTranslate2 flan-t5-xxl

  • 13.68 ms/token
  • 12 GB memory

Hugging Face flan-t5-xxl

  • 30.5 ms/token
  • 22.5 GB memory
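
For anyone who wants to reproduce the ct2 side of this, the quantized model is typically produced with the stock converter that ships with the ctranslate2 package. This is a sketch; the output directory name is arbitrary and the exact flags I used may differ:

```shell
# Convert flan-t5-xxl to the CTranslate2 format with int8 weights.
pip install ctranslate2 transformers sentencepiece

ct2-transformers-converter \
    --model google/flan-t5-xxl \
    --output_dir flan-t5-xxl-ct2 \
    --quantization int8
```

Since T5 is an encoder-decoder model, the converted directory is then loaded with ctranslate2.Translator for inference.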

I am trying to run falcon-7B using CTranslate2 and the OpenNMT-py server. I converted the model successfully, but I can't figure out the config.json file for the OpenNMT-py server.
What did you use for running the inference on the falcon model?
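
For reference, the conversion step for Falcon goes through the same converter. This is a sketch of what I assume was run; the checkpoint name is the public tiiuae one, and the --trust_remote_code flag is my assumption, since Falcon ships custom modeling code:

```shell
ct2-transformers-converter \
    --model tiiuae/falcon-7b \
    --output_dir falcon-7b-ct2 \
    --quantization int8 \
    --trust_remote_code
```

Note that decoder-only models like Falcon are served through ctranslate2.Generator rather than Translator, which may matter for how the server config is written.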