Quantized models not supported on macOS?

Hey there,

I’ve created a quantized version of a model with:

ct2-transformers-converter --model Falconsai/text_summarization --output_dir models/falcon-text-summarization-quantized --quantization int8

But when I use it in C++, I get this warning (I added the VERBOSE flag):

[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] CPU: ARM (NEON=true)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info]  - Selected ISA: NEON
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info]  - Use Intel MKL: false
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info]  - SGEMM backend: Accelerate (packed: false)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info]  - GEMM_S16 backend: none (packed: false)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info]  - GEMM_S8 backend: none (packed: false, u8s8 preferred: false)
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info] Loaded model models/falcon-text-summarization-quantized on device cpu:0
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info]  - Binary version: 6
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info]  - Model specification revision: 7
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info]  - Selected compute type: float32
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.

Any ideas why? Thanks
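
To dig a little deeper, I also tried a quick standalone check of whether my build has an int8 CPU backend at all. This is an untested sketch from memory; I believe mayiuse_int8() is declared in <ctranslate2/types.h>, but I haven't verified the exact name or header:

#include <iostream>
#include <ctranslate2/devices.h>
#include <ctranslate2/types.h>

int main() {
  // Untested sketch: reports whether this CTranslate2 build can run int8 GEMM on the CPU.
  // mayiuse_int8() and its header are written from memory, so double-check against the sources.
  const bool int8_ok = ctranslate2::mayiuse_int8(ctranslate2::Device::CPU, /*device_index=*/0);
  std::cout << "int8 backend available on CPU: " << std::boolalpha << int8_ok << std::endl;
  return 0;
}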

I compiled CTranslate2 with these flags to get Accelerate support:

-DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_ACCELERATE=ON -DWITH_MKL=OFF -DOPENMP_RUNTIME=NONE

and this is how I call the lib (extract):

  // Requires <ctranslate2/translator.h>, <chrono>, <iomanip> and <iostream>.
  const std::string model_path("models/" + std::string(model));
  const ctranslate2::models::ModelLoader model_loader(model_path);

  ctranslate2::Translator translator(model_loader);

  // Time the translation of one batch.
  auto start = std::chrono::high_resolution_clock::now();
  auto results = translator.translate_batch(text);
  auto end = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> duration = end - start;

  // Concatenate the output tokens of the best hypothesis and count them.
  std::string joinedString;
  int tokens = 0;

  for (const auto& token : results[0].output()) {
    joinedString += token;
    tokens++;
  }

  // Post-process the tokens back into readable text (my own helpers) and print the stats.
  token2Text(joinedString);
  std::cout << cleanup(joinedString) << std::endl;
  std::cout << tokens << " tokens generated (" << std::fixed << std::setprecision(2)
            << (tokens / duration.count()) << " token/s)" << std::endl;
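
(For context, text is the tokenized input batch built earlier; translate_batch() expects a std::vector<std::vector<std::string>> of tokens, one vector per input. It looks something like this, with purely made-up placeholder tokens:)

  // Hypothetical illustration only: the real tokens come from the model's tokenizer.
  const std::vector<std::vector<std::string>> text = {
      {"▁summarize", "▁this", "▁article", "."}  // placeholder subword tokens
  };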

By the way, I am displaying the number of tokens per second; I was wondering whether the way I compute it is correct, or if there is a better way.

Thanks.

Hi,

For int8 on macOS you also need ruy:

-DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_ACCELERATE=ON -DWITH_MKL=OFF -DOPENMP_RUNTIME=NONE -DWITH_RUY=ON

It may be a good idea to also use -DOPENMP_RUNTIME=COMP, but this requires installing and using llvm and libomp with brew.
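
Something roughly along these lines (untested; the paths assume the usual /opt/homebrew prefix on Apple Silicon, adjust for your setup):

brew install llvm libomp
CC=/opt/homebrew/opt/llvm/bin/clang CXX=/opt/homebrew/opt/llvm/bin/clang++ cmake .. -DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_ACCELERATE=ON -DWITH_MKL=OFF -DWITH_RUY=ON -DOPENMP_RUNTIME=COMP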

It worked, thanks so much!