Hey there,
I’ve created a quantized version of a model with:
ct2-transformers-converter --model Falconsai/text_summarization --output_dir models/falcon-text-summarization-quantized --quantization int8
But when I use it in C++, I get this warning (I added a VERBOSE flag to get the full logs):
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] CPU: ARM (NEON=true)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] - Selected ISA: NEON
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] - Use Intel MKL: false
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] - SGEMM backend: Accelerate (packed: false)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] - GEMM_S16 backend: none (packed: false)
[2023-12-07 17:03:43.040] [ctranslate2] [thread 11693035] [info] - GEMM_S8 backend: none (packed: false, u8s8 preferred: false)
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info] Loaded model models/falcon-text-summarization-quantized on device cpu:0
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info] - Binary version: 6
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info] - Model specification revision: 7
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [info] - Selected compute type: float32
[2023-12-07 17:03:43.093] [ctranslate2] [thread 11693035] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.
Any ideas why? Thanks
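In case it's relevant: I know the loader lets me request a compute type explicitly instead of letting it be inferred. A minimal sketch (assuming ModelLoader::compute_type works the way I read it in the headers):

ctranslate2::models::ModelLoader loader("models/falcon-text-summarization-quantized");
// Explicitly ask for int8 instead of letting the backend infer it
// from the saved model.
loader.compute_type = ctranslate2::ComputeType::INT8;
ctranslate2::Translator translator(loader);

I assume this would hit the same float32 fallback, though, since the log shows "GEMM_S8 backend: none".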
I compiled ctranslate2 with these flags to get Accelerate support:
-DCMAKE_OSX_ARCHITECTURES=arm64 -DWITH_ACCELERATE=ON -DWITH_MKL=OFF -DOPENMP_RUNTIME=NONE
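For reference, the full configure/build sequence was along these lines (the build directory name is incidental):

cmake -S . -B build \
  -DCMAKE_OSX_ARCHITECTURES=arm64 \
  -DWITH_ACCELERATE=ON \
  -DWITH_MKL=OFF \
  -DOPENMP_RUNTIME=NONE
cmake --build build --parallel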
and this is how I call the library (extract):
#include <chrono>
#include <iomanip>
#include <iostream>
#include <string>

#include <ctranslate2/translator.h>

// Load the converted model from disk (defaults to the CPU device).
const std::string model_path("models/" + std::string(model));
const ctranslate2::models::ModelLoader model_loader(model_path);
ctranslate2::Translator translator(model_loader);

// Time the batch translation.
auto start = std::chrono::high_resolution_clock::now();
auto results = translator.translate_batch(text);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;

// Join the output tokens of the first result and count them.
std::string joinedString;
int tokens = 0;
for (const auto& token : results[0].output()) {
    joinedString += token;
    ++tokens;
}

token2Text(joinedString);                         // my helper: detokenize in place
std::cout << cleanup(joinedString) << std::endl;  // my helper: final cleanup
std::cout << tokens << " tokens generated (" << std::fixed << std::setprecision(2)
          << (tokens / duration.count()) << " token/s)" << std::endl;
By the way, since I'm displaying the number of tokens per second, I was wondering whether the way I compute it is correct, or whether there's a better way.
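For example, I was considering something like this instead (std::chrono::steady_clock is guaranteed monotonic, while high_resolution_clock isn't necessarily):

#include <chrono>

// steady_clock can't jump backwards if the system clock is adjusted
// while the translation is running.
const auto start = std::chrono::steady_clock::now();
const auto results = translator.translate_batch(text);
const auto end = std::chrono::steady_clock::now();
const double seconds = std::chrono::duration<double>(end - start).count();
const double tokens_per_second = results[0].output().size() / seconds;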
Thanks.