We just released version 3.0 of CTranslate2! Here’s an overview of the main changes:
First speech-to-text model: Whisper
The main highlight of this version is the integration of the Whisper speech-to-text model that was published by OpenAI a few weeks ago.
Its architecture is very similar to a text-to-text Transformer model, but it uses Conv1D layers to transform the audio features. On GPU, Conv1D layers are implemented using cuDNN, which is a new optional dependency.
The current implementation already supports many CTranslate2 features and optimizations such as quantization, asynchronous execution, decoding with random sampling, etc.
See a conversion and usage example in the Transformers guide.
Removal of the decoding options normalize_scores and allow_early_exit
These options are removed to provide a better default behavior and improve consistency with other frameworks.
- The scores are now always divided by pow(length, length_penalty), with length_penalty defaulting to 1
- The beam search will exit early only when no penalties are used
The outputs are expected to be slightly different following this change.
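To make the new score normalization concrete, here is a minimal sketch of the division described above. The function name and the example hypotheses are made up for illustration, and how CTranslate2 counts the length internally (for example, whether the end token is included) is an assumption not covered here.

```python
import math


def normalize_score(cumulative_log_prob: float, length: int, length_penalty: float = 1.0) -> float:
    """Divide a hypothesis score by pow(length, length_penalty).

    Hypothetical helper for illustration only; it is not part of the
    CTranslate2 API.
    """
    return cumulative_log_prob / math.pow(length, length_penalty)


# Two made-up beam hypotheses: (cumulative log probability, length in tokens).
short_hyp = (-4.0, 4)
long_hyp = (-6.0, 8)

# With length_penalty=0 the divisor is 1, so the raw scores are compared
# and the shorter hypothesis ranks first.
raw_short = normalize_score(*short_hyp, length_penalty=0.0)
raw_long = normalize_score(*long_hyp, length_penalty=0.0)

# With the default length_penalty=1 the scores become per-token averages
# and the longer hypothesis ranks first.
norm_short = normalize_score(*short_hyp)
norm_long = normalize_score(*long_hyp)

print(raw_short, raw_long)
print(norm_short, norm_long)
```

This is why outputs can change slightly after upgrading: a ranking that previously used raw scores may now favor a longer hypothesis.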
Compatibility with OpenNMT-py V3 checkpoints
As mentioned in OpenNMT-py v3.0 is out!, the latest OpenNMT-py version changed how the vocabularies are saved in the checkpoints. The CTranslate2 converter has been updated accordingly while still supporting older checkpoints.
New config.json file in the model directory
Newly converted models will now include an additional configuration file: config.json. This file is meant to contain non-structural model parameters such as parameters related to the input and the vocabulary, for example:
{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "<s>",
  "decoder_start_token": "</s>",
  "eos_token": "</s>",
  "unk_token": "<unk>"
}
In the future, the file could contain other useful information about the model and even set the default translation/generation options to use for this model.
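Since config.json is plain JSON, it can be inspected with the standard library. The sketch below is not a CTranslate2 API: load_model_config is a hypothetical helper, and falling back to an empty dictionary for models converted before 3.0 is an assumption about how you might want to handle a missing file.

```python
import json
import os
import tempfile


def load_model_config(model_dir: str) -> dict:
    """Return the parsed config.json, or an empty dict if the file is absent.

    Hypothetical helper for illustration; the empty-dict fallback is an
    assumption for models that do not include config.json.
    """
    path = os.path.join(model_dir, "config.json")
    if not os.path.isfile(path):
        return {}
    with open(path, encoding="utf-8") as config_file:
        return json.load(config_file)


# A temporary directory stands in for a converted model directory.
with tempfile.TemporaryDirectory() as model_dir:
    example = {
        "add_source_bos": False,
        "add_source_eos": False,
        "bos_token": "<s>",
        "decoder_start_token": "</s>",
        "eos_token": "</s>",
        "unk_token": "<unk>",
    }
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump(example, f)

    config = load_model_config(model_dir)
    print(config["eos_token"], config["add_source_bos"])
```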
Passing and returning N-dimensional arrays in Python
The Python module exposes a new StorageView class which is used to pass or return N-dimensional arrays, for example:
- to pass audio features to the Whisper model
- to return the full LM output logits from the new method Generator.forward_batch
The object implements the Array Interface and the CUDA Array Interface, meaning you can convert arrays from or to NumPy and PyTorch without copying the underlying data. See an example in the class documentation.
This major version also comes with other breaking changes that should not impact most usages. See the full release notes for more details.