I’d like to post an update about recent additions to CTranslate2. The last similar post was in June 2021 when we released version 2.0, so I figured it would be interesting to summarize the latest changes. It turns out there are many!
In addition to Linux and macOS, we now also build and publish Python wheels for Windows and Linux ARM. The Windows wheels have the same level of features as the Linux ones, including GPU support.
By design, CTranslate2 can run models trained by different NMT toolkits. We added a converter for Marian, in particular to make the 1000+ pretrained models from OPUS-MT usable in CTranslate2. The conversion is easy and does not require additional dependencies:
ct2-opus-mt-converter --model_dir opus_model --output_dir ct2_model
(If you are using one of these pretrained models in your application, make sure to respect their CC BY 4.0 license.)
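If you want to convert several OPUS-MT models in one go, the command above can be driven from a small Python script. This is only a sketch: the helper names build_converter_command and convert_models are made up for illustration, and it assumes that the ct2-opus-mt-converter executable is on the PATH and that each downloaded model sits in its own subdirectory.

```python
import subprocess
from pathlib import Path


def build_converter_command(model_dir, output_dir):
    """Build the argument list for the converter call shown above."""
    return [
        "ct2-opus-mt-converter",
        "--model_dir", str(model_dir),
        "--output_dir", str(output_dir),
    ]


def convert_models(models_root, output_root):
    """Convert every model directory under models_root (assumed layout)."""
    converted = []
    for model_dir in sorted(Path(models_root).iterdir()):
        if not model_dir.is_dir():
            continue
        output_dir = Path(output_root) / model_dir.name
        # check=True raises CalledProcessError if the converter exits with an error.
        subprocess.run(build_converter_command(model_dir, output_dir), check=True)
        converted.append(output_dir)
    return converted
```

Calling convert_models("opus_models", "ct2_models") would then run the converter once per subdirectory of opus_models (both directory names are placeholders).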
This work also enables proper performance benchmarks with Marian as we can compare the translation speed using exactly the same model. See the benchmark table for a comparison (tl;dr: CTranslate2 is faster and uses much less memory).
Scoring is sometimes used in the model training procedure, for example to filter the training data, so there are benefits in making this task fast. We added the methods
Models trained with source factors are now supported, since this was requested multiple times. To simplify their integration, the source factors should be attached to the tokens directly, using the ￨ separator. There is no separate argument.
translator.translate_batch([["hello￨C", "world￨L", "!￨N"]])
In a text file:
hello￨C world￨L !￨N
There is an error if the number of factors does not match what the model expects.
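If your tokens and factors live in separate lists, a small helper can build the expected input. This is a minimal sketch for a single factor per token: attach_factors and to_text_line are hypothetical names, not part of CTranslate2, and the length check here is only a local sanity check, separate from the validation done by the model.

```python
FACTOR_SEPARATOR = "￨"  # the separator used in the examples above


def attach_factors(tokens, factors):
    """Return tokens with one factor attached to each, e.g. 'hello￨C'."""
    if len(tokens) != len(factors):
        raise ValueError(
            "got %d tokens but %d factors" % (len(tokens), len(factors))
        )
    return [token + FACTOR_SEPARATOR + factor for token, factor in zip(tokens, factors)]


def to_text_line(tokens, factors):
    """Format the same input as one line of a tokenized text file."""
    return " ".join(attach_factors(tokens, factors))
```

The output of attach_factors can be used as one example in the batch passed to translate_batch, and to_text_line produces the equivalent line for file-based translation.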
You can now run asynchronous translations from Python. In this mode, Translator.translate_batch returns immediately and you can retrieve the results later:
async_results = translator.translate_batch(batch, asynchronous=True)
async_results[0].result() # translate_batch returns one async result per example; result() blocks until it is available.
Asynchronous translation is also one way to benefit from inter_threads or multi-GPU parallelism.
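To make the pattern concrete, here is a minimal sketch of submitting a batch, doing some other work, and then collecting the results. It assumes that translate_batch(..., asynchronous=True) returns one async result per input example, each with a blocking result() method. The function name translate_in_background and the on_submitted callback are hypothetical and only there for illustration.

```python
def translate_in_background(translator, batch, on_submitted=None):
    """Submit a batch asynchronously, then wait for all of its results."""
    # Returns immediately: each item is a handle for one example of the batch.
    async_results = translator.translate_batch(batch, asynchronous=True)

    if on_submitted is not None:
        # Work done here overlaps with the translation,
        # e.g. reading and tokenizing the next batch.
        on_submitted()

    # result() blocks until the corresponding translation is available.
    return [async_result.result() for async_result in async_results]
```

The translator argument is expected to be a ctranslate2.Translator instance. Submitting several batches before calling result() on any of them is what lets multiple workers run at the same time.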
Some new translation options were added:
disable_unk: disallow the generation of the unknown token.
max_input_length: truncate the inputs after this many tokens in order to limit the maximum memory usage (1024 by default).
repetition_penalty: penalize the tokens that were already generated as described in Keskar et al. 2019.
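To give an intuition for repetition_penalty, here is a small pure-Python sketch of the idea from Keskar et al. 2019: the scores of tokens that were already generated are made less attractive before the next token is picked. This is an illustration only, not CTranslate2's actual implementation. The function name is hypothetical, and the sign handling (divide positive scores, multiply negative ones) is a common variant that I am assuming here so that the penalty always lowers the score.

```python
def apply_repetition_penalty(scores, generated_ids, penalty):
    """Return a copy of scores with already generated tokens penalized.

    scores: one score per vocabulary entry for the next decoding step.
    generated_ids: ids of the tokens generated so far.
    penalty: values above 1 penalize repetitions, 1 leaves scores unchanged.
    """
    penalized = list(scores)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Dividing a negative score would increase it, so multiply instead.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized
```

In CTranslate2 itself none of this is needed: the three options above are passed as keyword arguments to translate_batch.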
And of course, we are always looking to further increase the translation speed and reduce the memory usage. Since version 2.0, many small optimizations were implemented especially on GPU:
We also improved the correctness of the quantization formula. The impact is very subtle but the numerical results are more accurate. The new formula is only used for models converted after version 2.12.
I feel like language models, especially generative language models, are a natural continuation for this project. Technically the model support is there, but I would need to think about an integration that makes sense.
What are your ideas for CTranslate2? Is there anything you would like to see improved or think is missing from the current implementation? Thanks!