CTranslate2: a bunch of new features to start 2022

Hi there!

I’d like to post an update about recent additions to CTranslate2. The last similar post was in June 2021 when we released version 2.0, so I figured it would be interesting to summarize the latest changes. It turns out there are many!

Python wheels for Windows and Linux ARM

In addition to Linux and macOS, we now also build and publish Python wheels for Windows and Linux ARM. The Windows wheels have the same level of features as the Linux ones, including GPU support.

Converter for Marian models and the collection of 1000+ pretrained models from OPUS-MT

By design, CTranslate2 can run models trained by different NMT toolkits. We added a converter for Marian, in particular to make the 1000+ pretrained models from OPUS-MT usable in CTranslate2. The conversion is easy and does not require additional dependencies:

ct2-opus-mt-converter --model_dir opus_model --output_dir ct2_model

(If you are using one of these pretrained models in your application, make sure to respect their CC BY 4.0 license.)

This work also enables proper performance benchmarks with Marian as we can compare the translation speed using exactly the same model. See the benchmark table for a comparison (tl;dr: CTranslate2 is faster and uses much less memory).

M2M-100 multilingual model

Similarly, we extended the Fairseq converter to support the M2M-100 multilingual model. Check out this tutorial by @ymoslem for how to convert and use this model.

Methods to score existing translations

Scoring is sometimes used in the model training procedure, for example to filter the training data, so there are benefits in making this task fast. We added the methods Translator.score_batch and Translator.score_file.

Source factors (a.k.a. source features)

Models trained with source factors are now supported, as this was requested multiple times. To simplify their integration, the source factors should be attached to the tokens directly. There is no separate argument.

In the API:

translator.translate_batch([["hello│C", "world│L", "!│N"]])

In a text file:

hello│C world│L !│N

There is an error if the number of factors does not match what the model expects.
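
Since the factors travel inside the tokens, they can be attached with a small helper before calling translate_batch. This is only a minimal sketch: attach_factors is a hypothetical helper, not part of CTranslate2, and the separator character is copied from the example above.

```python
from typing import List, Sequence

# Separator copied from the example above (assumption: it matches
# the separator the model was trained with).
FACTOR_SEPARATOR = "│"


def attach_factors(
    tokens: Sequence[str],
    *factor_streams: Sequence[str],
    separator: str = FACTOR_SEPARATOR,
) -> List[str]:
    """Join each token with its factors, one factor per stream."""
    for stream in factor_streams:
        if len(stream) != len(tokens):
            raise ValueError(
                "each factor stream must have one value per token "
                f"(got {len(stream)} values for {len(tokens)} tokens)"
            )
    return [
        separator.join((token, *factors))
        for token, *factors in zip(tokens, *factor_streams)
    ]


source = attach_factors(["hello", "world", "!"], ["C", "L", "N"])
# The result can then be passed as one example of the batch:
# translator.translate_batch([source])
```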

Asynchronous translations from Python

You can now run asynchronous translations from Python. In this mode, Translator.translate_batch returns immediately and you can retrieve the results later:

async_results = translator.translate_batch(batch, asynchronous=True)
async_results[0].result()  # This method blocks until the result is available.

Asynchronous translation is also one way to benefit from inter_threads or multi-GPU parallelism.
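
To illustrate the parallelism point, one possible pattern is to submit every batch first and only then wait for the results. The sketch below assumes nothing beyond what the snippet above shows, namely that each returned item has a blocking result() method. translate_all and its submit argument are hypothetical names used for illustration.

```python
from typing import Any, Callable, Iterable, List, Sequence


def translate_all(
    submit: Callable[[Sequence[Any]], Sequence[Any]],
    batches: Iterable[Sequence[Any]],
) -> List[List[Any]]:
    """Submit all batches without waiting, then collect every result.

    `submit` is expected to return immediately with one async result
    per example, each exposing a blocking `result()` method.
    """
    # Phase 1: queue all the work so that workers can run in parallel.
    pending = [submit(batch) for batch in batches]
    # Phase 2: block on each result, preserving the input order.
    return [[item.result() for item in async_results] for async_results in pending]


# Possible usage with a translator (not executed here):
# results = translate_all(
#     lambda batch: translator.translate_batch(batch, asynchronous=True),
#     batches,
# )
```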

New translation options

Some new translation options were added:

  • disable_unk: disallow the generation of the <unk> token.
  • max_input_length: truncate the inputs after this many tokens in order to limit the maximum memory usage (1024 by default).
  • repetition_penalty: penalize the tokens that were already generated as described in Keskar et al. 2019.
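
To give an idea of what repetition_penalty does, here is a sketch of the penalty from Keskar et al. 2019 applied to a plain list of scores: the scores of tokens that were already generated are divided by the penalty. Multiplying negative scores instead of dividing them is a common adaptation so that the penalty always lowers the score. It is an assumption of this sketch and not a description of the CTranslate2 internals. The function name is hypothetical.

```python
from typing import Iterable, List, Sequence


def apply_repetition_penalty(
    scores: Sequence[float],
    generated_ids: Iterable[int],
    penalty: float,
) -> List[float]:
    """Return a copy of `scores` where already generated tokens are penalized."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    penalized = list(scores)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Dividing a positive score or multiplying a negative one both
        # make the token less likely when penalty > 1.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized
```

A penalty of 1 leaves the scores unchanged, and larger values discourage repetitions more strongly.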

Performance and correctness improvements

And of course, we are always looking to further increase the translation speed and reduce the memory usage. Since version 2.0, many small optimizations have been implemented, especially on GPU.

We also improved the correctness of the quantization formula. The impact is very subtle but the numerical results are more accurate. The new formula is only used for models converted after version 2.12.


What’s next?

I feel like language models, especially generative language models, are a natural continuation for this project. Technically the model support is there, but I would need to think about an integration that makes sense.

What are your ideas for CTranslate2? Is there anything you would like to see improved or think is missing from the current implementation? Thanks!


Thanks a lot, Guillaume, for all these great efforts and updates!

How would you approach this? Do you mean you could, for example, take a crawled corpus and filter out bad translations that way?

For the record, here is a snippet of code (the inputs should use the same tokenization and subword segmentation as the scoring model).

import ctranslate2

translator = ctranslate2.Translator("fr_en_model_dir")
scores = translator.score_batch([["merci", "beaucoup"]], [["thanks", "a", "lot"]])

Example output of scores:

[[-11.929224014282227, -7.074151039123535, -9.238308906555176]]


As for possible features, something I missed while working with M2M-100 models is a way to figure out the model architecture. If the model is M2M-100, it will definitely require a start and end token. So a ctranslate2.model_type function would be nice, unless there is already a way to figure this out.

Thanks again!
Yasmin


Yes. You can try scoring a large corpus using a trained model and then sort by score. You will find that low scores very often correspond to misaligned data.
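
As a rough sketch of the sorting step, the token-level scores (in the same list-of-lists shape as the example output above) can be reduced to one number per sentence pair and sorted. Averaging the token log probabilities is an assumption made here so that long sentences are not penalized for their length. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def sentence_score(token_log_probs: Sequence[float]) -> float:
    """Average token log probability; empty outputs get the worst score."""
    if not token_log_probs:
        return float("-inf")
    return sum(token_log_probs) / len(token_log_probs)


def rank_by_score(
    pairs: Sequence[Pair],
    token_scores: Sequence[Sequence[float]],
) -> List[Tuple[float, Pair]]:
    """Return (score, pair) tuples sorted from lowest to highest score."""
    if len(pairs) != len(token_scores):
        raise ValueError("expected one list of token scores per sentence pair")
    scored = [(sentence_score(s), pair) for pair, s in zip(pairs, token_scores)]
    return sorted(scored, key=lambda item: item[0])


# The lowest scoring pairs come first and are the candidates to inspect
# or remove, for example:
# ranked = rank_by_score(pairs, scores)
# suspicious = ranked[: len(ranked) // 10]
```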

Thanks for this idea. I will think about attaching some kind of metadata to converted models.


For reference, here is a nice paper on this scoring method: Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora - ACL Anthology


Great work! Generative language models would be useful and open up many use cases.

As you said, you could try modeling it as a translation, e.g.:

"The quick brown fox jumped over the lazy " -> "dog."

More explicit support would probably be a better long term solution though.

For the record, there is a new set of OPUS models here:

For example, the new set has an English-to-Arabic model, which I could not find in the old set.

Most models work fine with the CTranslate2 converter. For some reason, I could not run the “bt” versions of the models, though.

Please note also that some models require the source to be augmented with the target code, some do not. The description of each model clarifies this.
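
For the models that need it, the augmentation can be a single token prepended to the tokenized source. The sketch below assumes the >>xxx<< form used in the OPUS-MT model descriptions, so the exact code for a given model should be checked in its description. add_target_code is a hypothetical helper.

```python
from typing import List, Sequence


def add_target_code(source_tokens: Sequence[str], target_code: str) -> List[str]:
    """Prepend the target language token, e.g. ">>ara<<", to the source tokens."""
    # Assumption: the model expects the token in the >>code<< form.
    return [f">>{target_code}<<", *source_tokens]


tokens = add_target_code(["▁Hello", "▁world"], "ara")
# tokens can then be passed to translator.translate_batch([tokens])
```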

Kind regards,
Yasmin

What was the issue? Did you get an error during conversion or did the translation return incorrect results?

Dear Guillaume,

For example, for the FRA-ENG models, I have now tried opus+bt-2021-04-30.zip and it works fine. The issue is rather with the model called opusTCv20210807+bt_transformer-big_2022-03-09.zip

Steps to reproduce:

  • Run the converter from the extracted model directory:
ct2-opus-mt-converter --model_dir . --output_dir fren_ctranslate2
  • You will get the error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/bin/ct2-opus-mt-converter", line 8, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ctranslate2/converters/opus_mt.py", line 35, in main
    converter.convert_from_args(args)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ctranslate2/converters/converter.py", line 31, in convert_from_args
    return self.convert(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ctranslate2/converters/converter.py", line 45, in convert
    model_spec = self._load()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ctranslate2/converters/marian.py", line 30, in _load
    vocabs = list(map(_load_vocab, self._vocab_paths))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ctranslate2/converters/marian.py", line 116, in _load_vocab
    token, idx = line.rsplit(":", 1)
ValueError: not enough values to unpack (expected 2, got 1)
  • Open the file opusTCv20210807+bt.spm32k-spm32k.vocab.yml in a text editor. Scroll to the end. You will notice two extra empty lines. Delete them, and save.
  • Now, run the converter command again. The model is converted without any errors.
  • However, if you use the model for translation, it produces random text that makes no sense.
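
To locate the offending lines without opening the file in an editor, a small diagnostic can apply the same split as the line shown in the traceback and report where it would fail. This is only a sketch for inspecting the file, not the converter code and not a fix. find_unparsable_lines is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def find_unparsable_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, content) for lines without a "token: index" split."""
    bad = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\r\n")
        # Same split as in the traceback: it yields a single part when
        # the line has no ":" at all, for example an empty line.
        if len(content.rsplit(":", 1)) != 2:
            bad.append((number, content))
    return bad


# Possible usage on the vocabulary file (not executed here):
# with open("opusTCv20210807+bt.spm32k-spm32k.vocab.yml", encoding="utf-8") as f:
#     print(find_unparsable_lines(f))
```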

Note that other models from the same series (e.g. opus+bt-2021-04-30.zip) give a perfect translation for the same sentence. It might be an issue with the bt_transformer-big_2022 models, as I tried a couple of them and they all have issues with translation.

Kind regards,
Yasmin


Thank you for the details. The parsing of the vocabulary file should be revised. It should be fixed with this change:
