CTranslate2 (2.0 release) - GeForce RTX 2080 Ti is faster than RTX 3090?

We ran tests with the latest CTranslate2 (2.0) release and found that translation speed on a GeForce RTX 2080 is 25% faster than on an RTX 3090 (single GPU in both cases). We loaded 14 language models (around 4.7 GB in memory) on both GPUs.

How can this be?

We tested “int8” models with “int8” and “float” parameters. With beam_size 1 and 2, the results were the same: the 2080 is always faster than the 3090.

2080: Driver Version: 460.32.03 CUDA Version: 11.2
3090: Driver Version: 460.73.01 CUDA Version: 11.2

Ubuntu 20.04

Running in Docker container.

Hi,

Thanks for this test. I have some questions:

  1. What model size are you using (e.g. Transformer base or big)?
  2. How exactly did you measure the speed?
  3. What batch size did you set?

Here are the answers:

  1. Transformer Big
  2. Command for running tests:

ab -T 'application/json' -r -n 1 -c 1 -p translate_test.txt http://localhost:8080/translate

The translate_test.txt content is here: Paste of Code

http://localhost:8080/translate is our web server. It was the same during the tests on the 2080 and the 3090, so it shouldn't affect the performance.

  3. Batch size = default value in the Python library:

max_batch_size: int = 0

I suppose your translation server is splitting the input text. How does it batch the sentences? Or are you running the translation sentence by sentence?

Our tokenizer works this way (a rough sketch in code follows the list):

  • First, it divides the text into lines.
  • Then it divides each line into parts at sentence-end splitters (a dot with special conditions, a question mark, an exclamation mark).
  • If there are no sentence-end splitters, or a part is still bigger than 50 words, it divides it at splitters inside the sentence (comma, braces, etc.).
  • If after that a part is still bigger than 50 words, the tokenizer divides it into smaller parts by cutting words while the part length is more than 50 words.
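
For illustration only, here is a simplified Python sketch of those splitting steps (this is not our actual tokenizer; the regular expressions and the fixed 50-word limit are assumptions based on the description above):

import re

MAX_WORDS = 50  # word limit from the description above

def split_text(text):
    """Illustrative sketch of the splitting steps described above."""
    parts = []
    for line in text.splitlines():
        # 1. Split the line at sentence-end punctuation (simplified: no special dot handling).
        for sentence in re.split(r"(?<=[.!?])\s+", line):
            sentence = sentence.strip()
            if not sentence:
                continue
            if len(sentence.split()) <= MAX_WORDS:
                parts.append(sentence)
                continue
            # 2. Still too long: split at in-sentence punctuation (commas, braces, ...).
            for chunk in re.split(r"(?<=[,;)\]}])\s+", sentence):
                words = chunk.split()
                # 3. Still too long: hard-cut into pieces of at most MAX_WORDS words.
                for i in range(0, len(words), MAX_WORDS):
                    parts.append(" ".join(words[i:i + MAX_WORDS]))
    return parts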

Do you know the actual number of sentences you are passing to translate_batch?

The tokenizer divides the text into 760 parts (sentences) and after that executes the translate_batch method 12 times (we have a limit of 64 elements per call to prevent out-of-memory errors).


Ok, thanks. So the usage looks good to me.

We tested “int8” models with “int8” and “float” parameters.

So you tried compute_type="int8" and compute_type="float", is that right? Did you also try with “float16”?
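
For reference, the compute type is selected when loading the model, along these lines (a sketch; "model_dir" is a placeholder for your converted model directory):

import ctranslate2

# Load the converted model on the GPU with the requested compute type.
translator = ctranslate2.Translator("model_dir", device="cuda", compute_type="float16")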

we have a limit of 64 elements per call to prevent out-of-memory errors

Note that the parameter max_batch_size can do that for you. You can pass all sentences to translate_batch and set max_batch_size to 64. This will improve performance as we do some reordering by length internally.
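
For example (a sketch, assuming translator is the ctranslate2.Translator from the snippet above and sentences is your full list of tokenized sentences):

# Pass all sentences at once; CTranslate2 forms batches of at most 64 examples
# internally and reorders them by length before translating.
results = translator.translate_batch(sentences, max_batch_size=64, beam_size=1)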


I definitely confirm this. We initially tried an outside loop, and when we then passed all the sentences to translate_batch, the performance improved dramatically.

According to the performance recommendations, you might also want to try batch_type="tokens". In this case, you can increase the max_batch_size.
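
A sketch of that variant, reusing the translator and sentences from the earlier snippets (the value 1024 is only an illustrative token budget, not a recommendation):

# Batch by number of tokens instead of number of examples.
results = translator.translate_batch(sentences, batch_type="tokens", max_batch_size=1024)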

Depending on your environment, this kind of setup allows applying some sort of caching. For example, if you are in a CAT tool, you might send multiple segments to translate as a batch while the translator works on the current segment.

I know that your API is completely the same; however, this does not mean the two machines handle everything in the same way. To eliminate factors, I would first run a test of CTranslate2 with a very simple Python script (no API or Docker) and compare the performance between the two machines. I would also take into consideration other specifications of the machines, not only the GPUs.
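
For example, a minimal standalone script along these lines could be run on both machines (a sketch; the model path and the dummy input are placeholders, and the timing only covers the translation call):

import time
import ctranslate2

translator = ctranslate2.Translator("model_dir", device="cuda", compute_type="int8")

# Dummy tokenized input; replace with the real 760 sentences.
sentences = [["▁Hello", "▁world", "."]] * 760

start = time.time()
results = translator.translate_batch(sentences, max_batch_size=64, beam_size=1)
elapsed = time.time() - start

num_tokens = sum(len(r.hypotheses[0]) for r in results)
print(f"Translated {len(results)} sentences ({num_tokens} target tokens) in {elapsed:.2f} s")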

Finally, some recent types of GPUs are known to have mysterious issues due to a lack of testing on these machines, whether this is related to CUDA, deep learning frameworks, or other factors. See this post for example.

Kind regards,
Yasmin
