We ran tests with the latest CTranslate2 release (2.0) and found that translation speed on a GeForce RTX 2080 is 25% faster than on a 3090 (single GPU in both cases). We loaded 14 language models (around 4.7 GB in memory) on both GPUs.
How can this be?
We tested “int8” models with “int8” and “float” parameters, with beam_size 1 and 2. Same result every time: the 2080 is always faster than the 3090.
2080: Driver Version: 460.32.03 CUDA Version: 11.2
3090: Driver Version: 460.73.01 CUDA Version: 11.2
I suppose your translation server is splitting the input text. How does it batch the sentences? Or are you running the translation sentence by sentence?
The tokenizer divides the text into 760 parts (sentences), and we then call the translate_batch method 12 times (we limit each call to an array of 64 elements to prevent out-of-memory errors).
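In simplified form, the loop looks roughly like this (variable names are illustrative):

```python
# Simplified sketch of the chunked approach described above:
# one translate_batch call per group of at most 64 tokenized sentences.
chunk_size = 64
results = []
for i in range(0, len(tokenized_sentences), chunk_size):
    chunk = tokenized_sentences[i:i + chunk_size]
    results.extend(translator.translate_batch(chunk, beam_size=2))
```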
We tested “int8” models with “int8” and “float” parameters.
So you tried compute_type="int8" and compute_type="float", is that right? Did you also try with “float16”?
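For reference, the compute type is selected when creating the translator, something like:

```python
import ctranslate2

# compute_type can be e.g. "int8", "float16", or "float" ("model_dir/" is a placeholder path).
translator = ctranslate2.Translator("model_dir/", device="cuda", compute_type="float16")
```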
we limit each call to an array of 64 elements to prevent out-of-memory errors
Note that the parameter max_batch_size can do that for you. You can pass all sentences to translate_batch and set max_batch_size to 64. This will improve performance as we do some reordering by length internally.
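For example (assuming translator and the list of tokenized sentences already exist):

```python
# Pass all 760 tokenized sentences in a single call; CTranslate2 builds
# internal batches of at most 64 examples and reorders them by length.
results = translator.translate_batch(
    all_tokenized_sentences,
    max_batch_size=64,
    beam_size=2,
)
```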
I can definitely confirm this. We initially used an outer loop, and when we instead passed all the sentences to translate_batch, performance improved dramatically.
According to the performance recommendations, you might also want to try batch_type="tokens". In this case, you can increase the max_batch_size.
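A sketch of what that could look like (the value 1024 is only illustrative; tune it for your GPU memory):

```python
# With batch_type="tokens", max_batch_size is counted in tokens rather than sentences.
results = translator.translate_batch(
    all_tokenized_sentences,
    max_batch_size=1024,   # illustrative value
    batch_type="tokens",
    beam_size=2,
)
```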
Depending on your environment, this kind of setup also allows some form of caching. For example, in a CAT tool, you might send several upcoming segments to translate as one batch while the translator is still working on the current segment.
I know that your API is exactly the same on both machines; however, this does not mean the two machines handle everything in the same way. To eliminate factors, I would first run CTranslate2 from a very simple Python script (no API server or Docker) and compare the performance between the two machines. I would also take into account the other specifications of the machines, not only the GPUs.
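A minimal timing script could look like this (the model path and dummy input are placeholders; substitute your own model and tokenized sentences, and run the same script on both machines):

```python
import time
import ctranslate2

# Placeholder model directory and dummy tokenized input.
translator = ctranslate2.Translator("model_dir/", device="cuda", compute_type="int8")
batch = [["▁Hello", "▁world", "."]] * 640

start = time.time()
results = translator.translate_batch(batch, max_batch_size=64, beam_size=2)
elapsed = time.time() - start
print("Translated %d sentences in %.2f s" % (len(batch), elapsed))
```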
Finally, some recent GPU models are known to have unexplained performance issues, possibly because they have seen less testing, whether this is related to CUDA, deep learning frameworks, or other factors. See this post for example.