I can definitely confirm this. We first tried an outside loop that called the translator once per sentence; when we instead passed all the sentences to translate_batch in a single call, performance improved dramatically.
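As a minimal sketch of the difference (the model directory and tokens here are hypothetical, and CTranslate2 expects input that is already tokenized, e.g. with SentencePiece):

```python
import ctranslate2

translator = ctranslate2.Translator("ende_ct2_model", device="cuda")

sentences = [
    ["▁Hello", "▁world", "."],
    ["▁How", "▁are", "▁you", "?"],
]

# Slow: an outside loop, one call and one GPU batch of size 1 per sentence.
slow = [translator.translate_batch([s])[0] for s in sentences]

# Fast: a single call lets CTranslate2 batch and reorder the inputs itself.
fast = translator.translate_batch(sentences)

for result in fast:
    print(" ".join(result.hypotheses[0]))
```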
According to the performance recommendations, you might also want to try batch_type="tokens". In this case, you can increase the max_batch_size considerably, since it is then counted in tokens rather than sentences.
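Continuing the hypothetical snippet above, that could look like:

```python
# max_batch_size is now a token budget rather than a sentence count,
# so long and short sentences can share batches without exhausting memory.
results = translator.translate_batch(
    sentences,
    max_batch_size=2048,   # hypothetical value; tune for your GPU memory
    batch_type="tokens",
)
```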
Depending on your environment, this kind of setup also makes a form of caching or prefetching possible. For example, in a CAT tool you might send several upcoming segments to translate as one batch while the (human) translator is still working on the current segment.
I know that your API is exactly the same on both machines; however, this does not mean the two machines handle everything in the same way. To eliminate factors, I would first test CTranslate2 with a very simple Python script (no API, no Docker) and compare performance between the two machines. I would also take other machine specifications into consideration, not only the GPUs.
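As a rough starting point, something like this (the model path, inputs, and batch settings are placeholders):

```python
import time
import ctranslate2

# Run the same script on both machines: same model, same inputs,
# no API layer and no Docker, so only hardware/driver differences remain.
translator = ctranslate2.Translator("ende_ct2_model", device="cuda")

sentences = [["▁Hello", "▁world", "."]] * 1000   # dummy pre-tokenized input

translator.translate_batch(sentences[:8])        # warm-up (CUDA init, allocations)

start = time.perf_counter()
results = translator.translate_batch(sentences, max_batch_size=1024, batch_type="tokens")
elapsed = time.perf_counter() - start

num_tokens = sum(len(r.hypotheses[0]) for r in results)
print(f"{elapsed:.2f} s total, {num_tokens / elapsed:.0f} target tokens/s")
```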
Finally, some recent GPU models are known to have mysterious issues, likely due to limited testing on these machines, whether this is related to CUDA, deep learning frameworks, or other factors. See this post for example.