General questions to improve performance

Hello there again,

I want to improve my server's performance with CT2 as much as possible and would like some clarification on a few things. Right now I load an assortment of models, all with the same inter_threads and intra_threads (4 and 4 respectively, on a 16-core CPU). In my current use case, a source message asks the server to translate the same message into 6 different languages (6 different models, using translate_batch with sentence splitting). As a result, each core/thread seems to share processing power: instead of the requests completing one after another (e.g. each taking 1s to translate), all 6 finish and are released at once (after ~6s).
Regarding translate_batch: if I make a translation request with 8 sentences (8 elements in the list passed in), would the batch method use all 4 inter_threads twice, or use one thread to complete that batch (no batch_size is passed in at the moment)?

My proposed solution to the described “issue” would be to count how many batches are currently being processed, so that some translations finish sooner than others instead of all at once. However, I’m not sure how this might affect performance, in case calling translate_batch 6 times across 6 models is actually what helps performance.
Thanks for any insight anyone can provide!

On that thought, I was testing out intra_threads and inter_threads values and found that setting inter_threads to 1 and intra_threads to 128 drastically improves translation times when multiple translations run at once (8 models all translating at the same time), while a single translation of a long paragraph becomes slightly slower (by ~0.3s).


One translation request will use the 4 intra_threads, not inter_threads. The documentation describes when the inter_threads are used: Multithreading and parallelism — CTranslate2 2.24.0 documentation

The general approach is:

  • Increase intra_threads to reduce the translation time for a single request (latency is improved)
  • Increase inter_threads to reduce the global translation time of multiple concurrent requests to the same model (throughput is improved)

You should avoid setting intra_threads to a value larger than the number of CPU cores.
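
As a rough sketch of the guidance above: the snippet assumes the CTranslate2 Python API, where ctranslate2.Translator accepts device, inter_threads and intra_threads. The helper names thread_settings and load_translator, and the heuristic for splitting cores, are hypothetical and only meant to illustrate the latency/throughput trade-off.

```python
import os


def thread_settings(cpu_cores, concurrent_requests_per_model=1):
    """Pick inter/intra thread counts whose product stays within the core count.

    Hypothetical heuristic for illustration only, not an official recommendation.
    """
    inter = max(1, min(concurrent_requests_per_model, cpu_cores))
    intra = max(1, cpu_cores // inter)
    return {"inter_threads": inter, "intra_threads": intra}


def load_translator(model_dir, cpu_cores=None, concurrent_requests_per_model=1):
    # Imported lazily so thread_settings can be tried without CTranslate2 installed.
    import ctranslate2

    # os.cpu_count() reports logical cores; pass the physical count explicitly
    # if it differs on your machine.
    cores = cpu_cores or os.cpu_count() or 1
    settings = thread_settings(cores, concurrent_requests_per_model)
    return ctranslate2.Translator(model_dir, device="cpu", **settings)


# Latency-oriented: one request at a time may use all 16 cores.
print(thread_settings(16, 1))
# Throughput-oriented: 4 concurrent requests to the same model, 4 threads each.
print(thread_settings(16, 4))
```

With several different models loaded in the same process, each Translator has its own thread pool, so the per-model values may need to be lowered further.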

This is why I am a bit perplexed: when I set intra_threads to 128 (256 and above brings no further improvement) on a 16-core CPU, I get much faster concurrent translation times (down from 8s to 2s when 8 requests are processed at once) than with a value that follows this logic and uses multiple inter_threads.
Meanwhile, single translations that are not executed in parallel stay about the same.

Would you know of any explanation for this?

inter_threads only applies to concurrent requests made to the same model, but based on your initial comment you are making concurrent requests to 6 different models. In this case only intra_threads applies.
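
For the one-message-to-several-models case, a minimal sketch of fanning the request out from Python and handing back each translation as soon as it is ready, rather than waiting for all of them. Everything here is an assumption for illustration: translate_to_all and fake_model are hypothetical names, and the fake models sleep instead of calling translate_batch. Lowering max_workers below the number of models caps how many translations run at once.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def translate_to_all(text, translators, max_workers=None):
    """Yield (language, translation) pairs as soon as each one finishes.

    `translators` maps a language code to a callable taking the source text.
    In a real server each callable would wrap one model's translate_batch call.
    """
    workers = max_workers or len(translators)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fn, text): lang for lang, fn in translators.items()}
        for future in as_completed(futures):
            yield futures[future], future.result()


def fake_model(lang, delay):
    # Stand-in for a model: sleeps for `delay` seconds instead of translating.
    def run(text):
        time.sleep(delay)
        return f"[{lang}] {text}"
    return run


models = {"de": fake_model("de", 0.05), "fr": fake_model("fr", 0.01)}
for lang, result in translate_to_all("Hello", models):
    print(lang, result)
```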

Does the number of cores refer to physical cores or logical (virtual) ones?

Hello! I am not sure what environment @ArtanisTheOne is working in, but for CAT tools, caching translations is usually recommended. In this sense, there is a chance that translating 30 or 60 sentences with the same model in one call would be much faster than translating only 8 sentences at a time. So basically, you pass these sentences as a list and increase the value of max_batch_size.
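
To make the effect of max_batch_size concrete, here is a simplified pure-Python illustration of how one list passed in a single call could be cut into batches. It follows the behavior described in the translate_batch documentation (inputs sorted by length, then split into chunks of max_batch_size), but it is not CTranslate2's actual implementation; split_into_batches and the sample data are hypothetical.

```python
def split_into_batches(sentences, max_batch_size):
    """Sort tokenized sentences by length and cut them into fixed-size chunks.

    Simplified illustration only; not CTranslate2's actual implementation.
    """
    if max_batch_size <= 0:
        return [list(sentences)]  # 0 is treated here as "no limit"
    # Grouping sentences of similar length keeps padding low within a batch.
    ordered = sorted(sentences, key=len, reverse=True)
    return [
        ordered[i:i + max_batch_size]
        for i in range(0, len(ordered), max_batch_size)
    ]


# 30 tokenized sentences of varying length (hypothetical data).
sentences = [["tok"] * (1 + i % 7) for i in range(30)]
batches = split_into_batches(sentences, max_batch_size=8)
print([len(b) for b in batches])  # [8, 8, 8, 6]
```

Note that this only describes splitting within one call; it says nothing about combining lists from separate calls.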

Another aspect here for future consideration is multilinguality. One reason why companies now build multilingual models is precisely the use case that you described, i.e. smoothness of deployment and scalability.

All the best,

In my case it applies to physical cores.

My question at this point is why setting intra_threads to a value that doesn’t follow the logic above (staying at or below the number of CPU cores) speeds up computation when different models are running with the same intra_threads value.

As for @ymoslem, my current setup is essentially an API, so every translation comes from an API request. Would setting max_batch_size to, as you say, 20 “examples” improve concurrent requests?
Let’s say 2 requests are made to the same model, which means translate_batch would be called on 2 different threads. Would a max_batch_size of 20 examples cause request 1 and request 2 to be processed as one batch if they each contain 10 examples, or does max_batch_size only help by splitting the passed-in list into multiple batches?
Thanks for any insight

What’s the difference in performance when setting intra_threads to 16 vs. 128?

With 6 concurrent requests to 6 different models, the average translation time across all 6 is 2.59s with intra_threads set to 16, and 1.975s with intra_threads set to 128. I have run multiple tests with different texts and the results come out similarly.
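
For anyone wanting to reproduce this kind of comparison, a minimal timing sketch, assuming each task is a zero-argument callable that wraps one model's translation call. The name mean_concurrent_latency and the sleeping stand-in tasks are hypothetical; the real measurement would load the models once per intra_threads setting and pass the actual calls.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def mean_concurrent_latency(tasks, repeats=3):
    """Run all tasks at once, `repeats` times; return mean per-task latency in seconds."""
    def timed(task):
        start = time.perf_counter()
        task()
        return time.perf_counter() - start

    latencies = []
    for _ in range(repeats):
        # One worker per task so that every task starts at roughly the same time.
        with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
            latencies.extend(pool.map(timed, tasks))
    return statistics.mean(latencies)


# Stand-in workloads; in practice each task would call one model's translate_batch.
tasks = [lambda: time.sleep(0.01) for _ in range(6)]
print(f"mean latency: {mean_concurrent_latency(tasks):.3f}s")
```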