Hello there again,
I want to improve my server's performance with CT2 as much as possible and would like clarification on a few things. Right now I load an assortment of models with the same inter_threads and intra_threads settings (4 and 4 respectively, on a 16-core CPU). My current use case has a source message request the server to translate the same message into 6 different languages (6 different models, using translate_batch with sentence splitting). As a result, I find that each core/thread seems to share processing power, so that instead of each request completing sequentially (e.g. taking 1 s to translate), the server finishes all 6 translations and releases them at once (after ~6 s).
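For context, the fan-out I'm describing looks roughly like this; a minimal sketch where the per-language translators are stand-ins (in reality each would be a ctranslate2 Translator and each call would be translate_batch):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for 6 per-language models (hypothetical); each call
# represents translator.translate_batch(sentences) for one model.
def make_fake_translator(lang):
    def translate(sentences):
        return [f"[{lang}] {s}" for s in sentences]
    return translate

translators = {lang: make_fake_translator(lang)
               for lang in ["de", "fr", "es", "it", "pt", "nl"]}

sentences = ["Hello world.", "How are you?"]

# Fan the same sentences out to all 6 models at once; with shared CPU
# threads, the real calls contend and tend to finish around the same time.
with ThreadPoolExecutor(max_workers=6) as pool:
    futures = {lang: pool.submit(fn, sentences)
               for lang, fn in translators.items()}
    results = {lang: f.result() for lang, f in futures.items()}

print(results["de"][0])  # [de] Hello world.
```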
Regarding translate_batch: if I make a translation request with 8 sentences (8 elements in the passed-in list), would the batch method use all 4 inter_threads twice, or use one thread to complete that batch (no batch_size passed in at the moment)?
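For what it's worth, my understanding is that translate_batch splits the input list into batches of at most max_batch_size examples, and those batches can then be distributed across the inter_threads workers. A simplified pure-Python sketch of the splitting (it ignores CTranslate2's length-based sorting, which reorders examples to reduce padding):

```python
def make_batches(examples, max_batch_size):
    """Split a list of examples into batches of at most max_batch_size.

    Simplified sketch: the real implementation also sorts examples by
    length so each batch needs less padding.
    """
    return [examples[i:i + max_batch_size]
            for i in range(0, len(examples), max_batch_size)]

# 8 sentences with max_batch_size=4 -> two batches, which could then
# run on two different inter_threads workers.
sentences = [f"sentence {i}" for i in range(8)]
batches = make_batches(sentences, 4)
print(len(batches))  # 2
```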
My proposed solution to the described “issue” would be to count how many batches are currently being processed, so that some translations finish sooner than others instead of all at once, but I'm not sure how this would affect performance if running translate_batch concurrently across 6 models actually helps performance.
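One way to sketch that counting idea is a semaphore capping how many translations run at once, so earlier requests get the CPU to themselves and finish sooner; everything here except the threading is a placeholder (fake_translate stands in for a model's translate_batch, and the limit of 2 is arbitrary):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_BATCHES = 2          # tune for your CPU; arbitrary here
_batch_slots = threading.BoundedSemaphore(MAX_CONCURRENT_BATCHES)

def translate_limited(translate_fn, sentences):
    """Run one translate_batch-style call only while a slot is free.

    At most MAX_CONCURRENT_BATCHES translations share the CPU; the rest
    wait their turn, so early requests finish sooner instead of all 6
    completing together.
    """
    with _batch_slots:
        return translate_fn(sentences)

# Stand-in for translator.translate_batch (hypothetical):
def fake_translate(sentences):
    return [s.upper() for s in sentences]

with ThreadPoolExecutor(max_workers=6) as pool:
    futures = [pool.submit(translate_limited, fake_translate, ["hello world"])
               for _ in range(6)]
    results = [f.result() for f in futures]

print(results[0])  # ['HELLO WORLD']
```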
Thanks for any insight anyone can provide!
On that thought, I was testing out intra_threads and inter_threads values and found that setting inter_threads to 1 and intra_threads to 128 drastically reduces translation times for multiple translations at a time (when 8 models are all translating at the same time), while normal translation times (1 translation of a long paragraph) are only slightly longer (~0.3 s).
One translation request will use the 4 inter_threads. The documentation describes when the inter_threads are used: Multithreading and parallelism — CTranslate2 2.24.0 documentation
The general approach is:
- intra_threads to reduce the translation time for a single request (latency is improved)
- inter_threads to reduce the global translation time of multiple concurrent requests to the same model (throughput is improved)
You should avoid setting intra_threads to a value larger than the number of CPU cores.
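For concreteness, both options are set when constructing the translator; a setup sketch following that guidance, assuming a converted model directory at ./ende_ct2 (the path is a placeholder):

```python
import ctranslate2

# inter_threads: how many batches are translated in parallel (throughput);
# intra_threads: threads used within one batch (latency); per the docs,
# keep intra_threads at or below the number of CPU cores.
translator = ctranslate2.Translator(
    "./ende_ct2",          # placeholder path to a converted model
    device="cpu",
    inter_threads=4,
    intra_threads=4,
)
```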
This is why I am a bit perplexed: when I set intra_threads to 128 (256 and above no longer improves) on a 16-core CPU, I achieve much faster concurrent translation times (from 8 s down to 2 s when 8 requests are processed at once) than when setting it to a value that follows that logic. Meanwhile, single translations which are not executed in parallel stay close to the same.
Would you know of any explanation for this?
inter_threads only applies to concurrent requests made to the same model, but based on your initial comment you are making concurrent requests to 6 different models. In this case, only intra_threads is relevant.
Does the number of cores refer to the physical ones or the logical (hyper-threaded) ones?
Hello! I am not sure what environment @ArtanisTheOne is working in, but for CAT tools, caching translations is usually recommended. In this sense, there is a chance that translating 30 or 60 sentences with the same model at the same time would be much faster than translating only 8 sentences at a time. So basically, you pass these sentences as a list and increase the value of max_batch_size.
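To illustrate, the server can merge the sentences from several pending requests into one list, make a single translate_batch call, and split the results back per request; a pure-Python sketch where the translate function is a stand-in for the real model call:

```python
def translate_merged(translate_fn, requests):
    """Merge several requests' sentence lists into one list, translate
    once, then split the results back per request."""
    merged = [s for req in requests for s in req]
    results = translate_fn(merged)          # one big translate_batch call
    out, pos = [], 0
    for req in requests:                    # hand each request its slice
        out.append(results[pos:pos + len(req)])
        pos += len(req)
    return out

requests = [["a b", "c d"], ["e f"], ["g h", "i j", "k l"]]
per_request = translate_merged(lambda xs: [x.upper() for x in xs], requests)
print(per_request[1])  # ['E F']
```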
Another aspect here for future consideration is multilinguality. One reason why companies now build multilingual models is precisely the use case that you described, i.e. smoothness of deployment and scalability.
All the best,
In my case it applies to physical cores.
My question at this point is why setting intra_threads to a value that doesn't follow the documented guidance speeds up computation when different models are running concurrently with the same settings.
As for @ymoslem: my current setup is essentially an API, so any translation request I make comes from an API request. Would setting max_batch_size to, as you say, 20 “examples” improve concurrent requests?
Let’s say 2 requests are made (to the same model), which means translate_batch would be called on 2 different threads. Would a max_batch_size of 20 examples cause request 1 and request 2 to be processed as one batch if they are both 10 examples, or does max_batch_size only help with splitting the passed-in list into multiple batches?
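From what I can tell, max_batch_size only splits the list inside a single translate_batch call; combining two concurrent requests into one batch would have to happen before the call, e.g. by collecting requests that arrive within a short window. A hedged sketch of such a collector (the window length and translate function are arbitrary stand-ins, not CTranslate2 API):

```python
import queue
import threading
import time

class MicroBatcher:
    """Collect requests arriving within a short time window and
    translate them together as one batch (illustrative sketch)."""

    def __init__(self, translate_fn, window=0.05):
        self.translate_fn = translate_fn   # stands in for translate_batch
        self.window = window               # how long to wait for more requests
        self.inbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def translate(self, sentences):
        """Called by each request thread; blocks until its results are ready."""
        slot = {"sentences": sentences, "done": threading.Event()}
        self.inbox.put(slot)
        slot["done"].wait()
        return slot["result"]

    def _loop(self):
        while True:
            batch = [self.inbox.get()]     # block until a request arrives
            deadline = time.monotonic() + self.window
            while True:                    # gather more until the window closes
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.inbox.get(timeout=timeout))
                except queue.Empty:
                    break
            merged = [s for slot in batch for s in slot["sentences"]]
            results = self.translate_fn(merged)   # one combined batch
            pos = 0                        # hand each request its slice back
            for slot in batch:
                n = len(slot["sentences"])
                slot["result"] = results[pos:pos + n]
                pos += n
                slot["done"].set()

batcher = MicroBatcher(lambda xs: [x.upper() for x in xs], window=0.05)
print(batcher.translate(["hello"]))  # ['HELLO']
```

Whether this wins in practice depends on how often requests actually overlap; the collection window adds a little latency when the server is lightly loaded.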
Thanks for any insight
What’s the difference in performance when setting intra_threads to 16 vs. 128?
When having 6 concurrent requests using 6 different models, the average translation time for all 6 is 2.59 s with intra_threads set to 16; with it set to 128 the average is 1.975 s. I have done multiple tests with different texts and the results come out similarly.