I’m trying to understand the behavior of translate_batch when used with inter_threads and ThreadPoolExecutor.
My understanding of pybind11’s gil_scoped_release class is that it temporarily releases the GIL in Python, giving other threads the opportunity to run while the long-running thread is doing its thing. In particular, this is used in ctranslate2’s translate_batch method.
With the GIL released, running the same translation once synchronously (ex 1.) and twice in parallel inside a ThreadPoolExecutor (ex 2.) should take about the same time. When inter_threads and intra_threads are both =1, this isn’t the case (these both being =1 shouldn’t matter here).
import time
import os
import sentencepiece
import ctranslate2
from concurrent.futures import ThreadPoolExecutor
os.environ["OMP_NUM_THREADS"]="1"
os.environ["MKL_NUM_THREADS"]="1"
spm = sentencepiece.SentencePieceProcessor("spm.model")
model = ctranslate2.Translator("model", inter_threads=1)
tokens = [spm.EncodeAsPieces("some sentence to translate")]
# Ex 1.
s = time.time()
model.translate_batch(tokens)
print("Sync once:", time.time() - s) # 0.219s
# Ex 2.
executor = ThreadPoolExecutor(max_workers=2)
s = time.time()
futures = [executor.submit(model.translate_batch, tokens) for _ in range(2)]
results = [f.result() for f in futures]  # wait for both translations to finish
print("Parallel twice:", time.time() - s) # 0.395s
In the example above, where inter_threads and intra_threads are both =1, the thread pool example takes about twice as long (0.395s vs 0.219s). If the GIL were released, one would assume that the runtime should be the same.
However, if we increase inter_threads to 2, the runtime for ex 2. is the same as for ex 1., and ThreadPoolExecutor seems to be working as intended. This behavior holds while inter_threads, max_workers, and the number of tasks are low, but performance degrades as these values increase, even if they are well below the resources available on the machine.
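To make the timings I'm describing concrete, here is a toy model that does not use ctranslate2 at all. It is only a sketch under assumptions: fake_translate, make_fake_translate, time_parallel, and the slots parameter are hypothetical names I made up, time.sleep stands in for the translation work (it releases the GIL), and the semaphore is just my guess at how a limit on concurrent translations might behave, not a statement about how ctranslate2 is implemented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def make_fake_translate(slots, work_seconds=0.2):
    """Return a stand-in for translate_batch that allows `slots` concurrent calls."""
    # Hypothetical model of a concurrency limit; NOT ctranslate2's actual mechanism.
    gate = threading.Semaphore(slots)

    def fake_translate(_tokens):
        with gate:
            # time.sleep releases the GIL, like a GIL-free native call would.
            time.sleep(work_seconds)
        return "done"

    return fake_translate


def time_parallel(fn, n_tasks):
    """Submit fn n_tasks times to a thread pool and return the elapsed seconds."""
    with ThreadPoolExecutor(max_workers=n_tasks) as executor:
        start = time.perf_counter()
        futures = [executor.submit(fn, ["tokens"]) for _ in range(n_tasks)]
        for f in futures:
            f.result()
        return time.perf_counter() - start


one_slot = time_parallel(make_fake_translate(slots=1), n_tasks=2)
two_slots = time_parallel(make_fake_translate(slots=2), n_tasks=2)
print(f"slots=1, 2 tasks: {one_slot:.2f}s")   # calls run one after the other
print(f"slots=2, 2 tasks: {two_slots:.2f}s")  # calls overlap
```

In this toy model, two tasks with one slot take roughly twice as long as two tasks with two slots, even though the GIL is released in both cases. That is the same pattern I see between inter_threads=1 and inter_threads=2, but I don't know whether it is the actual cause.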
Would anyone be able to help me gain some insight into what is causing this behavior?