Under what conditions should CTranslate2 GPU translation be faster than CPU

I am trying to self-host a LibreTranslate service on a GPU server, but the translation does not seem to be faster than on my CPU server, which surprises me.
My first hypothesis was that my LibreTranslate setup was not using the GPU, but when I monitor with a tool like nvtop while translating, it shows that the GPU is being used.
Maybe someone here has already worked on a similar setup and knows what I am doing wrong?

This is an issue someone submitted to the LibreTranslate forum. Based on the CTranslate2 benchmarks, I would expect GPU translation to be significantly faster than CPU translation. My best guess at what's happening here is that the GPU translations have higher throughput but no latency improvement, so the difference isn't noticeable if you're the only one using the server at the time. I haven't done much CTranslate2 inference on GPUs myself (because the CPU performance is so good :rocket:).
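
One way to check this theory is to measure single-request latency against batched throughput directly. Here is a minimal sketch using CTranslate2's Python API; the model directory and SentencePiece paths are placeholders you'd point at your own converted model:

```python
import time
import ctranslate2
import sentencepiece as spm

# Placeholder paths: a converted CTranslate2 model directory
# and its SentencePiece tokenizer.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda")
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

sentence = "Hello world, this is a latency test."
tokens = sp.encode(sentence, out_type=str)

# Single-sentence latency: one request at a time.
start = time.perf_counter()
for _ in range(32):
    translator.translate_batch([tokens])
latency = (time.perf_counter() - start) / 32

# Batched throughput: the same 32 sentences in one call,
# which is where a GPU is expected to pull ahead.
start = time.perf_counter()
translator.translate_batch([tokens] * 32)
batched = time.perf_counter() - start

print(f"avg single-sentence latency: {latency:.3f}s")
print(f"32-sentence batch total:     {batched:.3f}s")
```

If the batch finishes in roughly the same time as one sentence, that would support the throughput-without-latency explanation.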

The user ran end-to-end tests with a LibreTranslate instance. Do we know for sure that most of the processing time is spent in CTranslate2? What about other steps like the Stanza model?
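
One way to answer that would be to time the stages separately instead of end to end. A rough sketch, assuming a LibreTranslate-style pipeline where Stanza handles sentence segmentation before CTranslate2 runs (paths and the language code are placeholders):

```python
import time
import stanza
import ctranslate2
import sentencepiece as spm

# Requires the Stanza English models: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize")
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda")
sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

text = "Some long input text to translate. " * 50

# Stage 1: Stanza sentence segmentation (runs on CPU by default).
start = time.perf_counter()
sentences = [s.text for s in nlp(text).sentences]
stanza_time = time.perf_counter() - start

# Stage 2: tokenization plus CTranslate2 translation.
start = time.perf_counter()
batch = [sp.encode(s, out_type=str) for s in sentences]
translator.translate_batch(batch)
ct2_time = time.perf_counter() - start

print(f"Stanza segmentation:     {stanza_time:.3f}s")
print(f"CTranslate2 translation: {ct2_time:.3f}s")
```

If most of the wall-clock time lands in the first stage, the GPU can't help much no matter how fast the translation step is.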

When using CTranslate2 directly, a recent GPU is almost always faster than a CPU unless the model or the input is very small.
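
For reference, that device comparison is easy to run in isolation, bypassing LibreTranslate entirely. A minimal sketch, again with placeholder model paths:

```python
import time
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
tokens = sp.encode("A reasonably long test sentence to translate.", out_type=str)
batch = [tokens] * 64

for device in ("cpu", "cuda"):
    translator = ctranslate2.Translator("ende_ctranslate2/", device=device)
    translator.translate_batch(batch)  # warm-up run, excluded from timing
    start = time.perf_counter()
    translator.translate_batch(batch)
    print(f"{device}: {time.perf_counter() - start:.3f}s")
```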
