I am trying to self-host a LibreTranslate service on a GPU server, but the translation does not seem to be faster than on my CPU server, which surprises me.
My first hypothesis was that the GPU was not used by the LibreTranslate setup I made, but if I use a tool like nvtop while translating, it shows that the GPU is used.
Maybe someone here has already worked on a similar setup and knows what I am doing wrong?
This is an issue someone submitted to the LibreTranslate forum. Based on the CTranslate2 benchmarks, I would expect GPU translation to be significantly faster than CPU translation. My best guess is that the GPU gives higher throughput without improving the latency of a single request, so the difference isn’t noticeable if you’re the only one using the server at that time. I haven’t done much CTranslate2 inference on GPUs myself (because the CPU performance is so good).
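One way to check this guess would be to measure latency and throughput separately, once with a single client and once with several concurrent clients. Below is a minimal sketch of such a benchmark. The names `benchmark` and `fake_translate` are made up for illustration, and `fake_translate` is only a stand-in that sleeps for a fixed time; it is not how LibreTranslate or CTranslate2 actually behave.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark(translate_fn, texts, concurrency):
    """Return (mean latency per request in seconds, throughput in requests/second)."""

    def timed_call(text):
        # Time a single request from the client's point of view.
        start = time.perf_counter()
        translate_fn(text)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, texts))
    wall_time = time.perf_counter() - wall_start

    return sum(latencies) / len(latencies), len(texts) / wall_time


def fake_translate(text):
    # Hypothetical stand-in for a real translation request (assumption):
    # every call takes about 50 ms regardless of load.
    time.sleep(0.05)
    return text


if __name__ == "__main__":
    texts = ["Hello world"] * 16
    for concurrency in (1, 8):
        latency, throughput = benchmark(fake_translate, texts, concurrency)
        print(f"concurrency={concurrency}: "
              f"mean latency {latency:.3f} s, throughput {throughput:.1f} req/s")
```

To test a real server, `fake_translate` could be replaced with a function that sends each text to the LibreTranslate instance. If the guess above is right, the GPU server should show a similar latency to the CPU server at concurrency 1 but a clearly higher throughput as concurrency grows.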