I just added an option to offload matrix multiplication in the
Linear layers to the GPU:
This is an experimental feature, but it can yield a nice speedup for small batch sizes. For larger batch sizes it can be less efficient, as the input and output have to be transferred between the host and the device.
However, I plan to go further and explore quantized inference and INT8 matrix multiplication for faster host<->device transfer and computation. This is certainly tricky, but Google claims to do it for GNMT:
(see Section 6, Quantizable Model and Quantized Inference.)
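To give an idea of what is involved, here is a minimal sketch of symmetric per-tensor INT8 quantization with an INT8 matrix-vector product that accumulates in INT32 and rescales to float. This is an illustration only, not the scheme used by GNMT or planned for CTranslate; the `Quantized`, `quantize`, and `int8_gemv` names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for a quantized tensor: INT8 values plus the
// scale that maps floats onto the [-127, 127] integer range.
struct Quantized {
  std::vector<int8_t> data;
  float scale;  // dequantize by dividing the integer value by this scale
};

// Symmetric per-tensor quantization: scale = 127 / max(|x|).
Quantized quantize(const std::vector<float>& x) {
  float max_abs = 0.f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  float scale = max_abs > 0.f ? 127.f / max_abs : 1.f;
  Quantized q{std::vector<int8_t>(x.size()), scale};
  for (size_t i = 0; i < x.size(); ++i)
    q.data[i] = static_cast<int8_t>(std::lround(x[i] * scale));
  return q;
}

// INT8 matrix-vector product y = W x: multiply INT8 values, accumulate
// in INT32 to avoid overflow, then rescale the result back to float.
std::vector<float> int8_gemv(const Quantized& w, const Quantized& x,
                             size_t rows, size_t cols) {
  std::vector<float> y(rows);
  for (size_t r = 0; r < rows; ++r) {
    int32_t acc = 0;
    for (size_t c = 0; c < cols; ++c)
      acc += static_cast<int32_t>(w.data[r * cols + c]) * x.data[c];
    y[r] = static_cast<float>(acc) / (w.scale * x.scale);
  }
  return y;
}
```

Beyond the smaller host<->device transfers, the appeal is that the inner loop works on INT8 operands, which vectorized integer instructions can process much faster than FP32; the cost is a small, bounded rounding error from quantization.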
Hopefully we can support a similar approach and make CTranslate a fast inference engine. Ideas or contributions to achieve this are welcome.