I just added an option to offload matrix multiplication in the
Linear layers to the GPU:
This is an experimental feature, but it can yield a nice speedup for small batch sizes. For larger batch sizes it can be less efficient, as the input and output have to be transferred between the host and the device.
However, I plan to go further and explore quantized inference and INT8 matrix multiplication for faster host<->device transfer and computation. This is certainly tricky, but Google claims to do it for GNMT:
(see Section 6, Quantizable Model and Quantized Inference.)
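To give an idea of what is involved, here is a minimal sketch of symmetric per-tensor INT8 quantization with an INT8 matrix-vector product that accumulates in INT32 and rescales to float. This is an illustration only, not the scheme used by GNMT or planned for CTranslate; the `Quantized`, `quantize`, and `int8_gemv` names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container for a quantized tensor: INT8 values plus the
// scale that maps floats onto the [-127, 127] integer range.
struct Quantized {
  std::vector<int8_t> data;
  float scale;  // dequantize by dividing the integer value by this scale
};

// Symmetric per-tensor quantization: scale = 127 / max(|x|).
Quantized quantize(const std::vector<float>& x) {
  float max_abs = 0.f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  float scale = max_abs > 0.f ? 127.f / max_abs : 1.f;
  Quantized q{std::vector<int8_t>(x.size()), scale};
  for (size_t i = 0; i < x.size(); ++i)
    q.data[i] = static_cast<int8_t>(std::lround(x[i] * scale));
  return q;
}

// INT8 matrix-vector product y = W x: multiply INT8 values, accumulate
// in INT32 to avoid overflow, then rescale the result back to float.
std::vector<float> int8_gemv(const Quantized& w, const Quantized& x,
                             size_t rows, size_t cols) {
  std::vector<float> y(rows);
  for (size_t r = 0; r < rows; ++r) {
    int32_t acc = 0;
    for (size_t c = 0; c < cols; ++c)
      acc += static_cast<int32_t>(w.data[r * cols + c]) * x.data[c];
    y[r] = static_cast<float>(acc) / (w.scale * x.scale);
  }
  return y;
}
```

Beyond the smaller host<->device transfers, the appeal is that the inner loop works on INT8 operands, which vectorized integer instructions can process much faster than FP32; the cost is a small, bounded rounding error from quantization.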
Hopefully we can support a similar approach and make CTranslate a fast inference engine. Ideas or contributions to achieve this are welcome.