Fast CPU decoding


(jean.senellart) #1

Hello all!

It has been a long time: I have been a bit disconnected from the forum while working on some other projects. Sorry for that. I will do my best to catch up, and I wanted to share some updates in this post and the following ones…

First, we submitted a system to - on the CPU track. A draft of our system description is here: Our goal was to build a tiny but still well-performing model, and we achieved that by combining multiple techniques:

  • training a strong Transformer model, then distilling it into a simple RNN
  • minimization of the model - our tiny model is GRU-based, with a 2-layer encoder and a single-layer decoder. This is a very small and simple model but, thanks to distillation, it keeps very good performance
  • introduction of vocabulary mapping (vmap), which is a dynamic target vocabulary reduction. This is equivalent to the subdict option we introduced a long time ago in the Lua code (but never really used). We also essentially fine-tuned the vmap extraction process
  • introduction of quantization techniques
  • running everything with CTranslate
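To make the vocabulary mapping idea more concrete, here is a minimal sketch in Python. It is not the actual vmap code: the names `reduced_vocab_ids`, `restrict_output_layer`, `vmap`, and `always_keep`, and the toy data, are all hypothetical. The assumption is that the vmap is a table from source tokens to candidate target tokens, and that for each sentence only the matching rows of the decoder output layer are kept, so the final softmax runs over far fewer entries.

```python
from typing import Dict, List, Sequence, Tuple


def reduced_vocab_ids(
    source_tokens: Sequence[str],
    vmap: Dict[str, List[str]],
    always_keep: Sequence[str],
    target_vocab: Dict[str, int],
) -> List[int]:
    """Return the sorted target ids allowed for this source sentence."""
    # Start from tokens that must always be available (special and frequent ones).
    allowed = set(always_keep)
    # Add every target candidate listed for a source token of the sentence.
    for token in source_tokens:
        allowed.update(vmap.get(token, ()))
    # Drop candidates missing from the target vocabulary, then sort the ids.
    return sorted(target_vocab[t] for t in allowed if t in target_vocab)


def restrict_output_layer(
    weights: List[List[float]], bias: List[float], ids: Sequence[int]
) -> Tuple[List[List[float]], List[float]]:
    """Keep only the output-layer rows (one per target word) listed in ids."""
    return [weights[i] for i in ids], [bias[i] for i in ids]


# Toy data, purely illustrative (hypothetical vocabulary and mapping).
target_vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "the": 3,
                "cat": 4, "dog": 5, "house": 6, "sleeps": 7}
vmap = {"le": ["the"], "chat": ["cat"], "dort": ["sleeps"]}
always_keep = ["<s>", "</s>", "<unk>"]

ids = reduced_vocab_ids(["le", "chat", "dort"], vmap, always_keep, target_vocab)

# One row of 2 weights per target word; only the rows in `ids` are kept.
weights = [[float(i), float(i) + 0.5] for i in range(len(target_vocab))]
bias = [0.0] * len(target_vocab)
small_weights, small_bias = restrict_output_layer(weights, bias, ids)
```

In this toy case the output layer shrinks from 8 rows to 6. With a real vocabulary of tens of thousands of entries, the reduction per sentence would be much larger. The predicted index then has to be mapped back through `ids` to recover the real target word.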

The result was quite conclusive: we managed to get a system running at 1000 words per second on a single core (the fastest CPU submission) while keeping the score under control, with a model size of… 75Mb.
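Quantization, one of the techniques listed above, is one way to reduce both weight storage and compute cost. This post does not say which scheme was used, so the sketch below assumes plain symmetric 8-bit quantization with one scale per weight vector, purely as an illustration. The function names `quantize_int8` and `dequantize` are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] with a single scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid a zero scale when all weights are zero (or the list is empty).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight is stored in one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half a quantization step per weight.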

All of the features we used are already in the code or are on the way.

Among the other participants' submissions, however, there is also an impressive tiny Transformer implementation - probably based on this work: - already on our todo list!

As a heads-up, we are also currently working on a new version of CTranslate, which will extend its scope beyond Lua models and will come with these optimization techniques. @guillaumekln will come back to you soon with more good news!

