Inference of the Llama Language Model (and Alpaca, Vicuna, ...)

Dear Users,

As you all know, Meta has released a collection of “Foundation Language Models” ranging from 7B to 65B parameters. The license for these models is clearly non-commercial, and the same applies to all derivatives.

Nevertheless, the community has put a lot of work into building ChatGPT-like chatbots on top of these Llama models. There are two steps: 1) finetuning the LM on an instruction-based dataset, and 2) inference of the finetuned model with a fast tool (potentially behind a nice interface).

We will try to cover both within the OpenNMT ecosystem, but first it is necessary to explain the various formats of the model.

Original llama format:
If you have the chance to download the original files from Meta, you will get binary files corresponding to shards of the model (the smallest model, 7B, is a single shard; the largest, 65B, is 8 shards). Files are named “consolidated.0X.pth” (X being the shard number).

Hugging Face format:
A lot of work has been pushed to the HF hub (Alpaca, Vicuna, …). These projects convert the original format into the HF format, and most of them publish only the “delta weights” that need to be merged with the converted HF checkpoint. Some projects also allow converting the finetuned HF checkpoint back into the original Llama format.
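To illustrate what merging “delta weights” means, here is a hypothetical sketch. Real projects (e.g. Vicuna) do this on torch state_dicts loaded from disk; plain Python lists and made-up parameter names are used here so the example is self-contained.

```python
# Hypothetical sketch of merging "delta weights" into a base checkpoint.
# In practice both dicts would be torch state_dicts with tensor values;
# plain lists keep the example self-contained.

def merge_delta(base: dict, delta: dict) -> dict:
    """Element-wise add the delta weights to the base weights."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta checkpoints must have the same parameters")
    return {name: [b + d for b, d in zip(base[name], delta[name])]
            for name in base}

# Tiny fake checkpoints (parameter names are placeholders).
base = {"layers.0.weight": [0.1, -0.2], "layers.0.bias": [0.0, 0.5]}
delta = {"layers.0.weight": [0.05, 0.1], "layers.0.bias": [-0.1, 0.0]}
merged = merge_delta(base, delta)
print(merged)
```

Publishing only the deltas lets those projects avoid redistributing the original Llama weights, which the license does not allow.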

OpenNMT-py format:
We supply a tool to convert the original Llama format into the OpenNMT-py models format.

Now, how do we run fast inference?
One emerging tool, llama.cpp (by the same creator as whisper.cpp), has made a lot of noise on Twitter.

HOWEVER, we have CTranslate2, which is even faster (though not yet integrated into a UI) and usable through a REST API.

CTranslate2 can convert:

  1. the original Llama format
  2. Hugging face “transformers” models
  3. OpenNMT-py models
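For cases 2 and 3, CTranslate2 ships converter entry points. A sketch of typical invocations follows; the model paths and output directories are placeholders, and the exact options depend on your CTranslate2 version (the path for converting the original Llama format, case 1, varies as well, so it is not shown here).

```shell
# 2) Hugging Face "transformers" checkpoint -> CTranslate2
#    (path/to/llama-7b-hf is a placeholder for your converted HF checkpoint)
ct2-transformers-converter --model path/to/llama-7b-hf \
    --output_dir llama_ct2 --quantization int8

# 3) OpenNMT-py checkpoint -> CTranslate2
ct2-opennmt-py-converter --model_path llama-7b-onmt.pt \
    --output_dir llama_ct2
```

Quantization (e.g. int8) is optional but significantly reduces memory usage at inference time.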

In a nutshell, you can:

  • either do everything in OpenNMT (finetuning Llama in the next post) and perform inference with either OpenNMT-py or CTranslate2
  • or convert a finetuned checkpoint (e.g. from HF PEFT) and use CTranslate2 for inference
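For the CTranslate2 inference path, here is a minimal, untested sketch of generation from a converted model. The directory "llama_ct2/" and the "tokenizer.model" SentencePiece file are placeholder paths for the outputs of the conversion step above.

```python
# Untested sketch: text generation with CTranslate2 from a converted
# Llama model ("llama_ct2/" and "tokenizer.model" are placeholders).
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
generator = ctranslate2.Generator("llama_ct2/", device="cpu")

prompt = "What is OpenNMT?"
# Llama expects the BOS token in front of the prompt pieces.
tokens = ["<s>"] + sp.encode(prompt, out_type=str)

results = generator.generate_batch(
    [tokens],
    max_length=256,
    sampling_topk=10,
    sampling_temperature=0.8,
)
print(sp.decode(results[0].sequences_ids[0]))
```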

If someone is willing to develop a Gradio-based interface, PRs are always welcome!