Hi. I converted a Llama-2 model with CTranslate2 as shown here: Transformers — CTranslate2 3.17.1 documentation.
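For context, the conversion itself was done with the ct2-transformers-converter CLI, roughly like this (a sketch; the exact model ID, quantization, and output directory are illustrative placeholders, not necessarily what I used):

ct2-transformers-converter --model meta-llama/Llama-2-7b-chat-hf \
    --quantization float16 --output_dir llama-2-7b-chat-ct2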
How can I run inference on Llama-2 with CTranslate2 outside of an interactive chat? That is, perform a single inference like with other models: pass an input and get an output back.
Like with Falcon, for example:
import ctranslate2
import transformers
generator = ctranslate2.Generator("falcon-7b-instruct", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
prompt = (
"Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. "
"Giraftron believes all other animals are irrelevant when compared to the glorious majesty "
"of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], sampling_topk=10, max_length=200, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)
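In other words, I'd like to do something along these lines for Llama-2 (a rough sketch adapted from the Falcon snippet above; the converted model directory name and the [INST] ... [/INST] prompt wrapping are my assumptions based on the Llama-2 chat template, not something I've verified against CTranslate2):

import ctranslate2
import transformers

# Load the converted Llama-2 model (directory name is a placeholder)
generator = ctranslate2.Generator("llama-2-7b-chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Llama-2 chat models expect the instruction wrapped in [INST] tags
# (assumption based on the Llama-2 chat prompt format)
prompt = "[INST] What is the capital of France? [/INST]"

# Tokenize the prompt and generate a completion, excluding the prompt
# tokens from the returned sequence
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens],
    sampling_topk=10,
    max_length=200,
    include_prompt_in_result=False,
)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)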
Thank you.