Hi. I converted a Llama-2 model with CTranslate2 as shown here: Transformers — CTranslate2 3.17.1 documentation.
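For context, the conversion itself was done with the ct2-transformers-converter CLI, roughly like this (a sketch; the exact model ID, quantization, and output directory are illustrative placeholders, not necessarily what I used):

ct2-transformers-converter --model meta-llama/Llama-2-7b-chat-hf \
    --quantization float16 --output_dir llama-2-7b-chat-ct2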
How can I run inference on Llama-2 with CTranslate2 outside of an interactive chat? That is, perform a single inference like with other models: pass an input and get an output back.
Like with Falcon, for example:
import ctranslate2
import transformers
generator = ctranslate2.Generator("falcon-7b-instruct", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
prompt = (
"Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. "
"Giraftron believes all other animals are irrelevant when compared to the glorious majesty "
"of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:"
)
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], sampling_topk=10, max_length=200, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)
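In other words, I'd like to do something along these lines for Llama-2 (a rough sketch adapted from the Falcon snippet above; the converted model directory name and the [INST] ... [/INST] prompt wrapping are my assumptions based on the Llama-2 chat template, not something I've verified against CTranslate2):

import ctranslate2
import transformers

# Load the converted Llama-2 model (directory name is a placeholder)
generator = ctranslate2.Generator("llama-2-7b-chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Llama-2 chat models expect the instruction wrapped in [INST] tags
# (assumption based on the Llama-2 chat prompt format)
prompt = "[INST] What is the capital of France? [/INST]"

# Tokenize the prompt and generate a completion, excluding the prompt
# tokens from the returned sequence
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch(
    [tokens],
    sampling_topk=10,
    max_length=200,
    include_prompt_in_result=False,
)
output = tokenizer.decode(results[0].sequences_ids[0])
print(output)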
Thank you.