Dear community,
I am happy to announce the release of v3.3, shortly after v3.2.
New in v3.3
- PyTorch 2.0.1
- MMLU eval benchmark for LLMs (OpenNMT-py/eval_llm/MMLU/readme.md at master · OpenNMT/OpenNMT-py · GitHub)
- Inference in 4-bit / 8-bit to support inference of bigger models
- Preliminary safetensors support (faster loading, lower RAM usage)
New in v3.2
- Much faster model building at startup (not noticeable for small models but huge for LLMs)
- Quantization support for 8-bit and 4-bit (FP4/NF4) with the bitsandbytes backend; GPTQ coming soon.
- LoRA adapter layers can now be quantized, just like the other `nn.Linear` layers (see the config sketch after this list).
- Model loading is much faster, with low CPU-RAM usage (it will be zero when using safetensors checkpoints).
- Add multi-query attention (available on Falcon 7B/40B), which enables faster inference
- Add parallel residual attention (a feature required for the Falcon and GPT-J models)
- Feed-forward bias is now optional (same as for the QKV `nn.Linear` layers)
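To make the LoRA + quantization options concrete, here is a minimal finetuning sketch in the YAML config format. The option names (`lora_layers`, `lora_rank`, `lora_alpha`, `lora_dropout`, `quant_layers`, `quant_type`) and the layer names reflect my reading of the current options and should be double-checked against the docs for your version; the paths are placeholders.

```yaml
# Sketch only -- option and layer names below are assumptions to verify
# against the docs; paths are placeholders.
train_from: "llama7B/llama7B-onmt.pt"
save_model: "llama7B/llama7B-lora"

# LoRA adapters on the attention nn.Linear layers
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.05

# quantize the same layers in 4-bit NF4 with the bitsandbytes backend
quant_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"
```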
For NMT users, I encourage you to try `multiquery: true` and `add_ffnbias: false` in the YAML config file (see the sketch below). It should speed up both training and inference without losing quality. Feedback is very much appreciated.
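As an illustration, here is a hedged sketch of the relevant part of the model section; the surrounding options are ordinary placeholder settings, only the last two lines are the new flags.

```yaml
encoder_type: transformer
decoder_type: transformer
hidden_size: 512
heads: 8
multiquery: true     # single shared key/value head across all query heads
add_ffnbias: false   # drop the feed-forward bias (as for the QKV nn.Linear layers)
```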
For LLM users, here is the list of models supported on a single 24GB RTX card:
- Llama (GitHub - facebookresearch/llama: Inference code for LLaMA models) 7B, 13B
- MPT-7B (GitHub - mosaicml/llm-foundry: LLM training code for MosaicML foundation models)
- Open_llama (GitHub - openlm-research/open_llama: OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset) 3B, 7B; 13B will be available soon
- Redpajama 7B (togethercomputer/RedPajama-INCITE-7B-Base · Hugging Face)
- Falcon 7B (tiiuae/falcon-7b · Hugging Face)
Bigger ones (33B, 40B, 65B) should work on an A6000 48GB or an A100 80GB.
All converters are in OpenNMT-py/tools.
For Llama, the converter works from the original sharded checkpoints. All others convert from the Hugging Face format (either the HF hub directly or a local directory copy).
Here are the architecture differences between the models:
Positional encoding:
- Llama, Open_llama, Redpajama, Falcon use rotary embeddings
- MPT uses ALiBi
Activation:
- Llama, Open_llama: SiLU (along with SiLU comes an extra FF layer)
- Redpajama, MPT, Falcon: GELU
Layer normalization:
- Llama, Open_llama: RMS
- MPT, Redpajama, Falcon: standard
Attention:
- Llama, Open_llama, MPT, Redpajama: classic multi-head
- Falcon: multi-query + parallel residual
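To show how these differences map onto the config, here is a hedged sketch of the decoder options a Falcon-style checkpoint would end up with. The option names (`max_relative_positions`, `pos_ffn_activation_fn`, `layer_norm`, `parallel_residual`) and values are my assumptions about what the converters emit, so verify them against the config actually produced by the tools above.

```yaml
# Assumed option names/values for a Falcon-style model -- verify against the
# config written by the converter in OpenNMT-py/tools.
max_relative_positions: -1      # rotary embeddings (MPT uses ALiBi instead)
pos_ffn_activation_fn: "gelu"   # Llama/Open_llama use "silu" (with the extra FF layer)
layer_norm: "standard"          # Llama/Open_llama use "rms"
multiquery: true                # Falcon only
parallel_residual: true         # Falcon only
```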
They do not all use the same model sizes and vocabs.
Llama and Open_llama each use their own SentencePiece tokenizer and specific vocab:
- Llama tokenizer: comes with the model download
- Llama vocab (adjusted for OpenNMT-py): https://opennmt-models.s3.amazonaws.com/llama/vocab-llama.txt
- Open_llama tokenizer: https://opennmt-models.s3.amazonaws.com/openllama/tokenizer.model
- Open_llama vocab: https://opennmt-models.s3.amazonaws.com/openllama/openllama.vocab
MPT, Redpajama, and Falcon use BPE:
- MPT BPE model: https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt-model.bpe
- MPT vocab: https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt.vocab
- Redpajama BPE model: https://opennmt-models.s3.amazonaws.com/redpajama/redpajama-model.bpe
- Redpajama vocab: https://opennmt-models.s3.amazonaws.com/redpajama/redpajama.vocab
- Falcon BPE model: https://opennmt-models.s3.amazonaws.com/falcon/falcon-model.bpe
- Falcon vocab: https://opennmt-models.s3.amazonaws.com/falcon/falcon.vocab
All BPE models / vocabs have been extracted from the HF tokenizers and slightly adjusted to match the onmt-py architecture in terms of special tokens (eos, bos, unk, pad). Also, the BPE vocabs of these models follow the "GPT-2 pretokenization / mapping" to allow byte / special character encoding (option in onmt-py: `gpt2_pretok: true`; see the sketch below).
For more detail, see gpt-2/src/encoder.py at master · openai/gpt-2 · GitHub
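As an illustration, a data/tokenization section for one of the BPE models could look roughly like this; `gpt2_pretok` is the option mentioned above, the other names follow the usual onmt_tokenize transform options, and the paths are placeholders pointing at the files linked above.

```yaml
# Placeholder paths -- point these at the files downloaded from the links above.
src_vocab: "falcon/falcon.vocab"
tgt_vocab: "falcon/falcon.vocab"
transforms: [onmt_tokenize]
src_subword_type: bpe
src_subword_model: "falcon/falcon-model.bpe"
tgt_subword_type: bpe
tgt_subword_model: "falcon/falcon-model.bpe"
gpt2_pretok: true   # GPT-2 byte-level pretokenization / mapping
```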
Please make sure you:
- open issues for BUGS on GitHub
- open a new post on this forum for questions
As a reminder, for inference with CTranslate2 you can:
- convert any of these models from their original HF format directly into CT2 (which also supports some extra models like BLOOM, OPT, CodeGen), but in this scenario you need to finetune outside the OpenNMT ecosystem;
- convert any OpenNMT-py finetuned model into CT2 to run very fast inference.
ENJOY!