OpenNMT-py v3.3 released - following 3.2 with plenty of new features

Dear community,

I am happy to release v3.3, just after v3.2.

New in v3.3 / v3.2:

  • Much faster model building at startup (not noticeable for small models, but huge for LLMs)
  • Quantization support for 8-bit and 4-bit (FP4/NF4) with the bitsandbytes backend; GPTQ coming soon.
  • LoRA adapter layers can now be quantized, as well as other nn.Linear layers (a config sketch follows this list).
  • Model loading is much faster, with low CPU RAM usage (zero when using safetensors checkpoints).
  • Added multi-query attention (used by Falcon 7B/40B), which enables faster inference.
  • Added parallel residual attention (required for the Falcon and GPT-J models).
  • Feed-forward bias is now optional (same as for the QKV nn.Linear layers).
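
For reference, here is a minimal sketch of how the new LoRA + quantization options can be combined in a finetuning yaml config. Only the feature names above come from this release note; the exact option keys and layer names below are my best guess and should be checked against opts.py / the docs:

    # LoRA adapters on selected nn.Linear layers (option names to be verified)
    lora_layers: ['linear_values', 'linear_query']
    lora_rank: 8
    lora_alpha: 16
    lora_dropout: 0.05

    # quantize the other nn.Linear layers with the bitsandbytes backend
    quant_layers: ['w_1', 'w_2', 'linear_keys', 'final_linear']
    quant_type: "bnb_NF4"    # or "bnb_FP4" / "bnb_8bit"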

For NMT users, I encourage you to try multiquery: true and add_ffnbias: false in the yaml config file (see the excerpt below). It should speed up both training and inference without losing quality. Feedback is very much appreciated.
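
A minimal excerpt of such a config (these two keys are the only change; the rest of your usual NMT training yaml stays as-is):

    # transformer training config excerpt
    multiquery: true      # one shared key/value head across all query heads
    add_ffnbias: false    # no bias on the feed-forward nn.Linear layers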

For LLM users, the models detailed below are supported on a single RTX 24GB card.

Bigger ones (33B, 40B, 65B) should work on an A6000 48GB or an A100 80GB.

All converters are in OpenNMT-py/tools.

For Llama, the converter works from the original sharded checkpoints. All the others convert from the Hugging Face format (either directly from the HF hub or from a local directory copy).

Here are the architecture differences between the models (a yaml sketch mapping them to onmt-py options follows these lists):

Positional encoding:
Llama, Open_llama, Redpajama, Falcon use Rotary Embeddings
MPT uses ALiBi

Activation:
Llama, Open_llama: SiLU
Redpajama, MPT, Falcon: GELU
(Along with SiLU comes an extra FF layer)

Layer Normalization:
Llama, Open_llama: rms
MPT, Redpajama, Falcon: Standard

Attention:
Llama, Open_llama, MPT, Redpajama: classic multi-head
Falcon: multi-query + parallel residual
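
As an illustration, here is roughly how these differences map to an onmt-py yaml config, taking Falcon as the example. Apart from multiquery (mentioned above), the option names are my best guess from the current opts, so verify them before use:

    # Falcon-style decoder options (sketch, names to be verified)
    max_relative_positions: -1      # rotary embeddings (MPT would use ALiBi instead)
    pos_ffn_activation_fn: 'gelu'   # Llama / Open_llama: 'silu' (with the extra FF layer)
    layer_norm: 'standard'          # Llama / Open_llama: 'rms'
    multiquery: true                # Falcon only
    parallel_residual: true         # Falcon only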

They also differ in model sizes and vocabularies.

Llama and Open_llama each use their own SentencePiece tokenizer and specific vocab.
The Llama tokenizer comes with the model download.
Llama vocab (adjusted for Onmt-py): https://opennmt-models.s3.amazonaws.com/llama/vocab-llama.txt
Open_llama tokenizer: https://opennmt-models.s3.amazonaws.com/openllama/tokenizer.model
Open_llama vocab: https://opennmt-models.s3.amazonaws.com/openllama/openllama.vocab

MPT, Redpajama, Falcon use BPE
MPT bpe model: https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt-model.bpe
MPT vocab: https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt.vocab
Redpajama bpe model: https://opennmt-models.s3.amazonaws.com/redpajama/redpajama-model.bpe
Redpajama vocab: https://opennmt-models.s3.amazonaws.com/redpajama/redpajama.vocab
Falcon bpe model: https://opennmt-models.s3.amazonaws.com/falcon/falcon-model.bpe
Falcon vocab: https://opennmt-models.s3.amazonaws.com/falcon/falcon.vocab

All BPE models / vocabs have been extracted from the HF tokenizers and slightly adjusted to match the onmt-py architecture in terms of special tokens (eos, bos, unk, pad). The BPE vocabs of these models also follow the “gpt2 pretokenization / mapping” to allow byte / special-character encoding (option in onmt-py: gpt2_pretok: true).
For more detail, see gpt-2/src/encoder.py in the openai/gpt-2 GitHub repository.
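
For instance, the vocab / tokenization section of an MPT finetuning config could look like this sketch (files taken from the links above; apart from gpt2_pretok, the transform keys are assumptions to be checked against the transforms documentation):

    # vocab / BPE tokenization sketch for MPT
    src_vocab: "mpt.vocab"
    tgt_vocab: "mpt.vocab"
    transforms: [onmt_tokenize]
    src_subword_type: bpe
    src_subword_model: "mpt-model.bpe"
    tgt_subword_type: bpe
    tgt_subword_model: "mpt-model.bpe"
    gpt2_pretok: true     # enable the gpt2 pretokenization / byte mapping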

Please make sure you:

  1. open issues for BUGS on GitHub
  2. open a new post on this forum for questions

As a reminder, for inference with CTranslate2 you can:

  • convert any of those models from their original HF format directly to CT2 (which also supports some extra models such as BLOOM, OPT, CodeGen), but in this scenario you need to finetune outside the onmt ecosystem;
  • convert any onmt-py finetuned model to CT2 to run very fast inference.

ENJOY!


Great work!
Could you share a sample yaml file, like replicate_vicuna.yaml, for the different models? Even this yaml file for LLaMA can’t be found in the official documentation (other than a mention) or on GitHub.

done: