Finetuning Llama-7B/13B or MosaicML MPT-7B - Reproduce Vicuna / Alpaca

I parsed the Open Assistant Dataset and made it OpenNMT-py friendly for finetuning.

https://opennmt-models.s3.amazonaws.com/llama/osst1.flattened.txt
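
The gist of the flattening is something like this (a simplified sketch, not the exact script I used; the field names follow the OpenAssistant/oasst1 dataset on the HF hub, and the single-line layout with a ｟newline｠ placeholder is an assumption):

# Simplified sketch only -- not the exact preprocessing script.
# It pairs each assistant reply with its parent prompt and writes one example
# per line, replacing real newlines with a placeholder so that OpenNMT-py can
# treat every line as a single training segment.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
messages = {m["message_id"]: m for m in ds}

with open("oasst1_flattened.txt", "w", encoding="utf-8") as out:
    for m in ds:
        if m["role"] != "assistant" or m["parent_id"] not in messages:
            continue
        prompt = messages[m["parent_id"]]["text"]
        reply = m["text"]
        line = f"{prompt}｟newline｠{reply}".replace("\n", "｟newline｠")
        out.write(line + "\n")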

Can be tested now with:
- Llama
- MPT-7B
- Open Llama
- RedPajama

I have not tested any finetuning with this dataset, but 1) it is fully open source and 2) the Open Assistant project reports good results.


Can you give a configuration example for training RedPajama 3B?

Llama 13B fits on an RTX card with 24GB at a 512 context length:

[2023-05-31 16:17:05,778 INFO] Step 10/ 4000; acc: 68.7; ppl:   3.6; xent: 1.3; lr: 0.00020; sents:     602; bsz:  402/ 402/ 2; 812/812 tok/s;    159 sec;
[2023-05-31 16:18:53,812 INFO] Step 20/ 4000; acc: 73.1; ppl:   2.8; xent: 1.0; lr: 0.00020; sents:     505; bsz:  392/ 392/ 2; 1161/1161 tok/s;    267 sec;
[2023-05-31 16:20:37,474 INFO] Step 30/ 4000; acc: 73.8; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     506; bsz:  395/ 395/ 2; 1219/1219 tok/s;    370 sec;
[2023-05-31 16:22:16,611 INFO] Step 40/ 4000; acc: 73.3; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     460; bsz:  384/ 384/ 1; 1240/1240 tok/s;    469 sec;
[2023-05-31 16:23:56,426 INFO] Step 50/ 4000; acc: 74.8; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     542; bsz:  394/ 394/ 2; 1263/1263 tok/s;    569 sec;
[2023-05-31 16:25:38,707 INFO] Step 60/ 4000; acc: 74.6; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     510; bsz:  395/ 395/ 2; 1235/1235 tok/s;    671 sec;
[2023-05-31 16:27:19,223 INFO] Step 70/ 4000; acc: 74.5; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     444; bsz:  383/ 383/ 1; 1219/1219 tok/s;    772 sec;
[2023-05-31 16:29:02,811 INFO] Step 80/ 4000; acc: 75.2; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     550; bsz:  394/ 394/ 2; 1217/1217 tok/s;    876 sec;
[2023-05-31 16:30:45,309 INFO] Step 90/ 4000; acc: 75.1; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     559; bsz:  393/ 393/ 2; 1226/1226 tok/s;    978 sec;
[2023-05-31 16:32:27,957 INFO] Step 100/ 4000; acc: 75.5; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     495; bsz:  396/ 396/ 2; 1234/1234 tok/s;   1081 sec;

Hi, can you explain how to get the “mpt7b-onmt.pt” file in this yaml file? Thank you!

I got it by doing something like this for the chat version:

python

import transformers

# download the HF checkpoint (MPT needs trust_remote_code) and save it locally
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-chat',
  trust_remote_code=True
)
model.save_pretrained("mpt_chat")

quit()

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mpt_chat \
    --vocab_file mpt.vocab \
    --output mpt_chat_onmt.pt

You can also directly do:

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mosaicml/mpt-7b \
    --vocab_file mpt.vocab \
    --output mpt7b-onmt.pt

Have a look at the convert script; it uses the same AutoModelForCausalLM.from_pretrained().

This can take either a model id, in which case it will pull the files directly from the HF hub, or a local directory if you have previously downloaded the .bin / .json files.
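
To make the two forms concrete (a minimal sketch, reusing the same model id and local folder as above):

import transformers

# 1) a model id: the files are pulled directly from the HF hub
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b", trust_remote_code=True
)

# 2) a local directory that already contains the .bin / .json files
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mpt_chat", trust_remote_code=True
)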

Is the converted file (e.g. mpt7B_onmt.pt) different from "mpt7B-vicuna-onmt" like below?

When I run python3 OpenNMT-py/onmt/bin/train.py -config Finetune/mpt/mpt_vicuna.yaml, I get this error:

Traceback (most recent call last):
  File "OpenNMT-py/onmt/bin/train.py", line 71, in <module>
    main()
  File "OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/root/OpenNMT-py/onmt/train_single.py", line 196, in main
    optim = Optimizer.from_opt(model, opt, checkpoint=checkpoint)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 274, in from_opt
    build_torch_optimizer(model, optim_opt),
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 76, in build_torch_optimizer
    optimizer = FusedAdam(params, lr=opt.learning_rate, betas=betas)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 635, in __init__
    fused_adam_cuda = importlib.import_module("fused_adam_cuda")
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.8/site-packages/fused_adam_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _Z13__THCudaCheck9cudaErrorPKci

Finetune/mpt/mpt_vicuna.yaml is the same as the MPT yaml above.

Can you help me solve this?

Thanks a lot!

You need to install apex, like here.

This will give you access to fusedadam, which is by far the best optimizer in terms of speed and memory footprint. I compared it to Adam (even with fused=True).

[and just last night with bnb.optim.Adam8bit (paged or not)]
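
A quick way to check your apex build afterwards (just a suggestion, not part of the tutorial) is to retry the exact import that failed in your traceback:

import importlib

# OpenNMT-py's FusedAdam wrapper imports this module; it only loads if apex
# was built with its CUDA extensions against your current PyTorch.
try:
    importlib.import_module("fused_adam_cuda")
    print("apex CUDA extensions found: fusedadam will work")
except ImportError as err:
    print("apex is missing or was built against another torch:", err)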


How did you do that? I cannot fit it on my hardware; it seems that I do not have enough RAM. I have 63GB of RAM, and train.py gets killed when I run it:

[2023-06-01 15:19:38,254 INFO] 8bit compression of layer w_2                                                                                                                     
[2023-06-01 15:19:54,795 INFO] 8bit compression of layer w_3                                                                                                                     
[2023-06-01 15:20:11,264 INFO] Adding LoRa layers for linear_values                        
[2023-06-01 15:20:25,309 INFO] Adding LoRa layers for linear_query                         
[2023-06-01 15:20:39,329 INFO] Adding LoRa layers for linear_keys                          
Killed      

With these LoRa settings:


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

So it gets killed between "Adding LoRa layers for linear_keys" and "Adding LoRa layers for final_linear".

I am using the latest OpenNMT-py commit. Is there a way to start moving things to the GPU before it overloads my RAM?

I do not know if something else could be overloading my RAM. My full config is this one:

############################
#### Finetuning corpora ####
############################
data:
    alpaca:
        path_src: dataAI/alpaca_clean_no_input.txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10
    open_assistant:
        path_src: dataAI/osst1_flattened_txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10


################################
#### Subword and vocabulary #### 
################################
src_subword_model: llama/tokenizer.model
tgt_subword_model: llama/tokenizer.model

src_vocab: vocab.txt
vocab_size_multiple: 1
share_vocab: True

################
#### Filter ####
################
src_seq_length: 1024
tgt_seq_length: 1024

# silently ignore empty lines in the data
skip_empty_level: silent

#######################
####  General opts ####
#######################
train_from: llama13B-vicuna-onmt.pt
save_model: finetuned_llama13B/llama13B-vicuna-onmt
override_opts: true # to apply LoRa
report_every: 1
save_checkpoint_steps: 1000

save_data:  finetuned_llama13B/samples
dump_samples: false
n_sample: 0

tensorboard: true
tensorboard_log_dir: finetuned_llama13B/logs

#################
##### Model #####
#################

# Recall of the overridden opts
model_task: lm
decoder_type: transformer_lm
add_qkvbias: false
dec_layers: 40
heads: 40
hidden_size: 5120
word_vec_size: 5120
transformer_ff: 11008
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

model_dtype: fp16
seed: 1234
param_init: 0.0
param_init_glorot: true
batch_size: 512
normalization: tokens
valid_batch_size: 256
train_steps: 4000
optim: fusedadam
max_grad_norm: 0.0
adam_beta2: 0.998
learning_rate : 2e-05

# Llama compatibility
layer_norm: rms
pos_ffn_activation_fn: 'silu'
max_relative_positions: -1
position_encoding: false


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# 8bit compression
quant_layers: ['w_1', 'w_2', 'w_3']

# GPU 
world_size: 1
gpu_ranks: [0]

# Optimization
accum_count: [32]
accum_steps: [0]
batch_size: 512 
valid_batch_size: 512
batch_type: tokens
batch_size_multiple: 1
bucket_size: 1024

For the 13B you need to wait for the pending PR.


After a few hiccups, the training works for me on a V100. Do you know how I can utilize both GPUs on a 2 x V100 cluster? Currently I see only one GPU being utilized. I have tried setting "-gpuid 0 1" in the script as well.

Edit: Just encountered OOM on 1 GPU.

Add this in the yaml (this example is for 4 GPUs; for your 2 x V100 use world_size: 2 and gpu_ranks: [0, 1]):

world_size: 4 
gpu_ranks: [0,1,2,3]

If you wait for the next PR, it will be far more optimized.

Also, what were the hiccups, so that I can update the tutorial?

Thanks for the quick response. Unfortunately for me, it goes OOM even when using both GPUs. Do you think the upcoming PR will be able to resolve that?

You can now git pull and use these options in the yaml file:

#4/8bit
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# Checkpointing
#use_ckpting: ['ffn', 'lora']

You should not have to activate the last option; the 13B trains just fine with a 512 sequence length and batch size (on an RTX 24GB card). If you are using a V100 with 16GB, you need to stick to the 7B model.

Yes, I am using MPT 7B itself. Unfortunately, I run OOM with the updated code as well. Following is the config:

# Corpus opts:
data:
    alpaca:
        path_src: "onmt-mpt/alpaca_clean.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10
    sharegpt:
        path_src: "onmt-mpt/sharegpt.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10
        

### Transform related opts:
#### Subword
src_subword_type: bpe
src_subword_model: "onmt-mpt/mpt-model.bpe"
src_onmttok_kwargs: '{"mode": "conservative"}'

tgt_subword_type: bpe
tgt_subword_model: "onmt-mpt/mpt-model.bpe"
tgt_onmttok_kwargs: '{"mode": "conservative"}'
gpt2_pretok: true

#### Filter
src_seq_length: 512
tgt_seq_length: 512

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
train_from: "onmt-mpt/mpt_chat_onmt.pt"
save_model: "./mpt7B-vicuna-onmt"
keep_checkpoint: 10
save_checkpoint_steps: 400
seed: 1234
report_every: 10
train_steps: 4000
valid_steps: 400

# Batching
bucket_size: 32768
#bucket_size: 1
num_workers: 2
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 896 
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32]
accum_steps: [0]

override_opts: true  # CAREFUL: this requires all settings to be defined below

share_vocab: true
save_data: "/dataAI"
src_vocab: "onmt-mpt/mpt.vocab"
src_vocab_size: 50432
tgt_vocab_size: 50432
default_specials: ['</s>', '<blank>']

decoder_start_token: '</s>'
# Optimization
model_dtype: "fp16"
apex_opt_level: ""
optim: "fusedadam"
learning_rate: 0.0002
warmup_steps: 100
decay_method: "none"
#learning_rate_decay: 0.98
#start_decay_steps: 100
#decay_steps: 10
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.0
param_init: 0
param_init_glorot: true
normalization: "tokens"

#4/8bit
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"
#
##LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# Model
model_task: lm
decoder_type: transformer_lm
layer_norm: standard
pos_ffn_activation_fn: 'gelu'
max_relative_positions: -2
position_encoding: false
add_qkvbias: false
dec_layers: 32
heads: 32
hidden_size: 4096
word_vec_size: 4096
transformer_ff: 16384
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

If you are using a 16GB V100, just reduce your batch_size from 896 to 512, or lower if that is still too high.

I keep having the same RAM problem at the same spot when I load the model; it does not matter what quantization I use or whether I quantize the LoRa weights or not.

Can you monitor your CPU RAM / swap disk while this is happening?
I have 64GB of RAM too, and 256GB of swap.
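
If it is easier than watching htop, you can also log it from Python with something like this (psutil is just my suggestion here, not something OpenNMT-py requires):

import psutil

# print total/available CPU RAM and total/free swap, in GiB
vm, sw = psutil.virtual_memory(), psutil.swap_memory()
print("RAM  total/available:", vm.total // 2**30, "/", vm.available // 2**30, "GiB")
print("Swap total/free     :", sw.total // 2**30, "/", sw.free // 2**30, "GiB")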

It seems that I have no swap memory at all.

I saw that it reaches 62.8GB when it tries to add the final_linear LoRa layer:

[2023-06-01 15:19:38,254 INFO] 8bit compression of layer w_2                                                                                                                     
[2023-06-01 15:19:54,795 INFO] 8bit compression of layer w_3                                                                                                                     
[2023-06-01 15:20:11,264 INFO] Adding LoRa layers for linear_values                        
[2023-06-01 15:20:25,309 INFO] Adding LoRa layers for linear_query                         
[2023-06-01 15:20:39,329 INFO] Adding LoRa layers for linear_keys                          
Killed      

It does not matter what quantization I use or whether I quantize the LoRa weights or not. It also does not matter whether I use the new branch or not (I see different logging with the new branch, but it gets killed at the same point of execution).

I know, this is just because the checkpoint is not sharded, so as it stands the CPU RAM needs to fit the state_dict plus the model.
I will change this in the coming days, but in the meantime an easy fix is to add some swap space.
