Finetuning Llama-7B/13B or MosaicML MPT-7B - Reproduce Vicuna / Alpaca

I parsed the Open Assistant Dataset and made it OpenNMT-py friendly for finetuning.

https://opennmt-models.s3.amazonaws.com/llama/osst1.flattened.txt
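
The gist of the flattening is something like this (a simplified sketch, not the exact script I used; the field names follow the OpenAssistant/oasst1 dataset on the HF hub, and the single-line layout with a ｟newline｠ placeholder is an assumption):

# Simplified sketch only -- not the exact preprocessing script.
# It pairs each assistant reply with its parent prompt and writes one example
# per line, replacing real newlines with a placeholder so that OpenNMT-py can
# treat every line as a single training segment.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
messages = {m["message_id"]: m for m in ds}

with open("oasst1_flattened.txt", "w", encoding="utf-8") as out:
    for m in ds:
        if m["role"] != "assistant" or m["parent_id"] not in messages:
            continue
        prompt = messages[m["parent_id"]]["text"]
        reply = m["text"]
        line = f"{prompt}｟newline｠{reply}".replace("\n", "｟newline｠")
        out.write(line + "\n")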

Can be tested now with:
- Llama
- MPT-7B
- Open Llama
- RedPajama

I have not tested any finetuning with this dataset, but 1) it is fully open source and 2) the Open Assistant project reports good results.


Can you give a configuration example for training RedPajama 3B?

Llama 13B fits on an RTX card with 24GB at a 512 context length:

[2023-05-31 16:17:05,778 INFO] Step 10/ 4000; acc: 68.7; ppl:   3.6; xent: 1.3; lr: 0.00020; sents:     602; bsz:  402/ 402/ 2; 812/812 tok/s;    159 sec;
[2023-05-31 16:18:53,812 INFO] Step 20/ 4000; acc: 73.1; ppl:   2.8; xent: 1.0; lr: 0.00020; sents:     505; bsz:  392/ 392/ 2; 1161/1161 tok/s;    267 sec;
[2023-05-31 16:20:37,474 INFO] Step 30/ 4000; acc: 73.8; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     506; bsz:  395/ 395/ 2; 1219/1219 tok/s;    370 sec;
[2023-05-31 16:22:16,611 INFO] Step 40/ 4000; acc: 73.3; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     460; bsz:  384/ 384/ 1; 1240/1240 tok/s;    469 sec;
[2023-05-31 16:23:56,426 INFO] Step 50/ 4000; acc: 74.8; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     542; bsz:  394/ 394/ 2; 1263/1263 tok/s;    569 sec;
[2023-05-31 16:25:38,707 INFO] Step 60/ 4000; acc: 74.6; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     510; bsz:  395/ 395/ 2; 1235/1235 tok/s;    671 sec;
[2023-05-31 16:27:19,223 INFO] Step 70/ 4000; acc: 74.5; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     444; bsz:  383/ 383/ 1; 1219/1219 tok/s;    772 sec;
[2023-05-31 16:29:02,811 INFO] Step 80/ 4000; acc: 75.2; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     550; bsz:  394/ 394/ 2; 1217/1217 tok/s;    876 sec;
[2023-05-31 16:30:45,309 INFO] Step 90/ 4000; acc: 75.1; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     559; bsz:  393/ 393/ 2; 1226/1226 tok/s;    978 sec;
[2023-05-31 16:32:27,957 INFO] Step 100/ 4000; acc: 75.5; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     495; bsz:  396/ 396/ 2; 1234/1234 tok/s;   1081 sec;

Hi, can you explain how to get the “mpt7b-onmt.pt” file in this yaml file? Thank you!

I got it by doing something like this for the chat version:

python

import transformers

# download the HF checkpoint (MPT needs trust_remote_code) and save it locally
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-chat',
  trust_remote_code=True
)
model.save_pretrained("mpt_chat")

quit()

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mpt_chat \
    --vocab_file mpt.vocab \
    --output mpt_chat_onmt.pt

You can also directly do:

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mosaicml/mpt-7b \
    --vocab_file mpt.vocab \
    --output mpt7b-onmt.pt

Have a look at the convert script; it uses the same AutoModelForCausalLM.from_pretrained().

This can take either a model id, in which case it will pull the files directly from the HF hub, or a local directory if you have previously downloaded the .bin / .json files.
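
To make the two forms concrete (a minimal sketch, reusing the same model id and local folder as above):

import transformers

# 1) a model id: the files are pulled directly from the HF hub
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b", trust_remote_code=True
)

# 2) a local directory that already contains the .bin / .json files
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mpt_chat", trust_remote_code=True
)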

Is the converted file (e.g. mpt7B_onmt.pt) different from "mpt7B-vicuna-onmt" like below?

When I run python3 OpenNMT-py/onmt/bin/train.py -config Finetune/mpt/mpt_vicuna.yaml, I get this error:

Traceback (most recent call last):
  File "OpenNMT-py/onmt/bin/train.py", line 71, in <module>
    main()
  File "OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/root/OpenNMT-py/onmt/train_single.py", line 196, in main
    optim = Optimizer.from_opt(model, opt, checkpoint=checkpoint)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 274, in from_opt
    build_torch_optimizer(model, optim_opt),
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 76, in build_torch_optimizer
    optimizer = FusedAdam(params, lr=opt.learning_rate, betas=betas)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 635, in __init__
    fused_adam_cuda = importlib.import_module("fused_adam_cuda")
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.8/site-packages/fused_adam_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _Z13__THCudaCheck9cudaErrorPKci

Finetune/mpt/mpt_vicuna.yaml is the same as the MPT yaml above.

Can you help me solve this?

Thanks a lot!

You need to install apex, like here.

This will give you access to fusedadam, which is by far the best optimizer in terms of speed and memory footprint. I compared it to Adam (even with fused=True).

[and just last night with bnb.optim.Adam8bit (paged or not)]
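
A quick way to check your apex build afterwards (just a suggestion, not part of the tutorial) is to retry the exact import that failed in your traceback:

import importlib

# OpenNMT-py's FusedAdam wrapper imports this module; it only loads if apex
# was built with its CUDA extensions against your current PyTorch.
try:
    importlib.import_module("fused_adam_cuda")
    print("apex CUDA extensions found: fusedadam will work")
except ImportError as err:
    print("apex is missing or was built against another torch:", err)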


How did you do that? I cannot fit it on my hardware; it seems that I do not have enough RAM. I have 63GB of RAM, and train.py gets killed when I run it:

[2023-06-01 15:19:38,254 INFO] 8bit compression of layer w_2                                                                                                                     
[2023-06-01 15:19:54,795 INFO] 8bit compression of layer w_3                                                                                                                     
[2023-06-01 15:20:11,264 INFO] Adding LoRa layers for linear_values                        
[2023-06-01 15:20:25,309 INFO] Adding LoRa layers for linear_query                         
[2023-06-01 15:20:39,329 INFO] Adding LoRa layers for linear_keys                          
Killed      

With these LoRa settings:


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

So it gets killed between "Adding LoRa layers for linear_keys" and "Adding LoRa layers for final_linear".

I am using the latest OpenNMT-py commit. Is there a way to start moving things to the GPU before it overloads my RAM?

I do not know if something else could be overloading my RAM. My full config is this one:

############################
#### Finetuning corpora ####
############################
data:
    alpaca:
        path_src: dataAI/alpaca_clean_no_input.txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10
    open_assistant:
        path_src: dataAI/osst1_flattened_txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10


################################
#### Subword and vocabulary #### 
################################
src_subword_model: llama/tokenizer.model
tgt_subword_model: llama/tokenizer.model

src_vocab: vocab.txt
vocab_size_multiple: 1
share_vocab: True

################
#### Filter ####
################
src_seq_length: 1024
tgt_seq_length: 1024

# silently ignore empty lines in the data
skip_empty_level: silent

#######################
####  General opts ####
#######################
train_from: llama13B-vicuna-onmt.pt
save_model: finetuned_llama13B/llama13B-vicuna-onmt
override_opts: true # to apply LoRa
report_every: 1
save_checkpoint_steps: 1000

save_data:  finetuned_llama13B/samples
dump_samples: false
n_sample: 0

tensorboard: true
tensorboard_log_dir: finetuned_llama13B/logs

#################
##### Model #####
#################

# Recall of the overridden opts
model_task: lm
decoder_type: transformer_lm
add_qkvbias: false
dec_layers: 40
heads: 40
hidden_size: 5120
word_vec_size: 5120
transformer_ff: 11008
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

model_dtype: fp16
seed: 1234
param_init: 0.0
param_init_glorot: true
batch_size: 512
normalization: tokens
valid_batch_size: 256
train_steps: 4000
optim: fusedadam
max_grad_norm: 0.0
adam_beta2: 0.998
learning_rate : 2e-05

# Llama compatibility
layer_norm: rms
pos_ffn_activation_fn: 'silu'
max_relative_positions: -1
position_encoding: false


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# 8bit compression
quant_layers: ['w_1', 'w_2', 'w_3']

# GPU 
world_size: 1
gpu_ranks: [0]

# Optimization
accum_count: [32]
accum_steps: [0]
batch_size: 512 
valid_batch_size: 512
batch_type: tokens
batch_size_multiple: 1
bucket_size: 1024

For the 13B you need to wait for the pending PR.


After a few hiccups, the training works for me on a V100. Do you know how I can utilize both GPUs on a 2 x V100 cluster? Currently I see only one GPU being utilized. I have tried setting "-gpuid 0 1" in the script as well.

Edit: Just encountered OOM on 1 GPU.

Add this in the yaml (this example is for 4 GPUs; for your 2 x V100 use world_size: 2 and gpu_ranks: [0, 1]):

world_size: 4 
gpu_ranks: [0,1,2,3]

If you wait for the next PR, it will be far more optimized.

Also, what were the hiccups, so that I can update the tutorial?

Thanks for the quick response. Unfortunately for me, it goes OOM even when using both GPUs. Do you think the upcoming PR will be able to resolve that?

You can now git pull and use these options in the yaml file:

#4/8bit
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# Checkpointing
#use_ckpting: ['ffn', 'lora']

You should not have to activate the last option; the 13B trains just fine with a 512 sequence length and batch size (on an RTX 24GB card). If you are using a V100 with 16GB, you need to stick to the 7B model.

Yes, I am using MPT 7B itself. Unfortunately, I run OOM with the updated code as well. Following is the config:

# Corpus opts:
data:
    alpaca:
        path_src: "onmt-mpt/alpaca_clean.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10
    sharegpt:
        path_src: "onmt-mpt/sharegpt.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10
        

### Transform related opts:
#### Subword
src_subword_type: bpe
src_subword_model: "onmt-mpt/mpt-model.bpe"
src_onmttok_kwargs: '{"mode": "conservative"}'

tgt_subword_type: bpe
tgt_subword_model: "onmt-mpt/mpt-model.bpe"
tgt_onmttok_kwargs: '{"mode": "conservative"}'
gpt2_pretok: true

#### Filter
src_seq_length: 512
tgt_seq_length: 512

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
train_from: "onmt-mpt/mpt_chat_onmt.pt"
save_model: "./mpt7B-vicuna-onmt"
keep_checkpoint: 10
save_checkpoint_steps: 400
seed: 1234
report_every: 10
train_steps: 4000
valid_steps: 400

# Batching
bucket_size: 32768
#bucket_size: 1
num_workers: 2
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 896 
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32]
accum_steps: [0]

override_opts: true  # CAREFUL: this requires all settings to be defined below

share_vocab: true
save_data: "/dataAI"
src_vocab: "onmt-mpt/mpt.vocab"
src_vocab_size: 50432
tgt_vocab_size: 50432
default_specials: ['</s>', '<blank>']

decoder_start_token: '</s>'
# Optimization
model_dtype: "fp16"
apex_opt_level: ""
optim: "fusedadam"
learning_rate: 0.0002
warmup_steps: 100
decay_method: "none"
#learning_rate_decay: 0.98
#start_decay_steps: 100
#decay_steps: 10
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.0
param_init: 0
param_init_glorot: true
normalization: "tokens"

#4/8bit
quant_layers: ['w_1', 'w_2', 'w_3', 'linear_values', 'linear_query', 'linear_keys', 'final_linear']
quant_type: "bnb_NF4"
#
##LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# Model
model_task: lm
decoder_type: transformer_lm
layer_norm: standard
pos_ffn_activation_fn: 'gelu'
max_relative_positions: -2
position_encoding: false
add_qkvbias: false
dec_layers: 32
heads: 32
hidden_size: 4096
word_vec_size: 4096
transformer_ff: 16384
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

If you are using a 16GB V100, just reduce your batch_size from 896 to 512, or lower if that is still too high.

I keep having the same RAM problem at the same spot when I load the model; it does not matter what quantization I use or whether I quantize the LoRa weights or not.

Can you monitor your CPU RAM / swap disk while this is happening?
I have 64GB of RAM too, and 256GB of swap.
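
If it is easier than watching htop, you can also log it from Python with something like this (psutil is just my suggestion here, not something OpenNMT-py requires):

import psutil

# print total/available CPU RAM and total/free swap, in GiB
vm, sw = psutil.virtual_memory(), psutil.swap_memory()
print("RAM  total/available:", vm.total // 2**30, "/", vm.available // 2**30, "GiB")
print("Swap total/free     :", sw.total // 2**30, "/", sw.free // 2**30, "GiB")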

It seems that I have no swap memory at all.

I saw that it reaches 62.8GB when it tries to add the final_linear LoRa layer:

[2023-06-01 15:19:38,254 INFO] 8bit compression of layer w_2                                                                                                                     
[2023-06-01 15:19:54,795 INFO] 8bit compression of layer w_3                                                                                                                     
[2023-06-01 15:20:11,264 INFO] Adding LoRa layers for linear_values                        
[2023-06-01 15:20:25,309 INFO] Adding LoRa layers for linear_query                         
[2023-06-01 15:20:39,329 INFO] Adding LoRa layers for linear_keys                          
Killed      

It does not matter what quantization I use or whether I quantize the LoRa weights or not. It also does not matter whether I use the new branch or not (I see different logging with the new branch, but it gets killed at the same point of execution).

I know, this is just because the checkpoint is not sharded, so as it stands the CPU RAM needs to fit the state_dict plus the model.
I will change this in the coming days, but in the meantime an easy fix is to add some swap space.
