Finetuning Llama-7B/13B or MosaicML MPT-7B - Reproduce Vicuna / Alpaca

Thanks! With roughly the same config it works well. I only changed lr to 0.0002. But after the first save at step 200 and the evaluation, I got these errors:

[2023-05-08 12:13:07,546 INFO] Train perplexity: 3.67446
[2023-05-08 12:13:07,546 INFO] Train accuracy: 67.1748
[2023-05-08 12:13:07,546 INFO] Sentences processed: 34564
[2023-05-08 12:13:07,546 INFO] Average bsz:  767/ 767/ 5
[2023-05-08 12:13:07,546 INFO] Validation perplexity: 2.86707
[2023-05-08 12:13:07,546 INFO] Validation accuracy: 71.36
[2023-05-08 12:13:07,689 INFO] Saving checkpoint ready/llama7b-main.pt_step_200.pt
[2023-05-08 12:13:09,368 INFO] Step 201, cuda OOM - batch removed
[2023-05-08 12:13:09,484 INFO] Step 201, cuda OOM - batch removed
[2023-05-08 12:13:09,512 INFO] Step 201, cuda OOM - batch removed
.....
TypeError: multi_tensor_l2norm(): incompatible function arguments. The following argument types are supported:
    1. (arg0: int, arg1: torch.Tensor, arg2: List[List[torch.Tensor]], arg3: Optional[bool]) -> Tuple[torch.Tensor, torch.Tensor]

Invoked with: 65536, tensor([0], device='cuda:0', dtype=torch.int32), [[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]], True

And I can’t restart from the saved checkpoint. When I set train_from to the saved checkpoint, I get:

Traceback (most recent call last):
  File "/opt/conda/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py==3.1.1', 'console_scripts', 'onmt_train')())
  File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/bin/train.py", line 65, in main
    train(opt)
  File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/bin/train.py", line 50, in train
    train_process(opt, device_id=0)
  File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/train_single.py", line 164, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/model_builder.py", line 414, in build_model
    model = build_base_model(model_opt, vocabs, checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/model_builder.py", line 385, in build_base_model
    if '0.weight' in checkpoint['generator']:
TypeError: argument of type 'NoneType' is not iterable

The first error results from too many OOMs.
If you use a dataset that is not the same as mine, maybe try to reduce the batch size a bit, or maybe you have some other processes using your GPU.

Anyway, you cannot train_from a LoRa checkpoint directly.
Either you start again (recommended, because otherwise training will restart from the beginning of the dataset),
or
you would need to use the tool lora_weights with --action concat
and train from the resulting merged checkpoint.
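
A quick way to tell the two kinds of checkpoint apart, as a minimal sketch. It assumes, based only on the traceback above (where checkpoint['generator'] is None), that a LoRa-only checkpoint carries no generator weights. The helper name is hypothetical and is not part of OpenNMT-py.

```python
def is_lora_only(checkpoint):
    """Heuristic: True when the checkpoint dict has no generator weights.

    Assumption taken from the traceback above: train_from fails because
    checkpoint['generator'] is None in a LoRa checkpoint.
    """
    return checkpoint.get("generator") is None


# Toy dicts standing in for loaded checkpoints (real ones hold tensors).
lora_ckpt = {"model": {"lora_A": [0.1]}, "generator": None}
merged_ckpt = {"model": {"weight": [0.1]}, "generator": {"weight": [0.2]}}

print(is_lora_only(lora_ckpt))    # True
print(is_lora_only(merged_ckpt))  # False
```

On a real file you would first load the dict with torch.load(path, map_location="cpu") and pass it to the helper.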

Thanks. It looks like saving a checkpoint uses some additional memory that is not freed afterwards. But anyway, nice work!

How do I convert the finetuned model to CTranslate2?

It won’t work for now. It requires some changes in the OpenNMT-py => CT2 converter.
You’ll have to be patient; inference works fine in OpenNMT-py.

MosaicML released another 7B model that is more permissive in terms of usage.
Here is the blog page: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs

The architecture is slightly different compared to llama but we added those features in OpenNMT-py.

The first step is to convert the Hugging Face checkpoint format into OpenNMT-py format.

Use the converter tools/convert_mpt.py from the repo.

Then download the bpe model and the vocab file here:
https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt-model.bpe
https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt.vocab
Those were created from the tokenizer.json file on the Hugging Face repo.
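
If you prefer to fetch the two files from a script, here is a minimal sketch using only the standard library. The helper names and the dest_dir parameter are hypothetical; the URLs are the ones listed above. The actual download call is left commented out so nothing is fetched by accident.

```python
import os
import urllib.request
from urllib.parse import urlparse

URLS = [
    "https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt-model.bpe",
    "https://opennmt-models.s3.amazonaws.com/mosaic-MPT/mpt.vocab",
]


def local_name(url):
    """Return the file name part of a URL, e.g. 'mpt.vocab'."""
    return os.path.basename(urlparse(url).path)


def download_all(urls, dest_dir="."):
    """Download each URL into dest_dir, skipping files already present."""
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)


# download_all(URLS)  # uncomment to fetch the files
```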

You will also need to get the Alpaca and sharegpt data file from the first post in this thread (same files as for llama finetuning).

Then you can use this config file and get running for finetuning with LoRa + 8bit loading:

# Corpus opts:
data:
    alpaca:
        path_src: "alpaca_clean.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10
    sharegpt:
        path_src: "sharegpt.txt"
        transforms: [onmt_tokenize, filtertoolong]
        weight: 10

    valid:
        path_src: "valid.txt"
        transforms: [onmt_tokenize]

### Transform related opts:
#### Subword
src_subword_type: bpe
src_subword_model: "mpt-model.bpe"
src_onmttok_kwargs: '{"mode": "conservative"}'

tgt_subword_type: bpe
tgt_subword_model: "mpt-model.bpe"
tgt_onmttok_kwargs: '{"mode": "conservative"}'
gpt2_pretok: true

#### Filter
src_seq_length: 512
tgt_seq_length: 512

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
train_from: "mpt7b-onmt.pt"
save_model: "/mpt7B/mpt7B-vicuna-onmt"
keep_checkpoint: 10
save_checkpoint_steps: 400
seed: 1234
report_every: 10
train_steps: 4000
valid_steps: 400

# Batching
bucket_size: 32768
#bucket_size: 1
num_workers: 2
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 896
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32]
accum_steps: [0]

override_opts: true  # CAREFUL: this requires all settings to be defined below

share_vocab: true
save_data: "/dataAI"
src_vocab: "mpt.vocab"
src_vocab_size: 50432
tgt_vocab_size: 50432
default_specials: ['</s>', '<blank>']

decoder_start_token: '</s>'
# Optimization
model_dtype: "fp16"
apex_opt_level: ""
optim: "fusedadam"
learning_rate: 0.0002
warmup_steps: 100
decay_method: "none"
#learning_rate_decay: 0.98
#start_decay_steps: 100
#decay_steps: 10
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.0
param_init: 0
param_init_glorot: true
normalization: "tokens"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

#8bit
quant_layers: ['w_1', 'w_2']
# Model
model_task: lm
decoder_type: transformer_lm
layer_norm: standard
pos_ffn_activation_fn: 'gelu'
max_relative_positions: -2
position_encoding: false
add_qkvbias: false
dec_layers: 32
heads: 32
hidden_size: 4096
word_vec_size: 4096
transformer_ff: 16384
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

It took 8-9 hours on my RTX4090
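
To get a feel for the batching settings above, here is a small sketch of the arithmetic. It assumes that with batch_type "tokens" each batch holds at most batch_size tokens and that accum_count batches are accumulated per optimizer step on each GPU. The result is an upper bound, since token batches are not always full.

```python
def tokens_per_step(batch_size, accum_count, world_size):
    """Upper bound on tokens seen per optimizer step."""
    return batch_size * accum_count * world_size


# Values from the MPT config above: batch_size 896, accum_count 32, 1 GPU.
per_step = tokens_per_step(896, 32, 1)
print(per_step)          # 28672
print(per_step * 4000)   # 114688000 tokens at most over train_steps: 4000
```

If you hit OOM, lowering batch_size while raising accum_count keeps this product roughly constant.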

Then you need to merge the LoRa weights into the base model (same as for llama).
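
For intuition, this is what merging means in the standard LoRA formulation: the low-rank update is folded back into the frozen weight, W' = W + (alpha / rank) * B @ A. The sketch below uses plain Python lists and hypothetical function names; it illustrates the math only and is not the code of the OpenNMT-py merge tool.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * lora_b @ lora_a."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, 2 x 2
lora_a = [[1.0, 2.0]]              # rank 1, in 2
lora_b = [[0.5], [0.0]]            # out 2, rank 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

With lora_alpha: 16 and lora_rank: 8 as in the config above, the scale factor in this formulation would be 2.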

Inference is also similar, just make sure you set up onmt_tokenize and the bpe model instead of sentencepiece.

transforms: [onmt_tokenize]

#### Subword
src_subword_type: bpe
src_subword_model: "mpt-model.bpe"
src_onmttok_kwargs: '{"mode": "conservative"}'

tgt_subword_type: bpe
tgt_subword_model: "mpt-model.bpe"
tgt_onmttok_kwargs: '{"mode": "conservative"}'
gpt2_pretok: true
# Model info
model: "mpt7B/mpt7B-vicuna-merged-onmt_step_4000.pt"

# Inference
seed: 42
max_length: 512
gpu: 0
batch_type: sents
batch_size: 1
precision: fp16
random_sampling_topk: 40
random_sampling_topp: 0.75
random_sampling_temp: 0.1
beam_size: 1
report_time: true
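
To see what the three random_sampling_* values do together, here is a generic sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It illustrates the general technique under the assumption that the filters are applied in that order; it is not OpenNMT-py's implementation, and the function name is hypothetical.

```python
import math


def filter_distribution(logits, temperature, top_k, top_p):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    ranked = ranked[:top_k]  # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for prob, index in ranked:  # keep the smallest set reaching top_p mass
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


# Settings from the inference config: temp 0.1, topk 40, topp 0.75.
print(filter_distribution([2.0, 1.0, 0.0], 0.1, 40, 0.75))  # {0: 1.0}
```

With a temperature as low as 0.1 the distribution is so peaked that sampling behaves almost like greedy decoding.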

Output is very similar to llama finetuned.

DISCLAIMER:
While the MPT-7B is more permissive (commercial usage allowed) it is unclear whether the alpaca / sharegpt datasets are allowed for commercial usage. For Alpaca, it seems that they have been generated through the OpenAI API which restricts the downstream usage. Sharegpt seems to be ChatGPT web output for which the TOS is different.

The best would be to finetune using the OpenAssistant dataset.

Step-by-step tuto for “Vicuna” replication.

I parsed the Open Assistant Dataset and made it OpenNMT-py friendly for finetuning.

https://opennmt-models.s3.amazonaws.com/llama/osst1.flattened.txt

Can be tested now with:
Llama
MPT7B
Open Llama
Redpajama

I have not tested any finetuning with this dataset but 1) it’s fully open source 2) The Open Assistant project reports good results.

Can you give a configuration example for training Redpajama 3B?

Llama 13B fits on an RTX card with 24GB at a 512 context length.

[2023-05-31 16:17:05,778 INFO] Step 10/ 4000; acc: 68.7; ppl:   3.6; xent: 1.3; lr: 0.00020; sents:     602; bsz:  402/ 402/ 2; 812/812 tok/s;    159 sec;
[2023-05-31 16:18:53,812 INFO] Step 20/ 4000; acc: 73.1; ppl:   2.8; xent: 1.0; lr: 0.00020; sents:     505; bsz:  392/ 392/ 2; 1161/1161 tok/s;    267 sec;
[2023-05-31 16:20:37,474 INFO] Step 30/ 4000; acc: 73.8; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     506; bsz:  395/ 395/ 2; 1219/1219 tok/s;    370 sec;
[2023-05-31 16:22:16,611 INFO] Step 40/ 4000; acc: 73.3; ppl:   2.7; xent: 1.0; lr: 0.00020; sents:     460; bsz:  384/ 384/ 1; 1240/1240 tok/s;    469 sec;
[2023-05-31 16:23:56,426 INFO] Step 50/ 4000; acc: 74.8; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     542; bsz:  394/ 394/ 2; 1263/1263 tok/s;    569 sec;
[2023-05-31 16:25:38,707 INFO] Step 60/ 4000; acc: 74.6; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     510; bsz:  395/ 395/ 2; 1235/1235 tok/s;    671 sec;
[2023-05-31 16:27:19,223 INFO] Step 70/ 4000; acc: 74.5; ppl:   2.6; xent: 0.9; lr: 0.00020; sents:     444; bsz:  383/ 383/ 1; 1219/1219 tok/s;    772 sec;
[2023-05-31 16:29:02,811 INFO] Step 80/ 4000; acc: 75.2; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     550; bsz:  394/ 394/ 2; 1217/1217 tok/s;    876 sec;
[2023-05-31 16:30:45,309 INFO] Step 90/ 4000; acc: 75.1; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     559; bsz:  393/ 393/ 2; 1226/1226 tok/s;    978 sec;
[2023-05-31 16:32:27,957 INFO] Step 100/ 4000; acc: 75.5; ppl:   2.5; xent: 0.9; lr: 0.00020; sents:     495; bsz:  396/ 396/ 2; 1234/1234 tok/s;   1081 sec;
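
A rough way to turn such a log line into a time estimate, as a sketch. It assumes the trailing "sec" field is the time elapsed since training started (the values above grow cumulatively) and that later steps run at about the same pace. The function name and regular expression are my own.

```python
import re

LINE = (
    "[2023-05-31 16:32:27,957 INFO] Step 100/ 4000; acc: 75.5; ppl:   2.5; "
    "xent: 0.9; lr: 0.00020; sents:     495; bsz:  396/ 396/ 2; "
    "1234/1234 tok/s;   1081 sec;"
)

# Captures: current step, total steps, elapsed seconds.
PATTERN = re.compile(r"Step\s+(\d+)/\s*(\d+);.*?(\d+)\s+sec;")


def eta_hours(line):
    """Estimate the remaining training time in hours from one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a training progress line")
    step, total, elapsed = (int(group) for group in match.groups())
    seconds_per_step = elapsed / step
    return (total - step) * seconds_per_step / 3600.0


print(round(eta_hours(LINE), 1))  # 11.7
```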

Hi, can you explain how to get the “mpt7b-onmt.pt” file in this yaml file? Thank you!

I got it doing something like this for chat version:

python

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'mosaicml/mpt-7b-chat',
  trust_remote_code=True
)
model.save_pretrained("mpt_chat")

quit()

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mpt_chat \
    --vocab_file mpt.vocab \
    --output mpt_chat_onmt.pt

You can also do this directly:

python …/OpenNMT-py/tools/convert_mpt.py \
    --model_dir mosaicml/mpt-7b \
    --vocab_file mpt.vocab \
    --output mpt7b-onmt.pt

Have a look at the convert script; it uses the same AutoModelForCausalLM.from_pretrained().

This accepts either a Hugging Face hub model name, in which case the files are fetched directly from the HF hub, or a local directory if you have previously downloaded the .bin and .json files.

Is the converted file (e.g. mpt7B_onmt.pt) different from “mpt7B-vicuna-onmt”, as below?

I ran python3 OpenNMT-py/onmt/bin/train.py -config Finetune/mpt/mpt_vicuna.yaml and got this error:

Traceback (most recent call last):
  File "OpenNMT-py/onmt/bin/train.py", line 71, in <module>
    main()
  File "OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/root/OpenNMT-py/onmt/train_single.py", line 196, in main
    optim = Optimizer.from_opt(model, opt, checkpoint=checkpoint)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 274, in from_opt
    build_torch_optimizer(model, optim_opt),
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 76, in build_torch_optimizer
    optimizer = FusedAdam(params, lr=opt.learning_rate, betas=betas)
  File "/root/OpenNMT-py/onmt/utils/optimizers.py", line 635, in __init__
    fused_adam_cuda = importlib.import_module("fused_adam_cuda")
  File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 657, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /opt/conda/lib/python3.8/site-packages/fused_adam_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _Z13__THCudaCheck9cudaErrorPKci

Finetune/mpt/mpt_vicuna.yaml is the same as the MPT yaml above.

Can you help me solve this?

Thanks a lot!

You need to install apex,
like here.

This will give you access to fusedadam, which is by far the best optimizer in terms of speed and memory footprint. I compared it to adam (even with fused=True).

[and just last night to bnb.optim.Adam8bit (paged or not)]
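
Before launching a long run, you can check whether the extension actually loads. This is a minimal sketch: the module name fused_adam_cuda comes from the traceback above, the helper name is hypothetical, and falling back to "adam" is only a suggestion for a slower but working setup.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly, False otherwise.

    A real import is attempted on purpose: in the traceback above the .so
    file exists but fails at load time with an undefined symbol, which
    also surfaces as ImportError.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# Pick the value for the "optim" setting in the yaml config.
optim = "fusedadam" if can_import("fused_adam_cuda") else "adam"
print(optim)
```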

How did you do that? I cannot fit it on my hardware; it seems that I do not have enough RAM. I have 63GB of RAM. train.py gets killed when I run it:

[2023-06-01 15:19:38,254 INFO] 8bit compression of layer w_2
[2023-06-01 15:19:54,795 INFO] 8bit compression of layer w_3
[2023-06-01 15:20:11,264 INFO] Adding LoRa layers for linear_values
[2023-06-01 15:20:25,309 INFO] Adding LoRa layers for linear_query
[2023-06-01 15:20:39,329 INFO] Adding LoRa layers for linear_keys
Killed

With these LoRa settings:


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

So it gets killed between Adding LoRa layers for linear_keys and Adding LoRa layers for final_linear.

I am using the latest OpenNMT commit. Is there a way to start moving things to the GPU before it overloads my RAM?

I do not know if something else can be overloading my RAM. My full config is this one:

############################
#### Finetuning corpora ####
############################
data:
    alpaca:
        path_src: dataAI/alpaca_clean_no_input.txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10
    open_assistant:
        path_src: dataAI/osst1_flattened_txt
        transforms: [sentencepiece, filtertoolong]
        weight: 10


################################
#### Subword and vocabulary #### 
################################
src_subword_model: llama/tokenizer.model
tgt_subword_model: llama/tokenizer.model

src_vocab: vocab.txt
vocab_size_multiple: 1
share_vocab: True

################
#### Filter ####
################
src_seq_length: 1024
tgt_seq_length: 1024

# silently ignore empty lines in the data
skip_empty_level: silent

#######################
####  General opts ####
#######################
train_from: llama13B-vicuna-onmt.pt
save_model: finetuned_llama13B/llama13B-vicuna-onmt
override_opts: true # to apply LoRa
report_every: 1
save_checkpoint_steps: 1000

save_data:  finetuned_llama13B/samples
dump_samples: false
n_sample: 0

tensorboard: true
tensorboard_log_dir: finetuned_llama13B/logs

#################
##### Model #####
#################

# Recall of the overridden opts
model_task: lm
decoder_type: transformer_lm
add_qkvbias: false
dec_layers: 40
heads: 40
hidden_size: 5120
word_vec_size: 5120
transformer_ff: 11008
dropout_steps: [0]
dropout: [0.0]
attention_dropout: [0.0]

model_dtype: fp16
seed: 1234
param_init: 0.0
param_init_glorot: true
batch_size: 512
normalization: tokens
valid_batch_size: 256
train_steps: 4000
optim: fusedadam
max_grad_norm: 0.0
adam_beta2: 0.998
learning_rate : 2e-05

# Llama compatibility
layer_norm: rms
pos_ffn_activation_fn: 'silu'
max_relative_positions: -1
position_encoding: false


# LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.05
lora_alpha: 16
lora_embedding: false

# 8bit compression
quant_layers: ['w_1', 'w_2', 'w_3']

# GPU 
world_size: 1
gpu_ranks: [0]

# Optimization
accum_count: [32]
accum_steps: [0]
batch_size: 512 
valid_batch_size: 512
batch_type: tokens
batch_size_multiple: 1
bucket_size: 1024

For the 13B you need to wait for the pending PR.
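
Some rough arithmetic on why 63GB of RAM can be tight here. This sketch assumes a nominal 13e9 parameters stored in fp16 (2 bytes each). Whether loading temporarily holds a second copy of the weights in RAM is an assumption on my side, not something confirmed in this thread.

```python
def weights_gib(n_params, bytes_per_param):
    """Size of the raw weights in GiB."""
    return n_params * bytes_per_param / 2**30


fp16 = weights_gib(13e9, 2)
print(round(fp16, 1))      # 24.2 GiB for one fp16 copy
print(round(2 * fp16, 1))  # 48.4 GiB if two copies coexist during loading
```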

After a few hiccups, the training works for me on a V100. Do you know how I can utilize both GPUs on a 2 x V100 cluster? Currently I see only one GPU being utilized. I have tried setting “-gpuid 0 1” in the script as well.

Edit: Just encountered OOM on 1 GPU.

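Add this to the yaml (this example is for 4 GPUs; set world_size to the number of GPUs and list their indices in gpu_ranks):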

world_size: 4 
gpu_ranks: [0,1,2,3]
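
For other GPU counts the two settings follow the same pattern. Here is a tiny sketch with a hypothetical helper that produces the values for a single machine with n_gpus devices, for example the 2 x V100 setup asked about above.

```python
def multi_gpu_opts(n_gpus):
    """Return the world_size and gpu_ranks values for n_gpus local GPUs."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return {"world_size": n_gpus, "gpu_ranks": list(range(n_gpus))}


print(multi_gpu_opts(2))  # {'world_size': 2, 'gpu_ranks': [0, 1]}
```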

If you wait for the next PR, it will be far better optimized.

Also, what were the hiccups, so that I can update the tutorial?

Thanks for the quick response. Unfortunately for me, it goes OOM even when using both GPUs. Do you think the upcoming PR will resolve that?