Thanks! With roughly the same config it works well; the only change I made was setting the learning rate to 0.0002. But after the first save at step 200 and the evaluation, I got these errors:
[2023-05-08 12:13:07,546 INFO] Train perplexity: 3.67446
[2023-05-08 12:13:07,546 INFO] Train accuracy: 67.1748
[2023-05-08 12:13:07,546 INFO] Sentences processed: 34564
[2023-05-08 12:13:07,546 INFO] Average bsz: 767/ 767/ 5
[2023-05-08 12:13:07,546 INFO] Validation perplexity: 2.86707
[2023-05-08 12:13:07,546 INFO] Validation accuracy: 71.36
[2023-05-08 12:13:07,689 INFO] Saving checkpoint ready/llama7b-main.pt_step_200.pt
[2023-05-08 12:13:09,368 INFO] Step 201, cuda OOM - batch removed
[2023-05-08 12:13:09,484 INFO] Step 201, cuda OOM - batch removed
[2023-05-08 12:13:09,512 INFO] Step 201, cuda OOM - batch removed
... (the OOM line repeats many times, then:)
TypeError: multi_tensor_l2norm(): incompatible function arguments. The following argument types are supported:
1. (arg0: int, arg1: torch.Tensor, arg2: List[List[torch.Tensor]], arg3: Optional[bool]) -> Tuple[torch.Tensor, torch.Tensor]
Invoked with: 65536, tensor([0], device='cuda:0', dtype=torch.int32), [[None, None, None, ...]] (every element of the list is None), True
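My guess (an assumption, not something I have confirmed in the OpenNMT-py or apex source) is that after every batch at step 201 is dropped for OOM, no parameter ever receives a gradient, so the gradient list handed to apex's multi_tensor_l2norm is all None and the call fails. A None-safe gradient-norm sketch, with plain floats standing in for tensors so it runs without torch/apex installed:

```python
import math

def safe_grad_l2norm(grads):
    """L2 norm over gradients, skipping parameters whose grad is None.

    Stand-in sketch: real tensors are replaced by floats so the idea
    runs without torch/apex; the point is the None filter up front.
    """
    present = [g for g in grads if g is not None]
    if not present:  # every batch was dropped for OOM -> no grads at all
        return 0.0
    return math.sqrt(sum(g * g for g in present))

print(safe_grad_l2norm([None, None, None]))  # all grads missing -> 0.0
print(safe_grad_l2norm([3.0, None, 4.0]))    # -> 5.0
```

If the step has no usable gradients at all, skipping the optimizer step entirely (rather than calling the fused norm kernel with Nones) avoids the TypeError.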
And I can’t resume from the saved checkpoint. When I set train_from to the saved checkpoint, I get:
Traceback (most recent call last):
File "/opt/conda/bin/onmt_train", line 33, in <module>
sys.exit(load_entry_point('OpenNMT-py==3.1.1', 'console_scripts', 'onmt_train')())
File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/bin/train.py", line 65, in main
train(opt)
File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/bin/train.py", line 50, in train
train_process(opt, device_id=0)
File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/train_single.py", line 164, in main
model = build_model(model_opt, opt, vocabs, checkpoint)
File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/model_builder.py", line 414, in build_model
model = build_base_model(model_opt, vocabs, checkpoint)
File "/opt/conda/lib/python3.10/site-packages/OpenNMT_py-3.1.1-py3.10.egg/onmt/model_builder.py", line 385, in build_base_model
if '0.weight' in checkpoint['generator']:
TypeError: argument of type 'NoneType' is not iterable
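The final TypeError suggests the checkpoint was saved with its 'generator' entry set to None, and build_base_model then runs a membership test on it without a None guard. A minimal sketch of such a guard (hypothetical, mirroring the field names in the traceback; not the actual OpenNMT-py fix), with a plain dict standing in for the object torch.load would return:

```python
# Plain-dict stand-in for torch.load(checkpoint_path); the real object
# would hold state_dicts under keys like 'model' and 'generator'.
checkpoint = {"model": {"w": [1.0]}, "generator": None}

# The crashing line is `'0.weight' in checkpoint['generator']`, which
# raises TypeError when 'generator' is None. A None-safe version:
generator_state = checkpoint.get("generator") or {}
has_legacy_generator = "0.weight" in generator_state
print(has_legacy_generator)  # False: a missing/None generator is skipped
```

That only works around the symptom, though; the underlying question is why the generator state was None in the saved checkpoint in the first place (possibly because the save happened while the step-201 OOM handling was tearing things down).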