I have converted the TheBloke/Starling-LM-7B-alpha-AWQ model using the following command -
python tools/convert_HF.py --model_dir TheBloke/Starling-LM-7B-alpha-AWQ --output ./Starling-LM-7B-alpha-AWQ-onmt/ --format pytorch --nshards 1
However, I am not able to run inference on the converted model. This is the command I am using -
python translate.py --config ./Starling-LM-7B-alpha-AWQ-onmt/inference.yaml --src ./input_prompt.txt --output ./output.txt
input_prompt.txt content -
GPT-4 User: How do you manage stress?<|end_of_turn|>GPT4 Assistant:
And this is the error I get -
Traceback (most recent call last):
File "/mnt/sea/c2/OpenNMT-py/translate.py", line 6, in <module>
main()
File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 47, in main
translate(opt)
File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 22, in translate
_, _ = engine.infer_file()
File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 35, in infer_file
scores, preds = self._translate(infer_iter)
File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 159, in _translate
scores, preds = self.translator._translate(
File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 496, in _translate
batch_data = self.translate_batch(batch, attn_debug)
File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1067, in translate_batch
return self._translate_batch_with_strategy(batch, decode_strategy)
File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1149, in _translate_batch_with_strategy
decode_strategy.advance(log_probs, attn)
File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 432, in advance
super(BeamSearchLM, self).advance(log_probs, attn)
File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 379, in advance
self.is_finished_list = self.topk_ids.eq(self.eos).tolist()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
This is my inference.yaml file content -
transforms: [sentencepiece]
src_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"
tgt_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"
model: "Starling-LM-7B-alpha-AWQ-onmt/Starling-LM-7B-alpha-AWQ-onmt.pt"
seed: 13
max_length: 256
gpu: 0
batch_type: sents
batch_size: 60
world_size: 1
gpu_ranks: [0]
precision: fp16
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None
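Since the assertion comes from indexSelectLargeIndex (`srcIndex < srcSelectDimSize`), my working theory is that some token id is larger than the embedding table of the converted model. Below is a minimal sanity check I intend to run to compare the tokenizer's vocab size against the checkpoint's embedding shapes; note that the "model"/"generator" checkpoint keys are my assumption about the converted file's layout, not something I have verified -

import torch
import sentencepiece as spm

ckpt_path = "Starling-LM-7B-alpha-AWQ-onmt/Starling-LM-7B-alpha-AWQ-onmt.pt"
spm_path = "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"
prompt = "GPT-4 User: How do you manage stress?<|end_of_turn|>GPT4 Assistant:"

# Tokenize the prompt and report the vocab size and the largest id produced.
sp = spm.SentencePieceProcessor(model_file=spm_path)
ids = sp.encode(prompt, out_type=int)
print("sentencepiece vocab size:", sp.get_piece_size())
print("max token id in prompt:", max(ids))

# Dump the shapes of embedding/generator tensors from the converted checkpoint
# so they can be compared against the vocab size above. The "model"/"generator"
# keys are only my guess about the checkpoint layout.
ckpt = torch.load(ckpt_path, map_location="cpu")
for section in ("model", "generator"):
    state = ckpt.get(section, {})
    if isinstance(state, dict):
        for name, tensor in state.items():
            if hasattr(tensor, "shape") and ("emb" in name or section == "generator"):
                print(section, name, tuple(tensor.shape))

If the two numbers disagree, that would at least explain the device-side assert.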
And I have one more question - I am not able to understand the example prompts provided for the Mistral model, in particular the special tokens used there such as ⦅newline⦆. I'd appreciate it if you could provide some explanation or a link to the relevant documentation.
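For context, my current guess (unverified) is that ⦅newline⦆ is simply a single-line placeholder for a real newline, so that a multi-line prompt can be kept on one line of the src file and swapped back somewhere around tokenization, roughly -

line = line.replace("⦅newline⦆", "\n")  # my assumption, not taken from the docs

If that guess is wrong, a correction would be very welcome.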