Device-side assert triggered on a converted AWQ Mistral model

I have converted the TheBloke/Starling-LM-7B-alpha-AWQ model using the following command -
python tools/convert_HF.py --model_dir TheBloke/Starling-LM-7B-alpha-AWQ --output ./Starling-LM-7B-alpha-AWQ-onmt/ --format pytorch --nshards 1

I am not able to run inference on the converted model and get the following error.
Command I am using to run: python translate.py --config ./Starling-LM-7B-alpha-AWQ-onmt/inference.yaml --src ./input_prompt.txt --output ./output.txt
input_prompt.txt content -
GPT-4 User: How do you manage stress?<|end_of_turn|>GPT4 Assistant:

Traceback (most recent call last):
  File "/mnt/sea/c2/OpenNMT-py/translate.py", line 6, in <module>
    main()
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 47, in main
    translate(opt)
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 22, in translate
    _, _ = engine.infer_file()
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 35, in infer_file
    scores, preds = self._translate(infer_iter)
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 159, in _translate
    scores, preds = self.translator._translate(
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 496, in _translate
    batch_data = self.translate_batch(batch, attn_debug)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1067, in translate_batch
    return self._translate_batch_with_strategy(batch, decode_strategy)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1149, in _translate_batch_with_strategy
    decode_strategy.advance(log_probs, attn)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 432, in advance
    super(BeamSearchLM, self).advance(log_probs, attn)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/beam_search.py", line 379, in advance
    self.is_finished_list = self.topk_ids.eq(self.eos).tolist()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [228,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

This is my inference.yaml file content -

transforms: [sentencepiece]

src_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"
tgt_subword_model: "Starling-LM-7B-alpha-AWQ-onmt/tokenizer.model"

model: "Starling-LM-7B-alpha-AWQ-onmt/Starling-LM-7B-alpha-AWQ-onmt.pt"

seed: 13
max_length: 256
gpu: 0
batch_type: sents
batch_size: 60
world_size: 1
gpu_ranks: [0]

precision: fp16
beam_size: 1
n_best: 1
profile: false
report_time: true
src: None

I also have one more question: I do not understand the example prompts provided for the Mistral model, in particular the tokens used there, e.g. ⦅newline⦆. I would appreciate an explanation or a link to documentation.

My suggestions:

  1. try with TheBloke/Mistral-7B-Instruct-v0.2-AWQ
    if it works, then copy TheBloke/Starling-LM-7B-alpha-AWQ locally and point the conversion to your local folder,
  2. but change the vocab_size in the config.json from 32002 to 32000 (a sketch of that edit follows this list).
    The conversion script convert_HF does not handle a tokenizer_config.json with extra info like additional tokens, etc.
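
For reference, a minimal sketch of that vocab_size edit, assuming you have already copied the model to a local folder (the path below is just a placeholder):

import json

config_path = "./Starling-LM-7B-alpha-AWQ-local/config.json"  # hypothetical local copy

with open(config_path) as f:
    config = json.load(f)

# drop the two added special tokens from the declared vocab size
config["vocab_size"] = 32000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)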

For your last question: ⦅newline⦆ is just the equivalent of "\n", but since we handle plain text files rather than json files, we had to use this trick.
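
For example, if your prompt spans several lines, you would flatten it before writing it to the src file, something like this (a minimal sketch):

# flatten a multi-line prompt into the one-line-per-example format
# expected by the plain-text src file
prompt = "How do you manage stress?\nPlease answer in three bullet points."
with open("input_prompt.txt", "w") as f:
    f.write(prompt.replace("\n", "⦅newline⦆") + "\n")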

Hi @vince62s,

  1. TheBloke/Mistral-7B-Instruct-v0.2-AWQ is working fine
  2. TheBloke/Starling-LM-7B-alpha-AWQ is still yielding the same error even after changing vocab_size.

Upon further debugging I found that this problem does not happen for all prompts. It happens specifically when token 32000 is generated (which is the case for many prompts).
Token 32000 corresponds to <|end_of_turn|>.

It seems like a problem with respect to extra tokens.
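
A quick way to confirm that mapping is to look it up with the HF tokenizer (a small sketch, assuming the transformers package is installed):

from transformers import AutoTokenizer

# tokenizer shipped with the AWQ checkpoint
tok = AutoTokenizer.from_pretrained("TheBloke/Starling-LM-7B-alpha-AWQ")

print(tok.convert_ids_to_tokens(32000))  # prints <|end_of_turn|> here
print(len(tok))                          # 32002, i.e. the base vocab plus the two added tokens
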
Error logs -

Traceback (most recent call last):
  File "/mnt/sea/c2/OpenNMT-py/translate.py", line 6, in <module>
    main()
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 47, in main
    translate(opt)
  File "/mnt/sea/c2/OpenNMT-py/onmt/bin/translate.py", line 22, in translate
    _, _ = engine.infer_file()
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 35, in infer_file
    scores, preds = self._translate(infer_iter)
  File "/mnt/sea/c2/OpenNMT-py/onmt/inference_engine.py", line 159, in _translate
    scores, preds = self.translator._translate(
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 496, in _translate
    batch_data = self.translate_batch(batch, attn_debug)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1068, in translate_batch
    return self._translate_batch_with_strategy(batch, decode_strategy)
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 1135, in _translate_batch_with_strategy
    log_probs, attn = self._decode_and_generate(
  File "/mnt/sea/c2/OpenNMT-py/onmt/translate/translator.py", line 684, in _decode_and_generate
    dec_out, dec_attn = self.model.decoder(
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/sea/c2/OpenNMT-py/onmt/decoders/transformer.py", line 959, in forward
    dec_out = self.embeddings(tgt, step=step)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/sea/c2/OpenNMT-py/onmt/modules/embeddings.py", line 303, in forward
    source = self.make_embedding(source)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/sea/c2/OpenNMT-py/onmt/modules/util_class.py", line 28, in forward
    emb_out.append(f(x))
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/kd/anaconda3/envs/itv2_hf/lib/python3.9/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ok, I did not handle all the specific cases / adaptations in my converter here:

You can try to debug a few things.

Open an ipython console:

import torch

m = torch.load("your_onmt_model")

len(m["vocabs"]["src"])

Does it say 32000 or 32002?

Then look at m["model"].keys() and check the size of the embedding tensor, something like:

m["model"]["decoder.embeddings.make_embedding.emb_luts.0.weight"].size()

and

m["model"]["generator.weight"].size()

If the vocab length and those tensor sizes do not agree (32000 vs 32002), that mismatch would explain the out-of-range index in the embedding lookup.

Hi @vince62s,
Thanks for the directions. They were helpful, and I was able to fix the problem by appending two tokens in convert_HF.py.

tokenizer = Tokenizer(model_path=tokenizer_model)
vocab = tokenizer.vocab
# add the two special tokens that the checkpoint defines beyond the base 32000-entry vocab
vocab.append("<|end_of_turn|>")
vocab.append("<|pad_0|>")

In the Inference class in translator.py, changing the EOS and pad indices:

self._tgt_eos_idx = vocabs["tgt"].lookup_token(DefaultTokens.EOS)   # original
self._tgt_eos_idx = vocabs["tgt"].lookup_token("<|end_of_turn|>")   # replacement
self._tgt_pad_idx = vocabs["tgt"].lookup_token(DefaultTokens.PAD)   # original
self._tgt_pad_idx = vocabs["tgt"].lookup_token("<|pad_0|>")         # replacement

And also adding this in _build_target_tokens():

if tokens[-1] == "<|end_of_turn|>":
    tokens = tokens[:-1]

I have one more question regarding inference parameters: how can we change the repetition_penalty during inference in OpenNMT-py?

There is no such setting; the only related one we have is this:

but honestly I have never really used it, so I am unsure whether it still performs ok.