Inference bug of the MoE GPTQ models #30515

Open

bozheng-hit opened this issue Apr 27, 2024 · 4 comments

@bozheng-hit (Contributor)

System Info

Generating with GPTQ MoE models fails with the following error after merging PR #30209. @younesbelkada @SunMarc

The error traceback is below; the model generates successfully once I revert the change to modeling_qwen2_moe.py.

Traceback (most recent call last):
  File "/home/data/roy.zb/workspace/test_auto_gptq.py", line 23, in <module>
    generated_ids = model.generate(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 1656, in generate
    result = self._sample(
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 2819, in _sample
    outputs = self(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1355, in forward
    outputs = self.model(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1224, in forward
    layer_outputs = decoder_layer(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 934, in forward
    hidden_states = self.mlp(hidden_states)
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 856, in forward
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The code to reproduce the error is here:

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Running this produces the same traceback shown above.

Expected behavior

Output the following text:

A large language model is a type of artificial intelligence that is trained to understand and generate human language. These models are designed to process and comprehend natural language input, and can be used for a variety of tasks such as language translation, sentiment analysis, and chatbot development. They are typically very large neural networks that have been pre-trained on vast amounts of text data, allowing them to learn the nuances of language and make intelligent predictions about how to respond to different inputs. Large language models have become increasingly popular in recent years due to their ability to handle complex language tasks and their potential applications in fields such as customer service, content creation, and education.

@amyeroberts (Collaborator)

cc @younesbelkada @SunMarc

@SunMarc (Member) commented Apr 30, 2024

Hi @bozheng-hit, thanks for reporting! I can indeed reproduce the error, and it also happens with the Mixtral models. I'm not sure what the best fix is for now, since adding back the `top_x.shape[0] == 0` condition would break fx tracing for Qwen MoE and Mixtral, and inference works fine on the original model. WDYT @amyeroberts @ArthurZucker? LMK if you come up with a solution @bozheng-hit; I will try to find one too.
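
For context, the guard under discussion sits in the sparse MoE block's expert-dispatch loop (the `index_add_` line in the traceback above). The following is a minimal, self-contained sketch of that pattern with the pre-#30209 guard restored; the function and variable names are illustrative, not the exact modeling_qwen2_moe.py source:

import torch

# Minimal sketch of the expert-dispatch loop discussed above (illustrative names,
# not the exact modeling code). `hidden_states` is (num_tokens, hidden_dim),
# `routing_weights` is (num_tokens, top_k), `selected_experts` holds the top-k
# expert indices per token, and `experts` is a list of per-expert MLP modules.
def dispatch_to_experts(hidden_states, routing_weights, selected_experts, experts, num_experts):
    hidden_dim = hidden_states.shape[-1]
    final_hidden_states = torch.zeros_like(hidden_states)
    # (num_experts, top_k, num_tokens) one-hot mask of which tokens go to which expert
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=num_experts).permute(2, 1, 0)

    for expert_idx in range(num_experts):
        idx, top_x = torch.where(expert_mask[expert_idx])
        # Pre-#30209 guard: skip experts that received no tokens. Without it, an
        # empty `top_x` flows through the quantized expert layers, which appears
        # to be what triggers the "invalid configuration argument" CUDA error above.
        if top_x.shape[0] == 0:
            continue
        current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
        current_hidden_states = experts[expert_idx](current_state) * routing_weights[top_x, idx, None]
        final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
    return final_hidden_states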

@amyeroberts (Collaborator)

@SunMarc Would having a conditional check, e.g. `if not is_tracing() and top_x.shape[0] == 0:`, work as a partial fix?

@SunMarc (Member) commented Apr 30, 2024

Thanks for the tip @amyeroberts, but it doesn't work ;). However, I tested the exllamav2 kernel and generation works with it, so the exllamav1 kernel must have some issue. A potential fix would be to change the `quantization_config` inside the config.json so that users get the exllamav2 kernel by default. WDYT @bozheng-hit? You would have to set `version` to 2 in the `exllama_config` field of the model repo's config.json.
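
Until the checkpoint's config.json is updated, a user can also opt into the exllamav2 kernel at load time by passing a GPTQConfig override; a minimal sketch, assuming the same model id as in the reproduction above:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# Sketch: override the checkpoint's GPTQ settings so the exllamav2 kernel is
# used instead of exllamav1 (which appears to be the problematic one).
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)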
