Inference bug of the MoE GPTQ models #30515

Open

bozheng-hit opened this issue Apr 27, 2024 · 4 comments

@bozheng-hit (Contributor)

System Info

Generating with GPTQ MoE models fails with the following error after merging PR #30209. @younesbelkada @SunMarc

The error traceback is below; the model generates successfully once I revert the change to modeling_qwen2_moe.py.

Traceback (most recent call last):
  File "/home/data/roy.zb/workspace/test_auto_gptq.py", line 23, in <module>
    generated_ids = model.generate(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 1656, in generate
    result = self._sample(
  File "/home/data/roy.zb/workspace/transformers/src/transformers/generation/utils.py", line 2819, in _sample
    outputs = self(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1355, in forward
    outputs = self.model(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1224, in forward
    layer_outputs = decoder_layer(
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 934, in forward
    hidden_states = self.mlp(hidden_states)
  File "/cpfs01/shared/public/xingzhang.rxz/anaconda3/envs/qwen_moe/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/data/roy.zb/workspace/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 856, in forward
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The code to reproduce the error is here:

from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda" # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Running this produces the same traceback shown above.

Expected behavior

Output the following text:

A large language model is a type of artificial intelligence that is trained to understand and generate human language. These models are designed to process and comprehend natural language input, and can be used for a variety of tasks such as language translation, sentiment analysis, and chatbot development. They are typically very large neural networks that have been pre-trained on vast amounts of text data, allowing them to learn the nuances of language and make intelligent predictions about how to respond to different inputs. Large language models have become increasingly popular in recent years due to their ability to handle complex language tasks and their potential applications in fields such as customer service, content creation, and education.

@amyeroberts (Collaborator)

cc @younesbelkada @SunMarc

@SunMarc (Member) commented Apr 30, 2024

Hi @bozheng-hit, thanks for reporting! I can indeed reproduce the error, and it also happens with the Mixtral models. I'm not sure what the best fix is for now, since adding back the `top_x.shape[0] == 0` condition would break fx tracing for Qwen MoE and Mixtral, and inference works fine on the original model. WDYT @amyeroberts @ArthurZucker? LMK if you come up with a solution @bozheng-hit; I will try to find one too.
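
For context, the guard under discussion sits in the sparse MoE block's expert-dispatch loop (the `index_add_` line in the traceback above). The following is a minimal, self-contained sketch of that pattern with the pre-#30209 guard restored; the function and variable names are illustrative, not the exact modeling_qwen2_moe.py source:

import torch

# Minimal sketch of the expert-dispatch loop discussed above (illustrative names,
# not the exact modeling code). `hidden_states` is (num_tokens, hidden_dim),
# `routing_weights` is (num_tokens, top_k), `selected_experts` holds the top-k
# expert indices per token, and `experts` is a list of per-expert MLP modules.
def dispatch_to_experts(hidden_states, routing_weights, selected_experts, experts, num_experts):
    hidden_dim = hidden_states.shape[-1]
    final_hidden_states = torch.zeros_like(hidden_states)
    # (num_experts, top_k, num_tokens) one-hot mask of which tokens go to which expert
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_classes=num_experts).permute(2, 1, 0)

    for expert_idx in range(num_experts):
        idx, top_x = torch.where(expert_mask[expert_idx])
        # Pre-#30209 guard: skip experts that received no tokens. Without it, an
        # empty `top_x` flows through the quantized expert layers, which appears
        # to be what triggers the "invalid configuration argument" CUDA error above.
        if top_x.shape[0] == 0:
            continue
        current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
        current_hidden_states = experts[expert_idx](current_state) * routing_weights[top_x, idx, None]
        final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
    return final_hidden_states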

@amyeroberts (Collaborator)

@SunMarc Would having a conditional check, e.g. `if not is_tracing() and top_x.shape[0] == 0:`, work as a partial fix?

@SunMarc (Member) commented Apr 30, 2024

Thanks for the tip @amyeroberts, but it doesn't work ;). However, I tested the exllamav2 kernel and generation works with it, so the exllamav1 kernel must have some issue. A potential fix would be to change the `quantization_config` inside the config.json so that users get the exllamav2 kernel by default. WDYT @bozheng-hit? You would have to set `version` to 2 in the `exllama_config` field of the model repo's config.json.
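
Until the checkpoint's config.json is updated, a user can also opt into the exllamav2 kernel at load time by passing a GPTQConfig override; a minimal sketch, assuming the same model id as in the reproduction above:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4"

# Sketch: override the checkpoint's GPTQ settings so the exllamav2 kernel is
# used instead of exllamav1 (which appears to be the problematic one).
gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)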
