
Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 demo encounters RuntimeError: CUDA error: invalid configuration argument #385

Closed
MasterYi1024 opened this issue May 8, 2024 · 6 comments


@MasterYi1024

Hi :)

I'm running the Qwen MoE demo code from the Qwen blog, but I get this error:

2024-05-08 13:49:59.192500: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-05-08 13:49:59.470650: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-08 13:49:59.470769: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-08 13:49:59.509938: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-08 13:49:59.596210: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-08 13:50:00.140439: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
True
2024-05-08 13:50:01.110073: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-08 13:50:01.282622: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-05-08 13:50:01.283040: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/home/kylin/ai/downloads/transformers/src/transformers/modeling_utils.py:4390: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:10<00:00,  3.61s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/kylin/ai/python/qwen.py", line 113, in <module>
    generated_ids = model.generate(
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kylin/ai/downloads/transformers/src/transformers/generation/utils.py", line 1679, in generate
    result = self._sample(
  File "/home/kylin/ai/downloads/transformers/src/transformers/generation/utils.py", line 2468, in _sample
    outputs = self(
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kylin/ai/downloads/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1350, in forward
    outputs = self.model(
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kylin/ai/downloads/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 1219, in forward
    layer_outputs = decoder_layer(
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kylin/ai/downloads/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 929, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kylin/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kylin/ai/downloads/transformers/src/transformers/models/qwen2_moe/modeling_qwen2_moe.py", line 851, in forward
    final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
RuntimeError: CUDA error: invalid configuration argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Some of my device information:

# gpu
RTX 4090 24G

# pip packages
safetensors                       0.4.3
tensorboard                       2.16.2
tensorboard-data-server           0.7.2
tensorflow                        2.15.0.post1
tensorflow-estimator              2.15.0
tensorflow-io-gcs-filesystem      0.36.0
tensorrt                          10.0.1
tensorrt-cu12                     10.0.1
tensorrt-cu12-bindings            10.0.1
tensorrt-cu12-libs                10.0.1
tensorrt-dispatch                 10.0.1
tensorrt-lean                     10.0.1
llama_cpp_python_cuda             0.2.43+cu121
llama_cpp_python_cuda_tensorcores 0.2.43+cu121
nvidia-cuda-cupti-cu11            11.7.101
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu11            11.7.99
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu11          11.7.99
nvidia-cuda-runtime-cu12          12.1.105
pycuda                            2024.1
nvidia-cudnn-cu12                 8.9.2.26
transformers                      4.41.0.dev0       /home/kylin/ai/downloads/transformers

I have searched for a long time, but with no luck. Could anyone help? Thanks a lot :)

code:

from transformers import AutoModelForCausalLM, AutoTokenizer

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

model = AutoModelForCausalLM.from_pretrained(
    "/home/kylin/ai/models/Qwen/MoE/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("/home/kylin/ai/models/Qwen/MoE/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
@jklj077 (Contributor) commented May 8, 2024

According to @bozheng-hit, huggingface/transformers@304c6a1 breaks GPTQ for Qwen1.5-MoE. Please try an earlier snapshot of the transformers repo.

@jklj077 (Contributor) commented May 8, 2024

Please also see this issue at transformers: huggingface/transformers#30515

@MasterYi1024 (Author)

Thank you so much @jklj077, I reverted that PR and it worked.

This is the response to just a "hello":

start:  1715160502.3874416
generated_ids.shape:  torch.Size([1, 52])
generated_ids:  tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847, 151645,
            198, 151644,    872,    198,  14990, 151645,    198, 151644,  77091,
            198,   9707,      0,   2585,    646,    358,   1492,    498,   3351,
             30,   2160,   1052,   2494,   3151,    498,   1035,   1075,    311,
           1414,    476,   4263,     30,    358,   2776,   1588,    311,   4226,
            894,   4755,    498,   2578,    614,     13, 151645]],
       device='cuda:0')
generated:  1715160505.0855026
generate time:  2.698057174682617
generated_ids2:  [tensor([  9707,      0,   2585,    646,    358,   1492,    498,   3351,     30,
          2160,   1052,   2494,   3151,    498,   1035,   1075,    311,   1414,
           476,   4263,     30,    358,   2776,   1588,    311,   4226,    894,
          4755,    498,   2578,    614,     13, 151645], device='cuda:0')]
generated2:  1715160505.0857866
decoded:  1715160505.0859225
Hello! How can I help you today? Is there something specific you would like to know or discuss? I'm here to answer any questions you might have.

It took about 2.7s to generate 52 tokens on the RTX 4090. Is this right? It seems a little slow to me.
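
For reference, the epoch-style timestamps in the output above look like time.time() values. A minimal sketch of such a measurement, reusing model and model_inputs from the code above (the tokens/s calculation is an addition for illustration, not from the original qwen.py):

import time

start = time.time()
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512)
elapsed = time.time() - start

# generated_ids includes the prompt tokens, so count only the newly generated ones.
new_tokens = generated_ids.shape[1] - model_inputs.input_ids.shape[1]
print("generate time: ", elapsed)
print("tokens/s: ", new_tokens / elapsed)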

@MasterYi1024 (Author)

This is a response from Qwen1.5-4B-Chat:

> hello
cuda:0
start:  1715216908.1639924
generated_ids.shape:  torch.Size([1, 59])
generated:  1715216909.3518546
generate time:  1.1878581047058105
generated2:  1715216909.3518755
decoded:  1715216909.3519962
Hello! How can I help you today? Is there something specific you would like to know or discuss? I'm here to answer any questions you might have. feel free to ask me anything.

It took about 1.2s to generate 59 tokens on the RTX 4090, about twice as fast as the Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 model.

@jklj077 (Contributor) commented May 9, 2024

Hi, thanks for sharing your perf results!

We believe that evaluating performance through metrics such as speed can prove challenging due to the multitude of influencing factors.

  1. Quantization Impact: Typically, quantization introduces processing overhead due to the necessity of dequantization. However, it may yield benefits when it significantly reduces memory bandwidth requirements and is implemented efficiently. The primary advantage lies in its reduction of memory footprint, enabling the handling of more concurrent requests and thereby potentially enhancing overall throughput.
  2. Implementation Variability: Because transformers has distinct backend implementations (attention may go through SDPA, and GPTQ may use the exllama v2 kernels), direct comparisons are complex; a sketch of how these backends can be selected explicitly follows below. Further, models like 4B and MoE-A2.7B differ in scale, which adds another layer of difficulty to judging whether a comparison is reasonable.
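
As an illustration of the backends mentioned in item 2, here is a minimal sketch of selecting them explicitly when loading the checkpoint in transformers. It assumes optimum and auto-gptq are installed; whether these settings match the defaults actually used for this checkpoint is not verified here.

from transformers import AutoModelForCausalLM, GPTQConfig

# Sketch only: explicitly request the SDPA attention backend and the
# exllama v2 GPTQ kernels rather than relying on the defaults.
gptq_config = GPTQConfig(bits=4, use_exllama=True, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained(
    "/home/kylin/ai/models/Qwen/MoE/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    device_map="auto",
    attn_implementation="sdpa",
    quantization_config=gptq_config,
)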

For a performance assessment, we recommend referring to our results at https://qwenlm.github.io/blog/qwen-moe/#costs-and-efficiency. There we use vLLM with Qwen1.5-7B-Chat and Qwen1.5-MoE-A2.7B-Chat, so the impact of quantization and implementation is ruled out for a relatively fair comparison.
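
For reference, a minimal sketch of what such a vLLM run might look like, assuming vllm is installed and using the Hugging Face hub id Qwen/Qwen1.5-MoE-A2.7B-Chat; for brevity the prompt is passed raw rather than through the chat template, and the sampling parameters are arbitrary:

from vllm import LLM, SamplingParams

# Sketch only: load the unquantized MoE chat model with vLLM and generate.
llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat")
sampling = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Give me a short introduction to large language model."], sampling)
print(outputs[0].outputs[0].text)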

While that provides a foundation for understanding performance, real-world applications still require nuanced decisions: the specific hardware configuration and the availability of optimized software implementations can significantly influence a model's actual performance. Users are encouraged to evaluate models under their own requirements and constraints.

@MasterYi1024 (Author)

Thank you for your reply :)

I will check this out on my own.
