ggml-qnn: add Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend #6869
Conversation
Nice. With competent LLMs getting smaller and more efficient, and Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU cluster. This will make llama.cpp a robust backend for the future and will lead to power-efficient LLMs on the go. Personally, I really can't wait!
Thanks for your comment. This PR is a very initial implementation and could be a good starting point for Qualcomm's QNN backend in GGML; it would be better if domain experts from Qualcomm joined this effort after it's accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible. One more thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to help community developers participate in developing and verifying the QNN backend.
Yes, it would be useful to have an example or instructions for how to run this. In the meantime, simply setting up the …
Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend.
You would need to modify line 411 of test-backend-ops.cpp (at commit 5477041).
Thanks for your help, that's really useful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android.
@ggerganov, @slaren, sorry to interrupt. Adapting test-backend-ops.cpp to the QNN backend is done and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3). Could you take a moment to look at it? Thanks. BTW, the design and implementation of test-backend-ops.cpp are really excellent; I never noticed this file/feature before. Also, should README-qnn.md be removed?
Thanks for the fix, good job! Now working on running this branch on my phone; I'll leave a note here if I run into any problems!
```cpp
qnn_instance * instance = nullptr;
std::string graph_name = "ggml_op_qnn_add";
Qnn_GraphHandle_t graph_handle = nullptr;
Qnn_Tensor_t * tensor_0 = nullptr;
```
Created a PR on your fork to simplify the binding from Qnn_Tensor_t to ggml_tensor; please have a look if you have time: zhouwg#2
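For readers following along: the binding essentially amounts to filling a Qnn_Tensor_t (v1 layout) from a ggml_tensor's shape and data pointer. A minimal sketch, assuming the QNN SDK's QnnTypes.h, an F32 tensor, and a hypothetical helper name; the exact macro and field names should be double-checked against the SDK headers:

```cpp
// Sketch: wrap an existing ggml_tensor as a client-owned QNN tensor.
// Field names follow the Qnn_TensorV1_t layout from QnnTypes.h.
#include "QnnTypes.h"   // QNN SDK
#include "ggml.h"

static Qnn_Tensor_t ggml_tensor_to_qnn(const ggml_tensor * t,
                                       uint32_t dims[GGML_MAX_DIMS] /* out */) {
    // ggml stores dimensions as int64_t ne[4]; QNN wants uint32_t.
    // Note that ggml's ne[0] is the innermost (contiguous) dimension,
    // so the order may need reversing depending on the op's convention.
    for (int i = 0; i < GGML_MAX_DIMS; i++) {
        dims[i] = (uint32_t) t->ne[i];
    }

    Qnn_Tensor_t qt = QNN_TENSOR_INIT;
    qt.version               = QNN_TENSOR_VERSION_1;
    qt.v1.name               = t->name;
    qt.v1.type               = QNN_TENSOR_TYPE_APP_WRITE;          // graph input
    qt.v1.dataFormat         = QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER;
    qt.v1.dataType           = QNN_DATATYPE_FLOAT_32;              // assumes F32
    qt.v1.rank               = (uint32_t) ggml_n_dims(t);
    qt.v1.dimensions         = dims;
    qt.v1.memType            = QNN_TENSORMEMTYPE_RAW;              // client buffer
    qt.v1.clientBuf.data     = t->data;
    qt.v1.clientBuf.dataSize = (uint32_t) ggml_nbytes(t);
    return qt;
}
```

The point of the helper is that the QNN tensor only aliases the ggml tensor's memory; no data is copied until the graph runs.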
```cpp
 * mul_mat_f16_f32: src0 is F16 and src1 is F32.
 * mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
 */
static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,
```
Looks like graphExecute failed with error 6004; maybe we can use that to find the root cause here.
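For anyone chasing this, a first step is surfacing the raw return code at the call site. A minimal fragment, assuming `qnn_interface` is the already-resolved QNN interface table and the tensor arrays are already bound (all variable names here are illustrative):

```cpp
// Sketch: log the raw QNN error code from graph execution, so it can be
// matched against the error enums in the SDK's QnnGraph.h.
Qnn_ErrorHandle_t err = qnn_interface.graphExecute(
        graph_handle,
        qnn_inputs,  n_inputs,     // Qnn_Tensor_t[] bound to ggml data
        qnn_outputs, n_outputs,
        nullptr /* profile */, nullptr /* signal */);
if (err != QNN_SUCCESS) {
    printf("ggml-qnn: graphExecute failed with error %d\n", (int) err);
}
```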
To reproduce, you could use my patch that constant-initializes the test tensors: llama.cpp-5e18cdc-init the test array with const values.patch. It just changes the tensor init in the unit test so we can reproduce the issue more easily.
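The idea is simply to swap the random initialization for fixed values, so a failure is deterministic and the expected output is easy to compute by hand. A sketch of such a helper (hypothetical name, assuming an F32 tensor whose data is host-accessible):

```cpp
// Sketch: deterministic tensor init for debugging, replacing random data.
#include "ggml.h"

static void init_tensor_const(ggml_tensor * t, float value) {
    GGML_ASSERT(t->type == GGML_TYPE_F32);   // F32 only in this sketch
    float * data = (float *) t->data;
    const int64_t n = ggml_nelements(t);
    for (int64_t i = 0; i < n; i++) {
        data[i] = value;                     // e.g. 1.0f for src0, 2.0f for src1
    }
}
```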
I tried building in Termux.
Purpose
Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023, with a market share of 70.1 percent.
Qualcomm is currently the No. 1 mobile SoC semiconductor company on the planet (MediaTek's market share was No. 1 in Q1 2024, but I personally think Qualcomm is the real No. 1 mobile SoC vendor). The Hexagon NPU in the Qualcomm Snapdragon 8 Gen 3 was designed for generative AI, delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, which makes the Hexagon NPU a leading processor for on-device AI inferencing.
The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:
This PR aims to add Qualcomm's QNN backend to ggml, a very compact, well-designed, highly optimized, high-performance C/C++ machine learning framework/library, and focuses accordingly on how to utilize the Hexagon NPU maximally within that framework.
Status
The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.
4x performance gain for GGML_OP_MUL_MAT using the QNN CPU backend with 1 thread on a high-end Android phone with a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC (released in Oct 2023). The performance of GGML_OP_MUL_MAT should improve much further with the QNN NPU (aka Hexagon Tensor Processor) backend once the secrets of Qualcomm's NPU (QNN RPC, multithreading in the NPU backend, ...) are understood.
A dedicated Android command line program (for UT purposes) works as expected on a high-end Android phone with a Qualcomm SM8650-AB Snapdragon 8 Gen 3 and on low-end Android phones with Qualcomm's low-end mobile SoCs (the QNN NPU backend does not work on Qualcomm low-end Android phones).
QNN's RPC feature (useful for the QNN NPU, aka HTP/DSP, backend) is used in this PR and works as expected: there are 2+ GB of ION memory available for offloading ggml tensors in a cgraph to the NPU on a Snapdragon 8 Gen 3 equipped Android phone (a sketch of the registration path follows at the end of this section).
This PR is a minimum-viable, functional PR in the ggml community. It would be very helpful for other community programmers/developers/AI experts to contribute code and ideas to the GGML QNN backend if this PR is approved and merged to the master branch. Together we might reach the final target: utilize the Hexagon NPU maximally with the highly compact, well-designed ggml machine learning framework. This might be the exact GGML way in the GGML community.
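For context on how the RPC path mentioned above works: the backend allocates shared (ION) memory through Qualcomm's rpcmem allocator and registers it with the QNN context, so the NPU can access tensor data without an extra copy. A rough sketch, assuming the libcdsprpc.so symbols and the QnnMem.h descriptor layout; the helper name is hypothetical, and the constants and field names should be verified against the SDK headers:

```cpp
// Sketch: back a buffer with ION memory the NPU can access directly.
// rpcmem_alloc/rpcmem_to_fd live in libcdsprpc.so and are resolved at runtime.
#include <dlfcn.h>
#include "QnnInterface.h"   // QNN SDK

typedef void * (*pfn_rpcmem_alloc)(int heapid, uint32_t flags, int size);
typedef int    (*pfn_rpcmem_to_fd)(void * po);

static Qnn_MemHandle_t register_rpc_buffer(const QNN_INTERFACE_VER_TYPE & iface,
                                           Qnn_ContextHandle_t context,
                                           uint32_t rank, uint32_t * dims, int nbytes) {
    void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
    if (lib == nullptr) {
        return nullptr;
    }
    pfn_rpcmem_alloc rpcmem_alloc = (pfn_rpcmem_alloc) dlsym(lib, "rpcmem_alloc");
    pfn_rpcmem_to_fd rpcmem_to_fd = (pfn_rpcmem_to_fd) dlsym(lib, "rpcmem_to_fd");

    // 25 = RPCMEM_HEAP_ID_SYSTEM, 1 = RPCMEM_DEFAULT_FLAGS (per rpcmem.h).
    void * buf = rpcmem_alloc(25, 1, nbytes);
    int    fd  = rpcmem_to_fd(buf);

    // Register the ION fd with the QNN context; graph tensors can then use
    // memType = QNN_TENSORMEMTYPE_MEMHANDLE instead of a raw client buffer.
    Qnn_MemDescriptor_t desc = QNN_MEM_DESCRIPTOR_INIT;
    desc.memShape   = {rank, dims, nullptr};
    desc.dataType   = QNN_DATATYPE_FLOAT_32;   // assumes an F32 tensor
    desc.memType    = QNN_MEM_TYPE_ION;
    desc.ionInfo.fd = fd;

    Qnn_MemHandle_t handle = nullptr;
    if (iface.memRegister(context, &desc, 1, &handle) != QNN_SUCCESS) {
        return nullptr;
    }
    return handle;
}
```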
Todo
Qualcomm's QNN backend for GGML has some todo tasks before it can be used in a real commercial application. It lacks implementations of the other GGML OPs using the QNN API. I provide a GENERAL approach to this problem in a standalone PR: refine the ggml backend subsystem so that mixed inference between CPU&GPU / CPU&NPU works easily for ANY ggml backend whose ggml_backend_xxx_buffer_is_host returns true (see the sketch after this list). This approach works as expected with whisper inference and llama inference in my personal ggml learning/study project.
Add support for more quantized data types (AI experts should be involved here).
Performance fine-tuning: the performance of the existing ggml QNN backend is weaker than the original ggml because some of Qualcomm's sophisticated dedicated technologies are not used in this PR, and the power of Qualcomm's state-of-the-art NPU (Hexagon Tensor Processor) is not yet utilized (I know the direction but am limited by my knowledge of real/hardcore AI tech). Performance fine-tuning in the ggml qnn-npu backend is a long-term task.
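To make the fallback idea in the first item concrete: as long as the backend's buffers are host-accessible, the scheduler can route any op the backend rejects to the CPU backend. A simplified sketch of the gate, with names modeled on ggml's backend interface; the op/type list is illustrative, not this PR's exact coverage:

```cpp
// Sketch: advertise only the ops this backend implements; everything else
// falls back to the CPU backend. This works because the QNN backend's
// buffers are host-accessible (buffer_is_host() == true), so the CPU can
// operate on the same tensor data without copies.
static bool ggml_backend_qnn_supports_op(ggml_backend_t backend, const ggml_tensor * op) {
    GGML_UNUSED(backend);
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_MUL_MAT:
            // only F32/F16 sources are wired up in this sketch
            return op->src[0]->type == GGML_TYPE_F32 ||
                   op->src[0]->type == GGML_TYPE_F16;
        default:
            return false;
    }
}
```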
How to verify QNN backend or participate in development activity of GGML QNN backend
I provide a dedicated Android command line program and scripts in this PR for UT purposes on an Android device; a sketch of its overall shape follows below.
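For orientation, such a UT program builds a one-op graph, runs it on the QNN backend, and compares against a CPU-backend run. A minimal sketch; ggml_backend_qnn_init is the constructor added by this PR, and its exact signature and device numbering may differ:

```cpp
// Sketch: minimal UT skeleton for the QNN backend on Android.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

int main() {
    // device 0 (QNN CPU) is assumed here; the PR also exposes GPU/NPU devices
    ggml_backend_t backend = ggml_backend_qnn_init(0, "/data/local/tmp/");

    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,   // tensor data lives in the backend buffer
    };
    struct ggml_context * ctx = ggml_init(params);

    ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // ... fill a and b (e.g. with the constant-init helper shown earlier),
    // compute, then compare c against the same graph run on the CPU backend
    ggml_backend_graph_compute(backend, gf);

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```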
A suitable/qualified reviewer should be familiar with the ggml source code and with the Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or another Qualcomm AI software stack; skill in real/hardcore AI tech is a plus (adding more quantized data types and implementing more GGML OPs/kernels requires it) but is not essential for this PR. Some notes for potential qualified reviewers:
Any GGML community programmer/developer/AI expert interested in the GGML QNN backend can use and extend the dedicated Android command line program to verify it; reviews are greatly welcomed and appreciated.