ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869

Open: wants to merge 17 commits into master from qualcomm_qnn_backend_for_ggml

Conversation

@zhouwg (Contributor) commented Apr 24, 2024

Self Reported Review Complexity

  • Review Complexity : Low
  • Review Complexity : Medium
  • Review Complexity : High
  • I have read the contributing guidelines

Purpose

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with a market share of 70.1 percent.

Qualcomm is currently the leading mobile SoC semiconductor company (MediaTek was No. 1 by market share in Q1 2024, but I personally think Qualcomm is the real No. 1 mobile SoC vendor). The Hexagon NPU in the Qualcomm Snapdragon 8 Gen 3 was designed for generative AI, delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, which makes the Hexagon NPU a leading processor for on-device AI inferencing.

The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

  • TensorFlow: tf-1.15.0, or tf-2.10.1
  • TFLite: tflite-2.3.0
  • PyTorch: torch-1.13.1
  • ONNX: onnx-1.11.0

ggml is a very compact, well-designed, highly optimized C/C++ machine learning framework/library. This PR aims to add a Qualcomm QNN backend for ggml and focuses on one question accordingly: how to utilize the Hexagon NPU maximally within this compact, well-designed framework.

Status

The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.

[attachment 319780607]

[attachment 504893116]

A 4x performance gain for GGML_OP_MUL_MAT was measured using the QNN CPU backend with one thread on a high-end Android phone equipped with a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC (released in October 2023). GGML_OP_MUL_MAT performance should improve much more with the QNN NPU (aka Hexagon Tensor Processor) backend once we understand the secrets of Qualcomm's NPU (QNN RPC, multithreading in the NPU backend, ...).

[attachment 1922265373]

[attachment 250505401]

A dedicated Android command line program (for unit testing) works as expected on a high-end Android phone with the Qualcomm SM8650-AB Snapdragon 8 Gen 3 and on low-end Android phones with other low-end Qualcomm mobile SoCs (the QNN NPU backend does not work on low-end Qualcomm phones).
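
For orientation, the log below corresponds roughly to the following flow inside the UT program. This is a minimal sketch, not the PR's actual source: ggml_backend_qnn_init and its device/lib-path arguments match the log output, everything else is the standard ggml backend API, and the exact argument order of the init call is an assumption.

    #include "ggml.h"
    #include "ggml-alloc.h"
    #include "ggml-backend.h"
    #include "ggml-qnn.h"  // assumed backend header from this PR

    static void run_qnn_ut(void) {
        // device 0 = QNN-CPU, 1 = QNN-GPU, 2 = QNN-NPU (see the script options below)
        ggml_backend_t backend = ggml_backend_qnn_init(0, "/data/local/tmp/");

        // "Allocating Memory of size 33554432 bytes, 32 MB" in the log
        struct ggml_init_params params = {
            /*.mem_size   =*/ 32 * 1024 * 1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,  // tensor data will live in a backend buffer
        };
        struct ggml_context * ctx = ggml_init(params);

        // "creating new tensors" / "creating backend buffer" / "creating compute graph"
        struct ggml_tensor * src0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
        struct ggml_tensor * src1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
        struct ggml_tensor * dst  = ggml_add(ctx, src0, src1);  // or ggml_mul_mat(ctx, src0, src1)

        ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, dst);
        ggml_backend_graph_compute(backend, gf);  // dispatched via ggml_qnn_can_handle_op, as seen in the log

        ggml_backend_buffer_free(buffer);
        ggml_free(ctx);
        ggml_backend_free(backend);
    }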
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 16.3 MB/s (4567168 bytes in 0.267s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:2(ADD)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a2a43bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:ADD, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_add, 2574]: call ggml_qnn_add
    
    [ggml_qnn_add, 2578]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2582]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2586]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2587]: 4, 4, 1, 1
    [ggml_qnn_add, 2588]: tensor0 name tensor_0
    [ggml_qnn_add, 2589]: tensor1 name tensor_1
    [ggml_qnn_add, 2590]: tensor2 name tensor_2
    [ggml_qnn_add, 2617]: graph name ggml_op_qnn_add_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     11.5ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseAdd 
    [ggml_qnn_logcallback, 2165]:     11.5ms [VERBOSE] validate	Node-Type : ElementWiseAdd	Node-Name : ggml_op_add 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::execute 
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.84     0.23    -0.07    -0.25 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.10    -0.32    -0.96     0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.63    -0.59     0.29    -1.00 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.01     0.10     0.92     0.54 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.99    -0.43    -0.41    -0.44 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.06     0.64    -0.61    -0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.86    -0.11     0.41     0.27 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.54    -0.70    -0.90    -0.13 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.15    -0.19    -0.48    -0.69 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.04     0.32    -1.57    -0.70 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -1.49    -0.70     0.70    -0.73 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.53    -0.60     0.02     0.42 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a5b40bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_hanlde_op, 2993]: call ggml_qnn_hanlde_op
    
    [ggml_qnn_hanlde_op, 2997]:        tensor_0: type = 0 (  f32)  ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3001]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3005]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3006]: 4, 4, 1, 1
    [ggml_qnn_hanlde_op, 3007]: tensor0 name tensor_0
    [ggml_qnn_hanlde_op, 3008]: tensor1 name tensor_1
    [ggml_qnn_hanlde_op, 3009]: tensor2 name tensor_2
    [ggml_qnn_hanlde_op, 3033]: qnn graph name ggml_qnn_graph_MUL1tensor_0_tensor_1
    [ggml_qnn_hanlde_op, 3034]: qnn op_config name ggml_qnn_op_config_MUL1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     17.7ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseMultiply 
    [ggml_qnn_logcallback, 2165]:     17.8ms [VERBOSE] validate	Node-Type : ElementWiseMultiply	Node-Name : ggml_qnn_op_config_MUL1tensor_0_tensor_1 
    [ggml_qnn_logcallback, 2165]:     18.0ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     18.1ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_hanlde_op, 3134]: duration of ggml_qnn_MUL : 0 milliseconds
    
    [ggml_qnn_hanlde_op, 3135]: call ggml_qnn_hanlde_op done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.62     0.59    -0.34     0.40 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.81     0.33     0.52     0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.37     0.43     0.97     0.06 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.28     0.09    -0.57    -0.02 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.24    -0.57    -0.17     0.36 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.83    -0.64     0.23    -0.87 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.25    -0.31     0.55     0.64 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.42     0.42     0.96     0.88 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.15    -0.34     0.06     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.67    -0.21     0.12    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.09    -0.13     0.53     0.04 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.12     0.04    -0.55    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:MUL
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 20.3 MB/s (4567168 bytes in 0.215s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:23(MUL_MAT)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a50a2049bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL_MAT, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_can_handle_op, 2467]: GGML_OP_MUL_MAT
    [ggml_qnn_can_handle_op, 2472]: src0        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2477]: src1        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2483]:             tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2785]: call ggml_qnn_mul_mat
    
    [ggml_qnn_mul_mat, 2789]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2793]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2797]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2798]: 4, 4, 1, 1
    [ggml_qnn_mul_mat, 2799]: tensor0 name tensor_0
    [ggml_qnn_mul_mat, 2800]: tensor1 name tensor_1
    [ggml_qnn_mul_mat, 2801]: tensor2 name tensor_2
    [ggml_qnn_mul_mat, 2828]: graph name ggml_op_qnn_mul_mat_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     16.9ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : MatMul 
    [ggml_qnn_logcallback, 2165]:     17.0ms [VERBOSE] validate	Node-Type : MatMul	Node-Name : ggml_op_mul_mat 
    [ggml_qnn_logcallback, 2165]:     17.1ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     17.2ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_mul_mat, 2927]: duration of ggml_qnn_mul_mat : 10 milliseconds
    
    [ggml_qnn_mul_mat, 2928]: call ggml_qnn_mul_mat done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.05     0.68    -0.27    -0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.47     0.77     0.41     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.69    -0.71    -0.81    -0.23 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.37     0.36    -0.26     0.61 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.48    -0.81    -0.61     0.53 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.04     0.87     0.64     0.17 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.22     0.94    -0.38    -0.78 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.97    -0.94    -0.35     0.94 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.97    -0.79    -0.47     0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.33     0.24     0.56    -0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.16    -0.20     0.95    -0.08 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.48     0.09    -0.20     0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
QNN's RPC feature (useful for the QNN NPU, aka HTP/DSP, backend) is used in this PR and works as expected. On a Snapdragon 8 Gen 3 Android phone, more than 2 GB of ION memory is available for offloading the ggml tensors of a cgraph to the NPU.
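
For reference, the RPC path boils down to allocating tensor buffers from ION shared memory via the rpcmem API (libcdsprpc.so from the Hexagon/FastRPC stack) and then registering the resulting file descriptor with the QNN context, so the NPU can read the data without extra copies. Below is a minimal sketch of the allocation half, assuming the rpcmem API as it appears in the Hexagon SDK headers; the heap id and flag values are assumptions:

    #include <dlfcn.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef void * (*pfn_rpc_mem_alloc)(int heapid, uint32_t flags, int size);
    typedef int    (*pfn_rpc_mem_to_fd)(void * buf);

    // allocate an ION buffer that can be shared with the Hexagon NPU; the
    // returned fd is what later gets registered with the QNN context
    static void * alloc_rpcmem(size_t nbytes, int * out_fd) {
        void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
        if (lib == NULL) return NULL;
        pfn_rpc_mem_alloc rpcmem_alloc = (pfn_rpc_mem_alloc) dlsym(lib, "rpcmem_alloc");
        pfn_rpc_mem_to_fd rpcmem_to_fd = (pfn_rpc_mem_to_fd) dlsym(lib, "rpcmem_to_fd");
        if (rpcmem_alloc == NULL || rpcmem_to_fd == NULL) return NULL;

        const int      RPCMEM_HEAP_ID_SYSTEM = 25; // assumption: system heap id
        const uint32_t RPCMEM_DEFAULT_FLAGS  = 1;  // assumption: cached mapping
        void * buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) nbytes);
        if (buf != NULL) {
            *out_fd = rpcmem_to_fd(buf);
        }
        return buf;
    }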
This PR is a functional, Minimum Viable PR (MVP) style contribution to the ggml community. If it can be approved and merged to the master branch, it will make it much easier for other community developers and AI experts to contribute code and ideas to the GGML QNN backend. Together we could reach the final target: utilize the Hexagon NPU maximally within the well-designed, compact ggml machine learning framework. This might be the exact GGML way in the GGML community.

Todo

Qualcomm's QNN backend for GGML still has some TODO items before it can be used in real commercial applications; the NPU unit-test output below illustrates the current gap (GGML_OP_ADD takes 532 milliseconds on the QNN-NPU backend versus 3 milliseconds with original ggml):
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.16     0.85    -0.80    -0.25 
   -0.28     0.66     0.98     0.67 
   -0.15     0.78    -0.45    -0.50 
    0.92     0.31    -0.72    -0.46 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.53     0.86    -0.91    -0.27 
    0.62     0.35    -0.27     0.43 
    0.73     0.42    -0.81    -0.24 
    0.49     0.81    -0.88     0.64 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.69     1.70    -1.70    -0.52 
    0.34     1.02     0.71     1.10 
    0.58     1.19    -1.26    -0.74 
    1.41     1.12    -1.60     0.18 

[ggml_backend_qnn_free, 3286]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3288]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 3300]: graph type:ADD
[qnn_finalize, 1258]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3313]: leave ggml_backend_qnn_free
[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend QNN-NPU: 532 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.96     0.64     0.75     0.27 
   -0.10     0.59    -0.70     0.20 
    0.78     0.98    -0.46     0.33 
   -0.01     0.72     0.78     0.79 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.87     0.89     0.76     0.94 
    0.22    -0.88    -0.63     0.80 
   -0.32     0.16     0.53     0.53 
   -0.78     0.13    -0.04    -0.34 

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6330]: error = 0

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6333]: output matrix:
[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -1.83     1.53     1.52     1.20 
    0.12    -0.29    -1.33     1.00 
    0.45     1.14     0.07     0.86 
   -0.80     0.85     0.75     0.45 

[test-qnn-npu.cpp, qnn_finalize, 4886]: succeed to close rpcmem lib

[info, 161]: duration of qnn_nputest_2_ADD : 233 milliseconds
[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6357]: leave qnn_rpc_test
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 

[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend ggml: 3 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test

How to verify the QNN backend or participate in development of the GGML QNN backend

This PR provides a dedicated Android command line program and scripts for unit testing on an Android device.


 cd tests/ggml-qnn/
./ggml-qnn-ut-build-run.sh  -h              (show usage)
./ggml-qnn-ut-build-run.sh  help            (show usage)
./ggml-qnn-ut-build-run.sh  build           (build Android command line UT program)
./ggml-qnn-ut-build-run.sh  updateqnnlibs   (upload the latest QNN libs to Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  0  (run UT program and verify QNN CPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  1  (run UT program and verify QNN GPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  2  (run UT program and verify QNN NPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  3  (compare performance between QNN backend and original ggml on Android phone)
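
If you want to run the pushed UT binary by hand, something like the following should work; the binary name and library location come from the log above, while the -t/-b flags follow the later review comments, so treat the exact invocation as an assumption:

    adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=/data/local/tmp && ./ggml-qnn-test -t GGML_OP_ADD -b 0"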

A suitable, qualified reviewer should be familiar with the source code of ggml and with the Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or other parts of Qualcomm's AI software stack. Hardcore AI skills are a plus (adding more quantized data types and implementing more GGML OPs/kernels requires them) but are not essential for this PR. Some notes for potential reviewers:

  • Programming-language details are not the key point of this PR. Language details do matter and I will handle them properly as much as possible (this PR follows the coding style of upstream llama.cpp as strictly as possible), but please do not spend too much time on them: code format, code alignment, variable names, function names, unused variables, unused functions, compiler warnings, C++ grammar/syntax in so-called modern C++11/14/17/20, and so on.
  • Fixes for issues/bugs in upstream llama.cpp should be submitted as PRs to upstream llama.cpp (this is why familiarity with the ggml source code is an essential prerequisite for a suitable reviewer).
  • This PR deliberately avoids bringing in too many complex new features; an MVP (Minimum Viable PR) style PR is more likely to be accepted by the maintainers of the ggml community.
  • Please focus on the real key point of this PR: how to utilize the Hexagon NPU maximally within the well-designed, compact ggml machine learning framework.

Any GGML community developer or AI expert interested in the GGML QNN backend can use or extend the dedicated Android command line program to verify the backend. Reviews are greatly welcomed and appreciated.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 59e42f8 to b0c3013 Compare April 24, 2024 10:26
github-actions bot (Contributor) commented Apr 24, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8677.33ms p(95)=20035.75ms fails=, finish reason: stop=492 truncated=48
  • Prompt processing (pp): avg=95.63tk/s p(95)=443.17tk/s
  • Token generation (tg): avg=47.46tk/s p(95)=47.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=qualcomm_qnn_backend_for_ggml commit=a98a4e999000105b81b472c7b36ff80131d68ef1

prompt_tokens_seconds: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:prompt_tokens_seconds (data series omitted)

predicted_tokens_seconds: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:predicted_tokens_seconds (data series omitted)

Details

kv_cache_usage_ratio: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:kv_cache_usage_ratio (data series omitted)

requests_processing: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:requests_processing (data series omitted)

@Dampfinchen

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

@zhouwg (Contributor, Author) commented Apr 24, 2024

> 📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 198 iterations 🚀
> Expand details for performance related PR only

> Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.
>
> This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

Thanks for your comment. This PR is a very initial implementation and could be a good starting point for Qualcomm's QNN backend for GGML. It would be better if some domain experts from Qualcomm got involved in this effort after it is accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 8ade7aa to f9e1b9a Compare April 25, 2024 04:14
@zhouwg zhouwg mentioned this pull request Apr 25, 2024
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from 5abb2e4 to 7a420e1 Compare April 25, 2024 08:11
@zhouwg zhouwg changed the title ggml: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend Apr 25, 2024
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 95a980a to b0c3013 Compare April 25, 2024 09:03
@ggerganov (Owner) commented:

> Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

Yes, it would be useful to have an example or instructions on how to run this. In the meantime, simply setting up test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators.

@zhouwg (Contributor, Author) commented Apr 25, 2024

> Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

> Yes, it would be useful to have an example or instructions on how to run this. In the meantime, simply setting up test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators.

Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend.

@slaren (Collaborator) commented Apr 25, 2024

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

GGML_CALL static void ggml_backend_registry_init(void) {
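
For context, registration at that point in ggml's history looked roughly like the sketch below. ggml_backend_register is the existing registry call; the QNN init function and buffer-type getter are hypothetical names standing in for whatever this backend exposes:

    GGML_CALL static void ggml_backend_registry_init(void) {
        // ... existing registrations (CPU, CUDA, ...) ...
    #ifdef GGML_USE_QNN
        // hypothetical: register QNN device 0 (QNN-CPU); user_data carries the
        // device index, and the init fn must match ggml_backend_init_fn,
        // i.e. ggml_backend_t (*)(const char * params, void * user_data)
        ggml_backend_register("QNN-CPU", ggml_backend_reg_qnn_init,
                              ggml_backend_qnn_buffer_type(0), (void *) (intptr_t) 0);
    #endif
    }

With that in place, test-backend-ops enumerates the registry and runs its op tests against every registered backend, which is what makes it a convenient way to find missing operators.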

@zhouwg (Contributor, Author) commented Apr 25, 2024

> You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.
>
> GGML_CALL static void ggml_backend_registry_init(void) {

Thanks for your help, it's really helpful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android.

@zhouwg (Contributor, Author) commented Apr 25, 2024

@ggerganov, @slaren, I'm sorry to interrupt you. Adapting test-backend-ops.cpp to the QNN backend is already done and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3).

Could you take a moment to look at it? Thanks.

BTW, the design and implementation of test-backend-ops.cpp is really excellent. I never noticed this file/feature before.

BTW, should the README-qnn.md be removed?

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from eff9669 to 180ab5f Compare April 25, 2024 15:47
tests/test-backend-ops.cpp: review thread (outdated, resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from 992cf05 to 67beeb6 Compare April 26, 2024 02:12
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch from 375b5e5 to fdf0272 Compare June 9, 2024 01:06
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from dafa5f1 to 3e8b61f Compare June 9, 2024 15:49
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from a98a4e9 to d38d4a6 Compare June 10, 2024 12:07
@chraac commented Jun 11, 2024

Thanks for the fix, good job! Now working on running this branch on my phone! Will leave a note here if I have any problems!

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 9e1009c to 5f8cfe4 Compare June 11, 2024 15:04
ggml-qnn.cpp: review thread (outdated, resolved)
ggml-qnn.cpp: review thread (resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from 5a65c86 to 5269e08 Compare June 12, 2024 08:30
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from c42d045 to faaa86b Compare June 13, 2024 07:41
qnn_instance * instance = nullptr;
std::string graph_name = "ggml_op_qnn_add";
Qnn_GraphHandle_t graph_handle = nullptr;
Qnn_Tensor_t * tensor_0 = nullptr;

Created a PR on your fork to simplify the binding from Qnn_Tensor_t to ggml_tensor; please have a look if you have time: zhouwg#2
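
One fiddly part of that Qnn_Tensor_t/ggml_tensor binding is the rank mapping: ggml always carries GGML_MAX_DIMS dimensions padded with 1s, while QNN wants the effective rank. A minimal sketch consistent with the get_tensor_rank lines in the logs above (a 4x4x1x1 tensor reports rank 2); the helper name is hypothetical:

    // count the dimensions actually used; ggml pads ne[] with trailing 1s
    static uint32_t qnn_get_tensor_rank(const struct ggml_tensor * tensor) {
        uint32_t rank = 0;
        for (int i = 0; i < GGML_MAX_DIMS; i++) {
            if (tensor->ne[i] > 1) {
                rank++;
            }
        }
        return rank > 0 ? rank : 1; // a single-element tensor still has rank 1
    }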

* mul_mat_f16_f32: src0 is F16 and src1 is F32.
* mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
*/
static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,
@chraac commented Jun 17, 2024

Also found what may be a bug on this branch when trying to do mul_mat with the GPU backend on my 8 Gen 2 phone; command line:
ggml-qnn-ut -t GGML_OP_MUL_MAT -b 1

[screenshot]
As you can see, it generates a wrong dst matrix.

When running with the CPU backend, the result is correct:
[screenshot]

@chraac commented Jun 17, 2024

Looks like graphExecute failed with error 6004; maybe we can use that to find the root cause here.

@chraac commented Jun 17, 2024

To reproduce, you could use my patch to constant-initialize the test tensors:

llama.cpp-5e18cdc-init the test array with const values.patch

It just changes the tensor init in the unit test so that we can reproduce the issue more easily.

@myan-o commented Jun 18, 2024

I tried building in Termux. Can't the /data/local/tmp path be changed? The Skel.so path cannot be changed for the NPU, and loading fails.

Labels

  • devops - improvements to build systems and github actions
  • enhancement - New feature or request
  • ggml - changes relating to the ggml tensor library for machine learning
  • Qualcomm QNN - Qualcomm's QNN (AI Direct Engine) SDK
  • Review Complexity : High - Generally require in-depth knowledge of LLMs or GPUs
  • testing - Everything test related