ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend #6869

Open: wants to merge 17 commits into master from qualcomm_qnn_backend_for_ggml

Conversation

@zhouwg (Contributor) commented Apr 24, 2024

Self Reported Review Complexity

  • Review Complexity : Low
  • Review Complexity : Medium
  • Review Complexity : High
  • I have read the contributing guidelines

Purpose

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with a market share of 70.1 percent.

Qualcomm is currently the leading mobile SoC semiconductor company (MediaTek was No. 1 by market share in Q1 2024, but I personally think Qualcomm is the real No. 1 mobile SoC vendor). The Hexagon NPU in the Qualcomm Snapdragon 8 Gen 3 was designed for generative AI, delivering 98% faster performance and 40% improved performance-per-watt for sustained AI inferencing, which makes the Hexagon NPU a leading processor for on-device AI inferencing.

The QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

  • TensorFlow: tf-1.15.0, or tf-2.10.1
  • TFLite: tflite-2.3.0
  • PyTorch: torch-1.13.1
  • ONNX: onnx-1.11.0

ggml is a very compact, well-designed, highly optimized C/C++ machine learning framework/library. This PR aims to add a Qualcomm QNN backend for ggml and focuses on one question accordingly: how to utilize the Hexagon NPU maximally within this compact, well-designed framework.

Status

The data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs.

[attachment 319780607]

[attachment 504893116]

A 4x performance gain for GGML_OP_MUL_MAT was measured using the QNN CPU backend with one thread on a high-end Android phone equipped with a flagship Qualcomm Snapdragon 8 Gen 3 mobile SoC (released in October 2023). GGML_OP_MUL_MAT performance should improve much more with the QNN NPU (aka Hexagon Tensor Processor) backend once we understand the secrets of Qualcomm's NPU (QNN RPC, multithreading in the NPU backend, ...).

[attachment 1922265373]

[attachment 250505401]

A dedicated Android command line program (for unit testing) works as expected on a high-end Android phone with the Qualcomm SM8650-AB Snapdragon 8 Gen 3 and on low-end Android phones with other low-end Qualcomm mobile SoCs (the QNN NPU backend does not work on low-end Qualcomm phones).
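
For orientation, the log below corresponds roughly to the following flow inside the UT program. This is a minimal sketch, not the PR's actual source: ggml_backend_qnn_init and its device/lib-path arguments match the log output, everything else is the standard ggml backend API, and the exact argument order of the init call is an assumption.

    #include "ggml.h"
    #include "ggml-alloc.h"
    #include "ggml-backend.h"
    #include "ggml-qnn.h"  // assumed backend header from this PR

    static void run_qnn_ut(void) {
        // device 0 = QNN-CPU, 1 = QNN-GPU, 2 = QNN-NPU (see the script options below)
        ggml_backend_t backend = ggml_backend_qnn_init(0, "/data/local/tmp/");

        // "Allocating Memory of size 33554432 bytes, 32 MB" in the log
        struct ggml_init_params params = {
            /*.mem_size   =*/ 32 * 1024 * 1024,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,  // tensor data will live in a backend buffer
        };
        struct ggml_context * ctx = ggml_init(params);

        // "creating new tensors" / "creating backend buffer" / "creating compute graph"
        struct ggml_tensor * src0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
        struct ggml_tensor * src1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
        struct ggml_tensor * dst  = ggml_add(ctx, src0, src1);  // or ggml_mul_mat(ctx, src0, src1)

        ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, backend);
        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, dst);
        ggml_backend_graph_compute(backend, gf);  // dispatched via ggml_qnn_can_handle_op, as seen in the log

        ggml_backend_buffer_free(buffer);
        ggml_free(ctx);
        ggml_backend_free(backend);
    }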
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 16.3 MB/s (4567168 bytes in 0.267s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:2(ADD)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a2a43bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:ADD, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_add, 2574]: call ggml_qnn_add
    
    [ggml_qnn_add, 2578]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2582]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2586]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_add, 2587]: 4, 4, 1, 1
    [ggml_qnn_add, 2588]: tensor0 name tensor_0
    [ggml_qnn_add, 2589]: tensor1 name tensor_1
    [ggml_qnn_add, 2590]: tensor2 name tensor_2
    [ggml_qnn_add, 2617]: graph name ggml_op_qnn_add_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     11.5ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseAdd 
    [ggml_qnn_logcallback, 2165]:     11.5ms [VERBOSE] validate	Node-Type : ElementWiseAdd	Node-Name : ggml_op_add 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     11.7ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     11.7ms [  INFO ] CpuGraph::execute 
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.84     0.23    -0.07    -0.25 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.10    -0.32    -0.96     0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.63    -0.59     0.29    -1.00 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.01     0.10     0.92     0.54 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.99    -0.43    -0.41    -0.44 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.06     0.64    -0.61    -0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.86    -0.11     0.41     0.27 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.54    -0.70    -0.90    -0.13 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.15    -0.19    -0.48    -0.69 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.04     0.32    -1.57    -0.70 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -1.49    -0.70     0.70    -0.73 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.53    -0.60     0.02     0.42 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a54a5b40bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_hanlde_op, 2993]: call ggml_qnn_hanlde_op
    
    [ggml_qnn_hanlde_op, 2997]:        tensor_0: type = 0 (  f32)  ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3001]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3005]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_hanlde_op, 3006]: 4, 4, 1, 1
    [ggml_qnn_hanlde_op, 3007]: tensor0 name tensor_0
    [ggml_qnn_hanlde_op, 3008]: tensor1 name tensor_1
    [ggml_qnn_hanlde_op, 3009]: tensor2 name tensor_2
    [ggml_qnn_hanlde_op, 3033]: qnn graph name ggml_qnn_graph_MUL1tensor_0_tensor_1
    [ggml_qnn_hanlde_op, 3034]: qnn op_config name ggml_qnn_op_config_MUL1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     17.7ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : ElementWiseMultiply 
    [ggml_qnn_logcallback, 2165]:     17.8ms [VERBOSE] validate	Node-Type : ElementWiseMultiply	Node-Name : ggml_qnn_op_config_MUL1tensor_0_tensor_1 
    [ggml_qnn_logcallback, 2165]:     18.0ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     18.1ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     18.1ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_hanlde_op, 3134]: duration of ggml_qnn_MUL : 0 milliseconds
    
    [ggml_qnn_hanlde_op, 3135]: call ggml_qnn_hanlde_op done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.62     0.59    -0.34     0.40 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.81     0.33     0.52     0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.37     0.43     0.97     0.06 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.28     0.09    -0.57    -0.02 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.24    -0.57    -0.17     0.36 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.83    -0.64     0.23    -0.87 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.25    -0.31     0.55     0.64 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.42     0.42     0.96     0.88 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.15    -0.34     0.06     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.67    -0.21     0.12    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.09    -0.13     0.53     0.04 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.12     0.04    -0.55    -0.01 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:MUL
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
    /data/local/tmp//libQnnCpu.so
    QNN libs already exist on Android phone
    ggml-qnn-test: 1 file pushed. 20.3 MB/s (4567168 bytes in 0.215s)
    [main, 344]: enter qnn_ggml_op
    
    [main, 345]: ggml op:23(MUL_MAT)
    [main, 359]: Allocating Memory of size 33554432 bytes, 32 MB
    
    [ggml_backend_qnn_init, 3955]: device 0
    [ggml_backend_qnn_init, 3956]: qnn_lib_path /data/local/tmp/
    [qnn_init, 2172]: enter qni_init
    
    [load_system, 2033]: system_lib_path:/data/local/tmp/libQnnSystem.so
    
    [load_system, 2082]: find a valid qnn system interface
    
    [load_system, 2092]: initialize qnn system successfully
    
    [qnn_init, 2180]: load QNN system lib successfully
    
    [load_backend, 1911]: lib_path:/data/local/tmp/libQnnCpu.so
    
    [load_backend, 1935]: num_providers=1
    
    [load_backend, 1960]: find a valid qnn interface
    
    [load_backend, 2005]: saver_initialize is null
    
    [qnn_init, 2213]: initialize qnn log successfully
    
    [qnn_init, 2224]: initialize qnn backend successfully
    
    [qnn_init, 2230]: device property is not supported
    
    [qnn_init, 2241]: create device successfully
    
    [qnn_init, 2245]: profiling turned on; level = 2
    [qnn_init, 2256]: detailed profiling requested. Creating Qnn Profile object
    
    [qnn_init, 2262]: initialize qnn profile successfully
    
    [qnn_init, 2272]: load rpcmem lib successfully
    
    [qnn_init, 2299]: initialize qnn context successfully
    
    [qnn_init, 2302]: leave qni_init
    
    [ggml_backend_qnn_init, 4011]: qnn device name QNN-CPU
    [init_qnn_graph, 2406]: succeed to create graph QNN-CPU, 0xd4a50a2049bcdc2f
    
    [main, 395]: creating new tensors
    
    [main, 396]: ggml_blck_size(f32) 1
    [main, 397]: ggml_type_size(f32) 4
    [main, 436]: creating backend buffer
    
    [main, 448]: creating compute graph
    
    [ggml_qnn_can_handle_op, 2458]: op name:MUL_MAT, tensor type:f32
    [ggml_qnn_can_handle_op, 2460]: src0 type:f32
    [ggml_qnn_can_handle_op, 2463]: src1 type:f32
    [ggml_qnn_can_handle_op, 2467]: GGML_OP_MUL_MAT
    [ggml_qnn_can_handle_op, 2472]: src0        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2477]: src1        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_can_handle_op, 2483]:             tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2785]: call ggml_qnn_mul_mat
    
    [ggml_qnn_mul_mat, 2789]:        tensor_0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2793]:        tensor_1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2797]:        tensor_2: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    
    [ggml_qnn_mul_mat, 2798]: 4, 4, 1, 1
    [ggml_qnn_mul_mat, 2799]: tensor0 name tensor_0
    [ggml_qnn_mul_mat, 2800]: tensor1 name tensor_1
    [ggml_qnn_mul_mat, 2801]: tensor2 name tensor_2
    [ggml_qnn_mul_mat, 2828]: graph name ggml_op_qnn_mul_mat_1tensor_0_tensor_1
    [ggml_qnn_logcallback, 2165]:     16.9ms [ DEBUG ] getNode OpPackage-Name : qti.aisw Node-Type : MatMul 
    [ggml_qnn_logcallback, 2165]:     17.0ms [VERBOSE] validate	Node-Type : MatMul	Node-Name : ggml_op_mul_mat 
    [ggml_qnn_logcallback, 2165]:     17.1ms [  INFO ] CpuGraph::finalize 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 1 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 2 
    [ggml_qnn_logcallback, 2165]:     17.2ms [ DEBUG ] Setting data pointer for tensor ID: 3 
    [ggml_qnn_logcallback, 2165]:     17.2ms [  INFO ] CpuGraph::execute 
    [ggml_qnn_mul_mat, 2927]: duration of ggml_qnn_mul_mat : 10 milliseconds
    
    [ggml_qnn_mul_mat, 2928]: call ggml_qnn_mul_mat done
    
    [get_tensor_rank, 210]: tensor->rank 4
    
    [get_tensor_rank, 211]: get_tensor_rank 2
    
    [get_tensor_data_size, 223]: get_tensor_data_size 64
    [get_tensor_data_size, 224]: ggml_nbytes(tensor) 64
    [main, 464]: dump:
    
    [tensor_dump, 191]: dump ggml tensor src0(tensor_0)
    [tensor_dump, 195]:            src0: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.05     0.68    -0.27    -0.28 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.47     0.77     0.41     0.14 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.69    -0.71    -0.81    -0.23 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.37     0.36    -0.26     0.61 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor src1(tensor_1)
    [tensor_dump, 195]:            src1: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:    -0.48    -0.81    -0.61     0.53 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.04     0.87     0.64     0.17 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.22     0.94    -0.38    -0.78 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.97    -0.94    -0.35     0.94 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [tensor_dump, 191]: dump ggml tensor dst(tensor_2)
    [tensor_dump, 195]:             dst: type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
    [tensor_sum_elements, 151]:     0.97    -0.79    -0.47     0.98 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:    -0.33     0.24     0.56    -0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.16    -0.20     0.95    -0.08 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 151]:     0.48     0.09    -0.20     0.80 
    [tensor_sum_elements, 155]: 
    
    [tensor_sum_elements, 185]: 
    
    [tensor_dump, 198]: 
    
    [ggml_backend_qnn_free, 3753]: enter ggml_backend_qnn_free
    [ggml_backend_qnn_free, 3755]: idx 0, name:qnn-cpu
    [ggml_backend_qnn_free, 3764]: graph type:ADD
    [qnn_finalize, 2318]: succeed to close rpcmem lib
    
    [ggml_backend_qnn_free, 3786]: leave ggml_backend_qnn_free
    
QNN's RPC feature (useful for the QNN NPU, aka HTP/DSP, backend) is used in this PR and works as expected. On a Snapdragon 8 Gen 3 Android phone, more than 2 GB of ION memory is available for offloading the ggml tensors of a cgraph to the NPU.
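
For reference, the RPC path boils down to allocating tensor buffers from ION shared memory via the rpcmem API (libcdsprpc.so from the Hexagon/FastRPC stack) and then registering the resulting file descriptor with the QNN context, so the NPU can read the data without extra copies. Below is a minimal sketch of the allocation half, assuming the rpcmem API as it appears in the Hexagon SDK headers; the heap id and flag values are assumptions:

    #include <dlfcn.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef void * (*pfn_rpc_mem_alloc)(int heapid, uint32_t flags, int size);
    typedef int    (*pfn_rpc_mem_to_fd)(void * buf);

    // allocate an ION buffer that can be shared with the Hexagon NPU; the
    // returned fd is what later gets registered with the QNN context
    static void * alloc_rpcmem(size_t nbytes, int * out_fd) {
        void * lib = dlopen("libcdsprpc.so", RTLD_NOW | RTLD_LOCAL);
        if (lib == NULL) return NULL;
        pfn_rpc_mem_alloc rpcmem_alloc = (pfn_rpc_mem_alloc) dlsym(lib, "rpcmem_alloc");
        pfn_rpc_mem_to_fd rpcmem_to_fd = (pfn_rpc_mem_to_fd) dlsym(lib, "rpcmem_to_fd");
        if (rpcmem_alloc == NULL || rpcmem_to_fd == NULL) return NULL;

        const int      RPCMEM_HEAP_ID_SYSTEM = 25; // assumption: system heap id
        const uint32_t RPCMEM_DEFAULT_FLAGS  = 1;  // assumption: cached mapping
        void * buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) nbytes);
        if (buf != NULL) {
            *out_fd = rpcmem_to_fd(buf);
        }
        return buf;
    }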
This PR is a functional, Minimum Viable PR (MVP) style contribution to the ggml community. If it can be approved and merged to the master branch, it will make it much easier for other community developers and AI experts to contribute code and ideas to the GGML QNN backend. Together we could reach the final target: utilize the Hexagon NPU maximally within the well-designed, compact ggml machine learning framework. This might be the exact GGML way in the GGML community.

Todo

Qualcomm's QNN backend for GGML still has some TODO items before it can be used in real commercial applications; the NPU unit-test output below illustrates the current gap (GGML_OP_ADD takes 532 milliseconds on the QNN-NPU backend versus 3 milliseconds with original ggml):
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.16     0.85    -0.80    -0.25 
   -0.28     0.66     0.98     0.67 
   -0.15     0.78    -0.45    -0.50 
    0.92     0.31    -0.72    -0.46 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.53     0.86    -0.91    -0.27 
    0.62     0.35    -0.27     0.43 
    0.73     0.42    -0.81    -0.24 
    0.49     0.81    -0.88     0.64 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
    0.69     1.70    -1.70    -0.52 
    0.34     1.02     0.71     1.10 
    0.58     1.19    -1.26    -0.74 
    1.41     1.12    -1.60     0.18 

[ggml_backend_qnn_free, 3286]: enter ggml_backend_qnn_free
[ggml_backend_qnn_free, 3288]: idx 2, name:qnn-npu
[ggml_backend_qnn_free, 3300]: graph type:ADD
[qnn_finalize, 1258]: succeed to close rpcmem lib

[ggml_backend_qnn_free, 3313]: leave ggml_backend_qnn_free
[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend QNN-NPU: 532 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.96     0.64     0.75     0.27 
   -0.10     0.59    -0.70     0.20 
    0.78     0.98    -0.46     0.33 
   -0.01     0.72     0.78     0.79 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -0.87     0.89     0.76     0.94 
    0.22    -0.88    -0.63     0.80 
   -0.32     0.16     0.53     0.53 
   -0.78     0.13    -0.04    -0.34 

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6330]: error = 0

[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6333]: output matrix:
[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   -1.83     1.53     1.52     1.20 
    0.12    -0.29    -1.33     1.00 
    0.45     1.14     0.07     0.86 
   -0.80     0.85     0.75     0.45 

[test-qnn-npu.cpp, qnn_finalize, 4886]: succeed to close rpcmem lib

[info, 161]: duration of qnn_nputest_2_ADD : 233 milliseconds
[test-qnn-npu.cpp, qnn_test_qnnnpu_2, 6357]: leave qnn_rpc_test
[qnn_op_ut, 2037]: dump tensors:
[tensor_dump, 1404]: dump ggml tensor src0(tensor_0): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 
   59.00    59.00    59.00    59.00 

[tensor_dump, 1404]: dump ggml tensor src1(tensor_1): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 
   94.00    94.00    94.00    94.00 

[tensor_dump, 1404]: dump ggml tensor dst(tensor_2): type = 0 (  f32) ne =     4 x     4 x     1, nb = (    4,    16,    64)
[tensor_dump, 1466]: (4x4 in 4x4)
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 
  153.00   153.00   153.00   153.00 

[qnn_op_ut, 2067]: duration of ut GGML_OP_ADD using QNN backend ggml: 3 milliseconds
[test-qnn-npu.cpp, qnn_op_ut, 2068]: leave qnn_op_test

How to verify the QNN backend or participate in development of the GGML QNN backend

This PR provides a dedicated Android command line program and scripts for unit testing on an Android device.


 cd tests/ggml-qnn/
./ggml-qnn-ut-build-run.sh  -h              (show usage)
./ggml-qnn-ut-build-run.sh  help            (show usage)
./ggml-qnn-ut-build-run.sh  build           (build Android command line UT program)
./ggml-qnn-ut-build-run.sh  updateqnnlibs   (upload the latest QNN libs to Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  0  (run UT program and verify QNN CPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  1  (run UT program and verify QNN GPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  2  (run UT program and verify QNN NPU backend on Android phone)
./ggml-qnn-ut-build-run.sh  GGML_OP_ADD  3  (compare performance between QNN backend and original ggml on Android phone)
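
If you want to run the pushed UT binary by hand, something like the following should work; the binary name and library location come from the log above, while the -t/-b flags follow the later review comments, so treat the exact invocation as an assumption:

    adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=/data/local/tmp && ./ggml-qnn-test -t GGML_OP_ADD -b 0"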

A suitable, qualified reviewer should be familiar with the source code of ggml and with the Qualcomm QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK or other parts of Qualcomm's AI software stack. Hardcore AI skills are a plus (adding more quantized data types and implementing more GGML OPs/kernels requires them) but are not essential for this PR. Some notes for potential reviewers:

  • Programming-language details are not the key point of this PR. Language details do matter and I will handle them properly as much as possible (this PR follows the coding style of upstream llama.cpp as strictly as possible), but please do not spend too much time on them: code format, code alignment, variable names, function names, unused variables, unused functions, compiler warnings, C++ grammar/syntax in so-called modern C++11/14/17/20, and so on.
  • Fixes for issues/bugs in upstream llama.cpp should be submitted as PRs to upstream llama.cpp (this is why familiarity with the ggml source code is an essential prerequisite for a suitable reviewer).
  • This PR deliberately avoids bringing in too many complex new features; an MVP (Minimum Viable PR) style PR is more likely to be accepted by the maintainers of the ggml community.
  • Please focus on the real key point of this PR: how to utilize the Hexagon NPU maximally within the well-designed, compact ggml machine learning framework.

Any GGML community developer or AI expert interested in the GGML QNN backend can use or extend the dedicated Android command line program to verify the backend. Reviews are greatly welcomed and appreciated.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 59e42f8 to b0c3013 Compare April 24, 2024 10:26
github-actions bot (Contributor) commented Apr 24, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 540 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8677.33ms p(95)=20035.75ms fails=, finish reason: stop=492 truncated=48
  • Prompt processing (pp): avg=95.63tk/s p(95)=443.17tk/s
  • Token generation (tg): avg=47.46tk/s p(95)=47.64tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=qualcomm_qnn_backend_for_ggml commit=a98a4e999000105b81b472c7b36ff80131d68ef1

prompt_tokens_seconds: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:prompt_tokens_seconds (data series omitted)

predicted_tokens_seconds: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:predicted_tokens_seconds (data series omitted)

Details

kv_cache_usage_ratio: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:kv_cache_usage_ratio (data series omitted)

requests_processing: xychart "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 540 iterations", y-axis llamacpp:requests_processing (data series omitted)

@Dampfinchen

Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.

This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

@zhouwg (Contributor, Author) commented Apr 24, 2024

> 📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 198 iterations 🚀
> Expand details for performance related PR only

> Nice. With competent LLMs getting smaller and more efficient as well as Snapdragon laptops coming soon, it's important to make full use of the AI acceleration these SoCs provide through the Hexagon NPU Cluster.
>
> This will make llama.cpp a robust backend for the future and will lead to power efficient LLMs on the go. Personally, I really can't wait!

Thanks for your comment. This PR is a very initial implementation and could be a good starting point for Qualcomm's QNN backend for GGML. It would be better if some domain experts from Qualcomm got involved in this effort after it is accepted by the community. I personally think this PR is also an example of the GGML way: try crazy ideas, build wild demos, and push the edge of what's possible.

Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 8ade7aa to f9e1b9a Compare April 25, 2024 04:14
@zhouwg zhouwg mentioned this pull request Apr 25, 2024
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from 5abb2e4 to 7a420e1 Compare April 25, 2024 08:11
@zhouwg zhouwg changed the title ggml: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend ggml-qnn: add Qualcomm QNN(Qualcomm Neural Network,aka Qualcomm AI Engine Direct) backend Apr 25, 2024
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 95a980a to b0c3013 Compare April 25, 2024 09:03
@ggerganov (Owner) commented:

> Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

Yes, it would be useful to have an example or instructions on how to run this. In the meantime, simply setting up test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators.

@zhouwg (Contributor, Author) commented Apr 25, 2024

> Another thing: a small, standalone Android example (or reuse of the existing Android example in llama.cpp) is needed to make it easier for community developers to participate in developing and verifying the QNN backend.

> Yes, it would be useful to have an example or instructions on how to run this. In the meantime, simply setting up test-backend-ops to run with ggml-qnn would be a good start for people who want to implement the missing operators.

Thanks for your guidance. I'll study how to use test-backend-ops.cpp to validate the QNN backend.

@slaren (Collaborator) commented Apr 25, 2024

You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.

GGML_CALL static void ggml_backend_registry_init(void) {
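
For context, registration at that point in ggml's history looked roughly like the sketch below. ggml_backend_register is the existing registry call; the QNN init function and buffer-type getter are hypothetical names standing in for whatever this backend exposes:

    GGML_CALL static void ggml_backend_registry_init(void) {
        // ... existing registrations (CPU, CUDA, ...) ...
    #ifdef GGML_USE_QNN
        // hypothetical: register QNN device 0 (QNN-CPU); user_data carries the
        // device index, and the init fn must match ggml_backend_init_fn,
        // i.e. ggml_backend_t (*)(const char * params, void * user_data)
        ggml_backend_register("QNN-CPU", ggml_backend_reg_qnn_init,
                              ggml_backend_qnn_buffer_type(0), (void *) (intptr_t) 0);
    #endif
    }

With that in place, test-backend-ops enumerates the registry and runs its op tests against every registered backend, which is what makes it a convenient way to find missing operators.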

@zhouwg (Contributor, Author) commented Apr 25, 2024

> You would need to modify ggml_backend_registry_init to register the backend, then it should be automatically used by test-backend-ops.
>
> GGML_CALL static void ggml_backend_registry_init(void) {

Thanks for your help, it's really helpful. I'm working on adapting test-backend-ops.cpp to the QNN backend on Android.

@zhouwg (Contributor, Author) commented Apr 25, 2024

@ggerganov, @slaren, I'm sorry to interrupt you. Adapting test-backend-ops.cpp to the QNN backend is already done and it works as expected on a Xiaomi 14 (Qualcomm SM8650-AB Snapdragon 8 Gen 3).

Could you take a moment to look at it? Thanks.

BTW, the design and implementation of test-backend-ops.cpp is really excellent. I never noticed this file/feature before.

BTW, should the README-qnn.md be removed?

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from eff9669 to 180ab5f Compare April 25, 2024 15:47
tests/test-backend-ops.cpp: review thread (outdated, resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from 992cf05 to 67beeb6 Compare April 26, 2024 02:12
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch from 375b5e5 to fdf0272 Compare June 9, 2024 01:06
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from dafa5f1 to 3e8b61f Compare June 9, 2024 15:49
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 4 times, most recently from a98a4e9 to d38d4a6 Compare June 10, 2024 12:07
@chraac commented Jun 11, 2024

Thanks for the fix, good job! Now working on running this branch on my phone! Will leave a note here if I have any problems!

@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 3 times, most recently from 9e1009c to 5f8cfe4 Compare June 11, 2024 15:04
ggml-qnn.cpp: review thread (outdated, resolved)
ggml-qnn.cpp: review thread (resolved)
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from 5a65c86 to 5269e08 Compare June 12, 2024 08:30
@zhouwg zhouwg force-pushed the qualcomm_qnn_backend_for_ggml branch 2 times, most recently from c42d045 to faaa86b Compare June 13, 2024 07:41
qnn_instance * instance = nullptr;
std::string graph_name = "ggml_op_qnn_add";
Qnn_GraphHandle_t graph_handle = nullptr;
Qnn_Tensor_t * tensor_0 = nullptr;

Created a PR on your fork to simplify the binding from Qnn_Tensor_t to ggml_tensor; please have a look if you have time: zhouwg#2
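
One fiddly part of that Qnn_Tensor_t/ggml_tensor binding is the rank mapping: ggml always carries GGML_MAX_DIMS dimensions padded with 1s, while QNN wants the effective rank. A minimal sketch consistent with the get_tensor_rank lines in the logs above (a 4x4x1x1 tensor reports rank 2); the helper name is hypothetical:

    // count the dimensions actually used; ggml pads ne[] with trailing 1s
    static uint32_t qnn_get_tensor_rank(const struct ggml_tensor * tensor) {
        uint32_t rank = 0;
        for (int i = 0; i < GGML_MAX_DIMS; i++) {
            if (tensor->ne[i] > 1) {
                rank++;
            }
        }
        return rank > 0 ? rank : 1; // a single-element tensor still has rank 1
    }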

* mul_mat_f16_f32: src0 is F16 and src1 is F32.
* mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.
*/
static void ggml_qnn_mul_mat(ggml_backend_qnn_context * ctx,
@chraac commented Jun 17, 2024

Also found what may be a bug on this branch when trying to do mul_mat with the GPU backend on my 8 Gen 2 phone; command line:
ggml-qnn-ut -t GGML_OP_MUL_MAT -b 1

[screenshot]
As you can see, it generates a wrong dst matrix.

When running with the CPU backend, the result is correct:
[screenshot]

@chraac commented Jun 17, 2024

Looks like graphExecute failed with error 6004; maybe we can use that to find the root cause here.

@chraac commented Jun 17, 2024

To reproduce, you could use my patch to constant-initialize the test tensors:

llama.cpp-5e18cdc-init the test array with const values.patch

It just changes the tensor init in the unit test so that we can reproduce the issue more easily.

@myan-o commented Jun 18, 2024

I tried building in Termux. Can't the /data/local/tmp path be changed? The Skel.so path cannot be changed for the NPU, and loading fails.

Labels

  • devops - improvements to build systems and github actions
  • enhancement - New feature or request
  • ggml - changes relating to the ggml tensor library for machine learning
  • Qualcomm QNN - Qualcomm's QNN (AI Direct Engine) SDK
  • Review Complexity : High - Generally require in-depth knowledge of LLMs or GPUs
  • testing - Everything test related