
ggml:add new member in GGML's internal data structure #2073

Closed
zhouwg wants to merge 2 commits

Conversation

zhouwg (Contributor) commented on Apr 17, 2024

Purpose

This PR serves multiple purposes:

(1) Borrow some advantages from Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK. It is a well-designed and concise SDK, and its API maps closely onto GGML's existing design (including the existing backend design).

This PR borrows the "rank" member from the QNN SDK, reusing the existing code as much as possible without adding side effects or complexity to it.

(2) Borrow some advantages from PyTorch: the user can specify whether a GGML op (such as mul_mat) is accelerated by a specific backend.

This PR borrows the "use_hwaccel" idea from PyTorch, again reusing the existing code as much as possible without adding side effects or complexity to it.

(3) Cover more scenarios from the upper-layer code (see the "Explanation" section). Using multiple backends simultaneously is not included, because that is a separate topic/scenario.

(4) (The main purpose) prepare for submitting Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend to the upstream GGML community (first in whisper.cpp, then in llama.cpp): PoC: Add Qualcomm mobile SoC native backend for GGML

Status

This PR has been verified/validated in whisper.cpp with the QNN backend on different Android phones (a Qualcomm SoC based low-end phone and a Qualcomm SoC based high-end phone).

Explanation

This PR is useful for several scenarios.

In fact, the "gpu_device" field in struct whisper_context_params is semantically similar to use_hwaccel. There are 2 * n combinations here: 2 (use_gpu: true/false) * n (backend_device: 1..n). Note that a DSP backend is also treated as a "gpu_device" here, because we should reuse the existing code as much as possible without adding side effects or complexity to it.

   struct whisper_context_params {
        bool  use_gpu;
        int   gpu_device;  // CUDA device

        // [EXPERIMENTAL] Token-level timestamps with DTW
        bool dtw_token_timestamps;
        enum whisper_alignment_heads_preset dtw_aheads_preset;

        int dtw_n_top;
        struct whisper_aheads dtw_aheads;

        size_t dtw_mem_size; // TODO: remove
    };

So a special value of "gpu_device" could be interpreted as "no hardware acceleration", falling back to the original default GGML CPU backend.
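Purely as an illustration (the helper names below are invented for this explanation; they are not from whisper.cpp or this PR), that interpretation could look like this, with the 2 * n combinations collapsing to either the CPU path or one of the n devices:

    // hypothetical helpers (invented for this explanation, not from whisper.cpp
    // or this PR): interpret the (use_gpu, gpu_device) pair, treating
    // use_gpu == false or a special negative device id as "no hardware
    // acceleration", i.e. the default GGML CPU backend. a DSP backend is
    // exposed through the same gpu_device id space as the GPUs.
    static bool example_use_hwaccel(bool use_gpu, int gpu_device) {
        return use_gpu && gpu_device >= 0;
    }

    static int example_backend_device(bool use_gpu, int gpu_device) {
        // 2 * n combinations collapse to: CPU (-1) or one of the n devices
        return example_use_hwaccel(use_gpu, gpu_device) ? gpu_device : -1;
    }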

Accordingly, we can reuse the existing "backend" member in the core data structure together with a new member "use_hwaccel" in struct ggml_context. By the way, I personally think we should not remove the existing "backend" member from struct ggml_tensor, although there is a plan to remove it:

  // n-dimensional tensor
    struct ggml_tensor {
        enum ggml_type         type;
        enum ggml_backend_type backend;

        struct ggml_backend_buffer * buffer;

        int64_t ne[GGML_MAX_DIMS]; // number of elements
        size_t  nb[GGML_MAX_DIMS]; // stride in bytes:
                                   // nb[0] = ggml_type_size(type)
                                   // nb[1] = nb[0]   * (ne[0] / ggml_blck_size(type)) + padding
                                   // nb[i] = nb[i-1] * ne[i-1]

        // compute data
        enum ggml_op op;

        // op params - allocated as int32_t for alignment
        int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];

        int32_t flags;

        struct ggml_tensor * grad;
        struct ggml_tensor * src[GGML_MAX_SRC];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        struct ggml_tensor * view_src;
        size_t               view_offs;

        void * data;

        char name[GGML_MAX_NAME];

        void * extra; // extra things e.g. for ggml-cuda.cu

        int32_t rank;

        char padding[20];
    };

The reason for this is that there are some scenarios (not including using multiple backends simultaneously, which is a separate topic/scenario) that "use_gpu & gpu_device" cannot cover.

I personally think these new members ("use_hwaccel" in struct ggml_context and "rank" in struct ggml_tensor) are not redundant, and they will NOT bring side effects to the existing code. Of course, I understand that we should not bring too much extra code (even useful code) into GGML's internal implementation and that we should keep GGML as compact/clean as possible, so this PR reuses existing code to the maximum extent possible:

https://github.com/zhouwg/whisper.cpp/blob/add_hwaccel_in_data_structure/ggml.c#L2995
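As an editorial sketch of the "rank" idea (an assumption for illustration; the actual change lives in the linked branch and is not reproduced here), the new member could be derived from the existing ne[] dimensions, in the spirit of QNN's tensor rank. This is essentially the value the existing ggml_n_dims() helper already computes:

    // editor's sketch (assumption, not the PR's actual diff): derive the new
    // "rank" member from the existing ne[] dimensions, mirroring QNN's notion
    // of tensor rank (number of used dimensions). this matches what the
    // existing ggml_n_dims() helper returns.
    static int32_t example_tensor_rank(const struct ggml_tensor * t) {
        int32_t rank = GGML_MAX_DIMS;
        // trailing dimensions of size 1 do not count towards the rank
        while (rank > 1 && t->ne[rank - 1] == 1) {
            rank--;
        }
        return rank;
    }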

PR approval request

@hey-shashikant, thanks for your time and your approval of #2054 (which is the same as this PR). Could you help take a look at this PR? Thanks so much.

@slaren, sorry to interrupt you, could you help take a look? Thanks.

@ggerganov, sorry to interrupt you, could you help take a look? Thanks.

zhouwg changed the title from "ggml:add hwaccel in data structure" to "ggml:add use_hwaccel in data structure" on Apr 17, 2024
zhouwg changed the title from "ggml:add use_hwaccel in data structure" to "ggml:add new member in data structure" on Apr 20, 2024
zhouwg changed the title from "ggml:add new member in data structure" to "ggml:add new member in GGML's internal data structure" on Apr 20, 2024
slaren (Collaborator) commented on Apr 20, 2024

We have a mechanism in place for using multiple backends simultaneously with ggml_backend_sched. Have you looked into this?

zhouwg (Contributor, Author) commented on Apr 20, 2024

ggml_backend_sched

Thanks for your comment. I'll try it and check whether ggml_backend_sched can cover the above scenarios. Before studying the details of ggml_backend_sched, I think the method in this PR is more clear/straightforward.

slaren (Collaborator) commented on Apr 20, 2024

Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the first moment to support this case. There is a work in progress in ggerganov/llama.cpp#6210 that uses the supports_op function of the backends. Feel free to continue working on that if you find it interesting for your use case, at the moment it is not a priority for me.
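As background, here is a minimal editor's sketch (an illustration based on the shape of the supports_op callback in ggml-backend-impl.h, slightly simplified) of how a backend that implements only a small subset of operations could report that subset, so the scheduler can fall back to the CPU backend for everything else:

    // editor's illustration: a backend that only accelerates a few ops reports
    // them through its supports_op callback; everything else can then be
    // scheduled on another backend (e.g. the CPU backend) by ggml_backend_sched.
    static bool example_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
        (void) backend; // unused in this sketch
        switch (op->op) {
            case GGML_OP_MUL_MAT:
            case GGML_OP_ADD:
                return true;  // offloaded to the accelerator
            default:
                return false; // handled by another backend
        }
    }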

zhouwg (Contributor, Author) commented on Apr 20, 2024

Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the first moment to support this case. There is a work in progress in ggerganov/llama.cpp#6210 that uses the supports_op function of the backends. Feel free to continue working on that if you find it interesting for your use case, at the moment it is not a priority for me.

OK, I see. But I think you might change your mind if you try/validate this PR on a Qualcomm SoC based phone. At the same time, I personally think the method in this PR reuses the existing code as much as possible; it is simple/clear and covers more scenarios in a straightforward way. ggml_backend_sched is an excellent idea, but its design might bring more complicated things into GGML's internals: the API designer/maintainer should not do work that ought to be done in a backend's implementation, and I think the existing well-designed ggml_backend subsystem is enough at the moment.

For example, if the upper-layer code just wants to test GGML_OP_MUL_MAT on different backends (including the default GGML CPU backend), the excellent but complicated ggml_backend_sched might not be suitable for that scenario; see the sketch below.
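To make that scenario concrete, here is a minimal editor's sketch (not code from this PR) of running a single GGML_OP_MUL_MAT on one explicitly chosen backend through the existing ggml-backend API; swapping ggml_backend_cpu_init() for another backend's init function would exercise the same op on that device:

    #include "ggml.h"
    #include "ggml-alloc.h"
    #include "ggml-backend.h"

    // editor's sketch: build a tiny graph with one mul_mat and compute it on
    // the given backend, e.g. ggml_backend_t cpu = ggml_backend_cpu_init();
    static void test_mul_mat_on_backend(ggml_backend_t backend) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // tensor data is allocated in backend buffers
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
        struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
        struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // 2x3 result

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, c);

        // place all tensors in the chosen backend and run the graph there
        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
        // ... fill a and b via ggml_backend_tensor_set(), then:
        ggml_backend_graph_compute(backend, gf);

        ggml_backend_buffer_free(buf);
        ggml_free(ctx);
    }

With a different backend (for example a QNN backend), the same helper would only need a different init call from the caller.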

thanks for your comment.

zhouwg (Contributor, Author) commented on Apr 20, 2024

There is an issue in the GitHub CI. Can someone from the CI team take a look at it? Thanks.

(screenshot of the CI issue, 2024-04-20)

zhouwg closed this on Apr 24, 2024