
ggml:add new member in GGML's internal data structure #2073

Closed
zhouwg wants to merge 2 commits

Conversation

zhouwg (Contributor) commented on Apr 17, 2024

Purpose

This PR serves multiple purposes:

(1) Borrow some advantages from Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK. It is a well-designed and concise SDK, and its API maps closely onto GGML's existing design (including the existing backend design).

This PR borrows the "rank" member from the QNN SDK, reusing the existing code as much as possible without adding side effects or complexity to it.

(2) Borrow some advantages from PyTorch: the user can specify whether a GGML op (such as mul_mat) is accelerated by a specific backend.

This PR borrows the "use_hwaccel" idea from PyTorch, again reusing the existing code as much as possible without adding side effects or complexity to it.

(3) Cover more scenarios from the upper-layer code (see the "Explanation" section). Using multiple backends simultaneously is not included, because that is a separate topic/scenario.

(4) (The main purpose) prepare for submitting Qualcomm's QNN (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) backend to the upstream GGML community (first in whisper.cpp, then in llama.cpp): PoC: Add Qualcomm mobile SoC native backend for GGML

Status

This PR has been verified/validated in whisper.cpp with the QNN backend on different Android phones (a Qualcomm SoC based low-end phone and a Qualcomm SoC based high-end phone).

Explanation

This PR is useful for several scenarios.

In fact, the "gpu_device" field in struct whisper_context_params is semantically similar to use_hwaccel. There are 2 * n combinations here: 2 (use_gpu: true/false) * n (backend_device: 1..n). Note that a DSP backend is also treated as a "gpu_device" here, because we should reuse the existing code as much as possible without adding side effects or complexity to it.

   struct whisper_context_params {
        bool  use_gpu;
        int   gpu_device;  // CUDA device

        // [EXPERIMENTAL] Token-level timestamps with DTW
        bool dtw_token_timestamps;
        enum whisper_alignment_heads_preset dtw_aheads_preset;

        int dtw_n_top;
        struct whisper_aheads dtw_aheads;

        size_t dtw_mem_size; // TODO: remove
    };

So a special value of "gpu_device" could be interpreted as "no hardware acceleration", falling back to the original default GGML CPU backend.
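Purely as an illustration (the helper names below are invented for this explanation; they are not from whisper.cpp or this PR), that interpretation could look like this, with the 2 * n combinations collapsing to either the CPU path or one of the n devices:

    // hypothetical helpers (invented for this explanation, not from whisper.cpp
    // or this PR): interpret the (use_gpu, gpu_device) pair, treating
    // use_gpu == false or a special negative device id as "no hardware
    // acceleration", i.e. the default GGML CPU backend. a DSP backend is
    // exposed through the same gpu_device id space as the GPUs.
    static bool example_use_hwaccel(bool use_gpu, int gpu_device) {
        return use_gpu && gpu_device >= 0;
    }

    static int example_backend_device(bool use_gpu, int gpu_device) {
        // 2 * n combinations collapse to: CPU (-1) or one of the n devices
        return example_use_hwaccel(use_gpu, gpu_device) ? gpu_device : -1;
    }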

Accordingly, we can reuse the existing "backend" member in the core data structure together with a new member "use_hwaccel" in struct ggml_context. By the way, I personally think we should not remove the existing "backend" member from struct ggml_tensor, although there is a plan to remove it:

  // n-dimensional tensor
    struct ggml_tensor {
        enum ggml_type         type;
        enum ggml_backend_type backend;

        struct ggml_backend_buffer * buffer;

        int64_t ne[GGML_MAX_DIMS]; // number of elements
        size_t  nb[GGML_MAX_DIMS]; // stride in bytes:
                                   // nb[0] = ggml_type_size(type)
                                   // nb[1] = nb[0]   * (ne[0] / ggml_blck_size(type)) + padding
                                   // nb[i] = nb[i-1] * ne[i-1]

        // compute data
        enum ggml_op op;

        // op params - allocated as int32_t for alignment
        int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];

        int32_t flags;

        struct ggml_tensor * grad;
        struct ggml_tensor * src[GGML_MAX_SRC];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        struct ggml_tensor * view_src;
        size_t               view_offs;

        void * data;

        char name[GGML_MAX_NAME];

        void * extra; // extra things e.g. for ggml-cuda.cu

        int32_t rank;

        char padding[20];
    };

The reason for this is that there are some scenarios (not including using multiple backends simultaneously, which is a separate topic/scenario) that "use_gpu & gpu_device" cannot cover.

I personally think these new members ("use_hwaccel" in struct ggml_context and "rank" in struct ggml_tensor) are not redundant, and they will NOT bring side effects to the existing code. Of course, I understand that we should not bring too much extra code (even useful code) into GGML's internal implementation and that we should keep GGML as compact/clean as possible, so this PR reuses existing code to the maximum extent possible:

https://github.com/zhouwg/whisper.cpp/blob/add_hwaccel_in_data_structure/ggml.c#L2995
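As an editorial sketch of the "rank" idea (an assumption for illustration; the actual change lives in the linked branch and is not reproduced here), the new member could be derived from the existing ne[] dimensions, in the spirit of QNN's tensor rank. This is essentially the value the existing ggml_n_dims() helper already computes:

    // editor's sketch (assumption, not the PR's actual diff): derive the new
    // "rank" member from the existing ne[] dimensions, mirroring QNN's notion
    // of tensor rank (number of used dimensions). this matches what the
    // existing ggml_n_dims() helper returns.
    static int32_t example_tensor_rank(const struct ggml_tensor * t) {
        int32_t rank = GGML_MAX_DIMS;
        // trailing dimensions of size 1 do not count towards the rank
        while (rank > 1 && t->ne[rank - 1] == 1) {
            rank--;
        }
        return rank;
    }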

PR approval request

@hey-shashikant, thanks for your time and your approval of #2054 (which is the same as this PR). Could you help take a look at this PR? Thanks so much.

@slaren, sorry to interrupt you, could you help take a look? Thanks.

@ggerganov, sorry to interrupt you, could you help take a look? Thanks.

zhouwg changed the title from "ggml:add hwaccel in data structure" to "ggml:add use_hwaccel in data structure" on Apr 17, 2024
zhouwg changed the title from "ggml:add use_hwaccel in data structure" to "ggml:add new member in data structure" on Apr 20, 2024
zhouwg changed the title from "ggml:add new member in data structure" to "ggml:add new member in GGML's internal data structure" on Apr 20, 2024
slaren (Collaborator) commented on Apr 20, 2024

We have a mechanism in place for using multiple backends simultaneously with ggml_backend_sched. Have you looked into this?

zhouwg (Contributor, Author) commented on Apr 20, 2024

ggml_backend_sched

Thanks for your comment. I'll try it and check whether ggml_backend_sched can cover the above scenarios. Before studying the details of ggml_backend_sched, I think the method in this PR is more clear/straightforward.

slaren (Collaborator) commented on Apr 20, 2024

Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the first moment to support this case. There is a work in progress in ggerganov/llama.cpp#6210 that uses the supports_op function of the backends. Feel free to continue working on that if you find it interesting for your use case, at the moment it is not a priority for me.
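As background, here is a minimal editor's sketch (an illustration based on the shape of the supports_op callback in ggml-backend-impl.h, slightly simplified) of how a backend that implements only a small subset of operations could report that subset, so the scheduler can fall back to the CPU backend for everything else:

    // editor's illustration: a backend that only accelerates a few ops reports
    // them through its supports_op callback; everything else can then be
    // scheduled on another backend (e.g. the CPU backend) by ggml_backend_sched.
    static bool example_backend_supports_op(ggml_backend_t backend, const struct ggml_tensor * op) {
        (void) backend; // unused in this sketch
        switch (op->op) {
            case GGML_OP_MUL_MAT:
            case GGML_OP_ADD:
                return true;  // offloaded to the accelerator
            default:
                return false; // handled by another backend
        }
    }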

zhouwg (Contributor, Author) commented on Apr 20, 2024

Currently it has some limitations that I suspect would prevent using it with backends that only implement a small subset of operations. However, it was designed from the first moment to support this case. There is a work in progress in ggerganov/llama.cpp#6210 that uses the supports_op function of the backends. Feel free to continue working on that if you find it interesting for your use case, at the moment it is not a priority for me.

OK, I see. But I think you might change your mind if you try/validate this PR on a Qualcomm SoC based phone. At the same time, I personally think the method in this PR reuses the existing code as much as possible; it is simple/clear and covers more scenarios in a straightforward way. ggml_backend_sched is an excellent idea, but its design might bring more complicated things into GGML's internals: the API designer/maintainer should not do work that ought to be done in a backend's implementation, and I think the existing well-designed ggml_backend subsystem is enough at the moment.

For example, if the upper-layer code just wants to test GGML_OP_MUL_MAT on different backends (including the default GGML CPU backend), the excellent but complicated ggml_backend_sched might not be suitable for that scenario; see the sketch below.
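To make that scenario concrete, here is a minimal editor's sketch (not code from this PR) of running a single GGML_OP_MUL_MAT on one explicitly chosen backend through the existing ggml-backend API; swapping ggml_backend_cpu_init() for another backend's init function would exercise the same op on that device:

    #include "ggml.h"
    #include "ggml-alloc.h"
    #include "ggml-backend.h"

    // editor's sketch: build a tiny graph with one mul_mat and compute it on
    // the given backend, e.g. ggml_backend_t cpu = ggml_backend_cpu_init();
    static void test_mul_mat_on_backend(ggml_backend_t backend) {
        struct ggml_init_params params = {
            /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true, // tensor data is allocated in backend buffers
        };
        struct ggml_context * ctx = ggml_init(params);

        struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
        struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3);
        struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // 2x3 result

        struct ggml_cgraph * gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, c);

        // place all tensors in the chosen backend and run the graph there
        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
        // ... fill a and b via ggml_backend_tensor_set(), then:
        ggml_backend_graph_compute(backend, gf);

        ggml_backend_buffer_free(buf);
        ggml_free(ctx);
    }

With a different backend (for example a QNN backend), the same helper would only need a different init call from the caller.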

thanks for your comment.

zhouwg (Contributor, Author) commented on Apr 20, 2024

There is an issue in the GitHub CI. Can someone from the CI team take a look at it? Thanks.

(screenshot of the CI issue, 2024-04-20)

zhouwg closed this on Apr 24, 2024