ggml-cuda.so is 90mb with -arch=all #7156
Comments
I assume this is simply a typo and you mean 90mb.
When we (slaren, a user, and I) tested compiling for different CUDA architectures months ago, we found that there is no measurable performance difference between compiling for the minimum needed CUDA architecture and compiling for the actual CUDA architecture of the GPU. So, assuming you use CUDA 12, it should be sufficient to compile for CUDA architectures 5.2, 6.0, 6.1, and 7.0 with the current code.
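For illustration, restricting the build to just those architectures could look roughly like the following. This is a sketch assuming a CMake-based build that honors the standard CMAKE_CUDA_ARCHITECTURES variable (the project-specific flag that enables the CUDA backend is omitted), not the canonical build command:

```sh
# Build only the compute capabilities that are actually needed (5.2, 6.0, 6.1, 7.0).
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="52;60;61;70"
cmake --build build

# Roughly the same idea when invoking nvcc directly on the CUDA sources:
nvcc -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_70,code=sm_70 \
     -c ggml-cuda.cu -o ggml-cuda.o
```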
The reasons why the FlashAttention kernel needs so much space are, in short, that (1) the computation itself is relatively complex, and (2) the kernel is instantiated many times via templates (per head size, data type, and so on) to avoid runtime branching.
The first reason is, I think, fundamentally unavoidable. The second reason can only be avoided if you accept a significant performance penalty or reduce the number of cases covered by the kernel. Intuitively I would think that a kernel without templating would be at least 2x slower.

What you could do on your end to reduce the file size without a performance penalty is to compile the kernel only for the head size of the model with which you package the code; all other head sizes are never going to be used anyway. In a similar manner you could compile only those kernels for quantized data that match the quantization format of the packaged model to reduce the file size further.
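As a rough illustration of why templating multiplies binary size (hypothetical names and shapes, not the actual fattn.cu code), every case in a dispatch like the one below instantiates a full, separate kernel, and the whole set is then duplicated once per targeted CUDA architecture:

```cpp
// Hypothetical sketch: a kernel templated on head size so that loop bounds and
// shared-memory sizes are compile-time constants instead of runtime branches.
template <int HEAD_SIZE>
__global__ void flash_attn_sketch(const float * Q, const float * K,
                                  const float * V, float * dst, int n_kv) {
    // A real kernel would do tiled KQ products, online softmax, etc.
    // What matters for file size: each instantiation is its own compiled kernel.
}

// Host-side dispatch: each case ships another copy of the kernel.
static void launch_flash_attn_sketch(int head_size, const float * Q, const float * K,
                                     const float * V, float * dst, int n_kv) {
    const dim3 grid(1), block(32);
    switch (head_size) {
        case  64: flash_attn_sketch< 64><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        case  80: flash_attn_sketch< 80><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        case  96: flash_attn_sketch< 96><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        case 112: flash_attn_sketch<112><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        case 128: flash_attn_sketch<128><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        case 256: flash_attn_sketch<256><<<grid, block>>>(Q, K, V, dst, n_kv); break;
        default: break; // unsupported head size
    }
}
```

Compiling only the case that matches the packaged model's head size (and, by the same logic, its quantization format) drops all the other instantiations, which is where the file-size saving suggested above comes from.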
The CUDA implementation for GGML_OP_FLASH_ATTN_EXT is as large as the rest of ggml-cuda combined. The heaviest function is this one:

llama.cpp/ggml-cuda/fattn.cu, lines 192 to 196 at 4426e29
GPU support for flash attention can't be included in llamafile because we deal with a 4GB limit on Windows.
For comparison, in December ggml-cuda.so built with -arch=all was 12mb. By February it was 16mb. By April it was 50mb. Now it's 90gb. On my project we've already started using gzip to compress the ggml-cuda DSO. We've also reduced our support vector to -arch=all-major. Everything that can be done is being done on our end, since I'd like to be able to include everything if possible. However, this op seems like it could benefit from a refactoring.