
Your GPU is probably not used at all, which would explain the slow speed in answering. #750

Open
thomasmeneghelli opened this issue Feb 18, 2024 · 1 comment

Comments

@thomasmeneghelli

Please help me configure BLAS=1 on an RTX 3070 under Windows 11.
I have llama-cpp-python==0.2.23 installed with --no-cache-dir.

Thank you so much
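For context, a minimal sketch of one common approach for getting BLAS = 1 with llama-cpp-python 0.2.x: reinstall the package with the cuBLAS CMake flag, then offload layers with n_gpu_layers. This assumes a CUDA toolkit and a working C++ build chain are installed; the install commands, model path, and prompt below are placeholders, not anything confirmed in this thread.

# Hypothetical reinstall (PowerShell), assumed from the llama-cpp-python docs of that era:
#   $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
#   pip install llama-cpp-python==0.2.23 --force-reinstall --no-cache-dir

from llama_cpp import Llama

# n_gpu_layers=-1 asks llama.cpp to offload every layer it can to the GPU;
# verbose=True prints the system-info line, where "BLAS = 1" indicates the
# cuBLAS build is active.
llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])

If the startup log still reports BLAS = 0, the wheel was built without cuBLAS and the reinstall step did not take effect.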
·························

Your GPU is probably not used at all, which would explain the slow speed in answering.

You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (= 2 bytes each), your model is 14 GB in size.

As your GPU only has 6 GB, it will probably not be useful for any reasonable model.

For example, I have a 3070 with 8 GB, and even with the 2-bit quantized version (which probably has very low quality) of a 7-billion-parameter model I run out of GPU RAM because cuBLAS requires extra space.

Originally posted by @KonradHoeffner in #231 (comment)
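To make the quoted arithmetic concrete, here is a small back-of-the-envelope sketch of the weight memory for a 7-billion-parameter model at different bit widths. It only counts the weights; the real VRAM requirement is higher because of the KV cache and the cuBLAS scratch buffers mentioned above.

# Rough estimate of weight memory only (no KV cache, no scratch buffers).
PARAMS = 7e9  # 7-billion-parameter model

for bits in (16, 8, 4, 2):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")

# 16-bit weights: ~14.0 GB  -> exceeds both a 6 GB and an 8 GB card
#  8-bit weights:  ~7.0 GB
#  4-bit weights:  ~3.5 GB
#  2-bit weights:  ~1.8 GB  -> smallest, but quality suffers noticeably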

@TechInnovate01

Just for clarity, GGUF models are quantized models and are meant to run on CPU and system memory. If you want to run the model on the GPU, you must select HF (Hugging Face) models in this code, which requires your account and an HF token to log in while downloading the model (first time only).
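The exact download code depends on the project, but the one-time Hugging Face login mentioned above typically looks like the following sketch using the huggingface_hub package; the token value and repository id here are illustrative placeholders, not what this project actually uses.

from huggingface_hub import login, snapshot_download

# One-time login with a personal access token created at
# https://huggingface.co/settings/tokens (the value below is a placeholder).
login(token="hf_xxxxxxxxxxxxxxxxxxxx")

# Hypothetical example: download a model repository into the local cache.
local_dir = snapshot_download(repo_id="TheBloke/Llama-2-7B-Chat-GGUF")
print(local_dir)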
