
Your GPU is probably not used at all, which would explain the slow speed in answering. #750

Open
thomasmeneghelli opened this issue Feb 18, 2024 · 1 comment

Comments

@thomasmeneghelli

Please help me configure BLAS=1 on an RTX 3070 under Windows 11.
I have llama-cpp-python==0.2.23 installed with --no-cache-dir.

Thank you so much
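For context, a minimal sketch of one common approach for getting BLAS = 1 with llama-cpp-python 0.2.x: reinstall the package with the cuBLAS CMake flag, then offload layers with n_gpu_layers. This assumes a CUDA toolkit and a working C++ build chain are installed; the install commands, model path, and prompt below are placeholders, not anything confirmed in this thread.

# Hypothetical reinstall (PowerShell), assumed from the llama-cpp-python docs of that era:
#   $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
#   pip install llama-cpp-python==0.2.23 --force-reinstall --no-cache-dir

from llama_cpp import Llama

# n_gpu_layers=-1 asks llama.cpp to offload every layer it can to the GPU;
# verbose=True prints the system-info line, where "BLAS = 1" indicates the
# cuBLAS build is active.
llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])

If the startup log still reports BLAS = 0, the wheel was built without cuBLAS and the reinstall step did not take effect.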
·························

Your GPU is probably not used at all, which would explain the slow speed in answering.

You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (= 2 bytes each), your model is 14 GB in size.

As your GPU only has 6 GB, it will probably not be useful for any reasonable model.

For example, I have a 3070 with 8 GB, and even with the 2-bit quantized version (which probably has very low quality) of a 7-billion-parameter model I run out of GPU RAM because cuBLAS requires extra space.

Originally posted by @KonradHoeffner in #231 (comment)
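To make the quoted arithmetic concrete, here is a small back-of-the-envelope sketch of the weight memory for a 7-billion-parameter model at different bit widths. It only counts the weights; the real VRAM requirement is higher because of the KV cache and the cuBLAS scratch buffers mentioned above.

# Rough estimate of weight memory only (no KV cache, no scratch buffers).
PARAMS = 7e9  # 7-billion-parameter model

for bits in (16, 8, 4, 2):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")

# 16-bit weights: ~14.0 GB  -> exceeds both a 6 GB and an 8 GB card
#  8-bit weights:  ~7.0 GB
#  4-bit weights:  ~3.5 GB
#  2-bit weights:  ~1.8 GB  -> smallest, but quality suffers noticeably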

@TechInnovate01

Just for clarity, GGUF models are quantized models and are meant to run on CPU and system memory. If you want to run the model on the GPU, you must select HF (Hugging Face) models in this code, which requires your account and an HF token to log in while downloading the model (first time only).
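The exact download code depends on the project, but the one-time Hugging Face login mentioned above typically looks like the following sketch using the huggingface_hub package; the token value and repository id here are illustrative placeholders, not what this project actually uses.

from huggingface_hub import login, snapshot_download

# One-time login with a personal access token created at
# https://huggingface.co/settings/tokens (the value below is a placeholder).
login(token="hf_xxxxxxxxxxxxxxxxxxxx")

# Hypothetical example: download a model repository into the local cache.
local_dir = snapshot_download(repo_id="TheBloke/Llama-2-7B-Chat-GGUF")
print(local_dir)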
