Please help me configure BLAS=1 on an RTX 3070 under Windows 11.
I installed with pip install llama-cpp-python==0.2.23 --no-cache-dir.
Thank you so much.
·························
Your GPU is probably not being used at all, which would explain the slow answers.
You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (2 bytes each) the model alone is 14 GB.
As your GPU has only 6 GB, it will probably not be usable for any reasonable model.
For example, I have a 3070 with 8 GB, and even with the 2-bit quantized version of a 7-billion-parameter model (which probably has very low quality) I run out of GPU RAM, because cuBLAS requires extra space.
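To make that arithmetic concrete, here is a quick back-of-the-envelope estimate of the weight footprint at common quantization levels. This is a sketch covering the weights only; the KV cache, cuBLAS scratch buffers, and the CUDA context all add overhead on top of these numbers.

```python
# Rough size of the model weights alone for a 7-billion-parameter model.
# Real GGUF files differ slightly (mixed-precision layers, metadata).
params = 7e9

for bits in (16, 8, 5, 4, 2):
    gb = params * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{bits:2d}-bit weights: ~{gb:.2f} GB")
```

This reproduces the figures above: 16-bit weights come to 14 GB, and even a 2-bit quantization still needs about 1.75 GB before any runtime overhead.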
Just for clarity: GGUF models are quantized models and, in this code, are meant to run on the CPU and system memory. If you want to run the model on the GPU, you must select HF (Hugging Face) models in this code, which requires your account and an HF token for login while downloading the model (first time only).
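One caveat on the comment above: a llama-cpp-python build compiled with cuBLAS can offload GGUF layers to the GPU through the n_gpu_layers parameter, so GGUF files are not inherently CPU-only. A minimal sketch, assuming a cuBLAS-enabled build (see the rebuild commands at the end of this thread); the model path is a placeholder, not a file from this thread:

```python
from llama_cpp import Llama

# Without a cuBLAS build, n_gpu_layers is ignored and inference stays on the CPU.
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 offloads all layers; lower it if VRAM runs out
    n_ctx=2048,
)

out = llm("Q: What does BLAS stand for? A:", max_tokens=32)
print(out["choices"][0]["text"])
```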
Originally posted by @KonradHoeffner in #231 (comment)
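For completeness, the usual way to get BLAS = 1 on Windows is to reinstall llama-cpp-python with the cuBLAS backend enabled at build time. A sketch, assuming the NVIDIA CUDA Toolkit and the MSVC build tools are already installed (the flag was later renamed, but -DLLAMA_CUBLAS=on is the one matching the 0.2.23 era):

```powershell
# PowerShell: rebuild llama-cpp-python from source with cuBLAS enabled.
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
$env:FORCE_CMAKE = "1"
pip install llama-cpp-python==0.2.23 --no-cache-dir --force-reinstall
```

If the build picked up CUDA, loading a model should print BLAS = 1 in the system_info line of the log, and layers offloaded via n_gpu_layers will then run on the GPU.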