Assertion failure on quantization of Meta-Llama-3-70B-Instruct from f16 to various quantization types. #7215
Comments
Actually, let's also look into the reason why the original (i.e. created by `convert.py`) f16 model fails to quantize.
I expected to see the quantization succeed.
Use the `python3.12 ~/Software/AI/llama.cpp/convert-hf-to-gguf.py Meta-Llama-3-70B-Instruct/ --outfile Meta-Llama-3-70B-Instruct.f16.gguf --outtype f16` command.
Thank you, I tried that just now and it failed with the following error:
Note that there is no
This is the complete `log.txt` with the stdout + stderr of the above invocation of `convert-hf-to-gguf.py`.
@compilade Mind taking a look at the log above?
@tigran123 It looks like some of the model parts are missing from your local copy of the model.
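As an aside, a quick way to check for this kind of problem is to compare the shard filenames listed in the Hugging Face index file against what is actually on disk. The snippet below is only a sketch under that assumption; the directory name is taken from the conversion command above and is not part of the thread.

```sh
# Sketch only: list every safetensors shard referenced by the HF index file
# and flag the ones that are not present in the local model directory.
cd Meta-Llama-3-70B-Instruct
grep -o 'model-[0-9]*-of-[0-9]*\.safetensors' model.safetensors.index.json \
  | sort -u \
  | while read -r shard; do
      [ -f "$shard" ] || echo "missing: $shard"
    done
```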
Oh dear, I am so sorry -- I should have noticed that! Oh, I am so embarrassed, I must be really getting old, to miss such a trivial thing... I will download the missing files and will report if there are any problems with conversion and/or quantizing.
Just to confirm that @compilade was absolutely correct -- after downloading the required files everything worked correctly: the f16 model was generated, but it failed to load because it required 138GB and I only have 128GB RAM, so I quantised it to Q8_0, which loaded just fine. So this issue can be closed.
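For a setup like the one described (128GB RAM, 11GB VRAM), a hedged sketch of loading the resulting Q8_0 file with partial GPU offload; the binary name and layer count here are assumptions, not taken from the thread:

```sh
# Sketch: load the Q8_0 model and offload a handful of layers to the GPU.
# The layer count is an assumption; tune -ngl until it fits in 11GB of VRAM.
./main -m Meta-Llama-3-70B-Instruct.Q8_0.gguf -ngl 8 -p "Hello"
```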
First I downloaded the `meta-llama/Meta-Llama-3-70B-Instruct` model from HF. Then I converted it to f16 using the `convert.py` script from `llama.cpp`, like this:

This worked fine and produced a 108GB file. Unfortunately, I could not load it on my server, because it only has 128GB RAM and an RTX 2080 Ti with 11GB VRAM, so there was no way to load it either with or without the `-ngl` option. So, I converted the original HF files to `Q8_0` instead (again using `convert.py`) and it also could not be loaded. Then I decided to quantize the `f16.gguf` file using the `quantize` utility from `llama.cpp`, and this is where the problems started. I naturally started with the highest quality, `Q6_K`:

Then I tried `Q5_K_M` (omitting the number of threads, which made no difference):

And so on; I tried a few more types, which all failed likewise:
The version of llama.cpp is very recent -- cloned yesterday evening.
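For context, the quantization step described above takes the f16 GGUF, an output path, the target type, and an optional thread count. The commands below are a sketch only, assuming a default `llama.cpp` build where the tool is named `quantize` (newer builds use `llama-quantize`); they are not the reporter's exact invocations:

```sh
# Sketch, not the reporter's exact commands: quantize the f16 GGUF to two of
# the types mentioned above. The trailing number is the optional thread count.
./quantize Meta-Llama-3-70B-Instruct.f16.gguf Meta-Llama-3-70B-Instruct.Q6_K.gguf Q6_K 8
./quantize Meta-Llama-3-70B-Instruct.f16.gguf Meta-Llama-3-70B-Instruct.Q5_K_M.gguf Q5_K_M
```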