Replies: 3 comments
-
So, my concept is that […] We already have […] When we go to 5 bpw, there is very little difference between […] At 6 bpw, basically nothing I have tried is better than […]

Anyone else apart from @Nexesenex interested in such additional quants?
-
The point of having as many GGML types available as possible is to let interested folks build their own quantization strategies, tensor by tensor. Why not share the GGML types you created that work as intended with the community, even if you don't develop new quant strategies on top of them, instead of leaving that amazing work to sleep in your private repo? Trust the folks trying them to determine whether they are useful. Only by exploring can we find the combinations which work best, and more people exploring multiplies the chances.

I didn't find a ready-made formula determining the interaction between the nine different kinds of tensors in a model like Llama 2 and the ideal quantization strategy at a given overall bpw, so trial and error it is. But your IQ1_S quant strategy can already be improved with the available GGML types, and an IQ1_M GGML type at 1.8125 bpw would make it possible to build 2.0x bpw and even sub-2 bpw overall quants (for 34b+ models, and MoEs) which are actually usable beyond a demo, without toying too much with sketchy combos where the ffn tensors are quantized partly in IQ2_XXS and partly in IQ1_S.

As for IQ2_M: we do have a quant strategy named like that, but no GGML_Type IQ2_M. The IQ2_M strategy is based on GGML_Type IQ2_S (2.5625 bpw).
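To make "tensor by tensor" concrete, here is a minimal C++ sketch of such a strategy, in the spirit of llama.cpp's per-tensor type selection. Everything here is hypothetical: the enum is a stand-in for ggml_type, MY_IQ1_M is the requested 1.8125 bpw type (it does not exist upstream), and tensor names follow GGUF conventions.

```cpp
#include <string>

// Hypothetical stand-in for ggml_type; MY_IQ1_M (1.8125 bpw) is the type
// requested in this thread and does not exist upstream.
enum my_ggml_type { MY_IQ1_S, MY_IQ1_M, MY_IQ2_XXS, MY_IQ2_S, MY_IQ3_XXS };

// One possible sub-2 bpw mix: spend bits where errors hurt most
// (output, attn_v, ffn_down) and starve the more forgiving tensors
// (ffn_up, ffn_gate, attn_q).
static my_ggml_type pick_type(const std::string & name) {
    auto has = [&](const char * s) { return name.find(s) != std::string::npos; };
    if (name == "output.weight")          return MY_IQ3_XXS; // output head: keep quality
    if (has("attn_output"))               return MY_IQ2_S;
    if (has("attn_v"))                    return MY_IQ2_S;
    if (has("ffn_down"))                  return MY_IQ2_XXS;
    if (has("ffn_up") || has("ffn_gate")) return MY_IQ1_M;   // the requested 1.8125 bpw type
    if (has("attn_q"))                    return MY_IQ1_M;
    return MY_IQ2_XXS;                    // everything else (norms stay f32 regardless)
}
```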
-
@Nexesenex Why don't we start with you sharing that improvement with us?
-
@ikawrakow,
I don't know how much work what follows would require if you haven't already done it on your private repo, but I think it would be great to have a 1.8125 bpw (and maybe a 2.8125 bpw) GGML type, in order to improve the granularity of the quantization of the FFN and attn.q.weight tensors, and to establish refined strategies for sub-IQ2_XXS and sub-IQ3_XXS quantized models. Beyond 3 bpw, the larger bpw intervals between GGML types are less "problematic".
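As a back-of-the-envelope check of what such a type buys, here is a small C++ computation of overall bpw for one possible mix. The parameter fractions are my own rough figures for a Llama-2-7B-shaped model (d_model = 4096, d_ffn = 11008; embeddings and norms ignored), and the type assignments are just an illustration, not a recommendation:

```cpp
#include <cstdio>

// Weighted-average bpw of a per-tensor quant mix. Fractions are assumed
// (Llama-2-7B-like shapes: attention = 4*d^2, each ffn matrix = d*d_ffn).
int main() {
    struct slice { const char * tensors; double frac; double bpw; };
    const slice mix[] = {
        { "ffn_up + ffn_gate",    0.446, 1.8125 }, // hypothetical IQ1_M
        { "ffn_down",             0.223, 2.5625 }, // IQ2_S
        { "attn_q + attn_k",      0.166, 1.8125 }, // hypothetical IQ1_M
        { "attn_v + attn_output", 0.165, 2.5625 }, // IQ2_S
    };
    double total = 0.0;
    for (const slice & s : mix) {
        total += s.frac * s.bpw;
    }
    printf("overall: %.2f bpw\n", total); // prints: overall: 2.10 bpw
    return 0;
}
```

With these choices the mix lands around 2.1 bpw overall; getting under 2.0 bpw means moving more slices onto the cheaper type, which is exactly where finer bpw granularity would help.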
I don't know where the mathematical breaking point of catastrophic quality loss lies (you have obviously already pushed it below 1.5 bpw with your IQ1_S_"EvenBetter" GGML type), but the attn.q.weight and ffn tensors (notably .up and .gate) might even be able to endure a sub-1.5 bpw quant while still allowing a quant strategy that uses it to remain on the same "curve" as the one presented in Artefact's graph.
Also, considering the massive improvements your IQ quants have brought in terms of quality/size ratio, higher IQ GGML types in the 4.5-6 bpw range are highly awaited, especially for the attn.v.weight, attn.output.weight, ffn.down, and output tensors.