While running FP32 and INT8-quantized models through the oneAPI stack on CPUs, we observe SGEMM kernels being called from MKL. The number of SGEMM calls and the per-kernel timings are comparable in both runs, with the quantized model doing only slightly better than the FP32 model. In such a scenario, where do the advantages of the quantized model actually kick in?
Where are the quantized kernels being called, and why is the number of MKL SGEMM calls the same in both cases? As far as I know, MKL SGEMM only supports FP32; please correct me if I'm wrong.
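One scenario that would be consistent with what we see (purely an assumption on my part, not something we've confirmed in the oneAPI code path): if the backend dequantizes the INT8 weights back to FP32 before the matmul, the GEMM itself would still route to MKL SGEMM in both runs, and the savings would come mainly from smaller weight reads. A minimal sketch of that pattern:

```python
# Hypothetical illustration, NOT the actual oneAPI/MKL code path:
# dequantize INT8 weights to FP32, then do an FP32 GEMM, which
# (with an MKL-linked NumPy) dispatches to SGEMM.
import numpy as np

M, K, N = 64, 128, 32
x = np.random.rand(M, K).astype(np.float32)                 # FP32 activations
w_q = np.random.randint(-127, 128, (K, N), dtype=np.int8)   # INT8 weights
scale = np.float32(0.05)                                    # per-tensor scale (assumed)

w_fp32 = w_q.astype(np.float32) * scale                     # dequantize to FP32
y = x @ w_fp32                                              # FP32 GEMM -> SGEMM
print(y.shape)  # (64, 32)
```

If this is what happens, the SGEMM call counts and shapes would match between the two runs, which lines up with our traces.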
Are there any resources or comments that discuss quantization support and its merits/demerits from a CPU perspective?
Is llama-bench an ideal way to benchmark LLMs for throughput? What scripts do you suggest for measuring first-token latency, total latency for 60 tokens, and next-token latency?
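For concreteness, this is the kind of measurement we're after; a rough sketch using the llama-cpp-python bindings (model path and prompt are placeholders, and I'm not claiming this is the canonical method):

```python
# Sketch: time first token, total for 60 tokens, and average next-token
# latency by streaming tokens from llama-cpp-python. Placeholder model/prompt.
import time
from llama_cpp import Llama

llm = Llama(model_path="model-q8_0.gguf")  # placeholder path

start = time.perf_counter()
first_token_time = None
n_tokens = 0
for chunk in llm("Once upon a time", max_tokens=60, stream=True):
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now - start              # first-token latency
    n_tokens += 1
total = time.perf_counter() - start                 # total latency for 60 tokens
next_token = (total - first_token_time) / max(n_tokens - 1, 1)
print(f"first token: {first_token_time:.3f}s, total: {total:.3f}s, "
      f"next-token avg: {next_token * 1000:.1f}ms")
```

If there are better-suited scripts in the repo for these three numbers, pointers would be appreciated.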
PS: Logs attached for reference!
fp32_mkl_logs.txt
q8_mkl_logs.txt
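In case it helps reproduce our numbers, here is a rough sketch of how the SGEMM counts and timings could be tallied from such logs. This assumes MKL_VERBOSE=1-style lines (e.g. `MKL_VERBOSE SGEMM(...) 28.20ms ...`); adjust the regex if the actual format differs:

```python
# Aggregate per-kernel call counts and total time from an MKL_VERBOSE log.
# Assumes lines roughly like: MKL_VERBOSE SGEMM(...) 28.20ms ...
import re
import sys
from collections import defaultdict

UNIT = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
pat = re.compile(r"MKL_VERBOSE\s+(\w+)\(.*?\)\s+([\d.]+)(ns|us|ms|s)\b")

totals, counts = defaultdict(float), defaultdict(int)
with open(sys.argv[1]) as f:
    for line in f:
        m = pat.search(line)
        if m:
            kernel, t, unit = m.group(1), float(m.group(2)), m.group(3)
            totals[kernel] += t * UNIT[unit]
            counts[kernel] += 1

for kernel in sorted(totals, key=totals.get, reverse=True):
    print(f"{kernel}: {counts[kernel]} calls, {totals[kernel]:.3f}s total")
```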