While running FP32 and INT8-quantized models through the oneAPI stack on CPUs, we observe SGEMM kernels being called from MKL. The number of SGEMM calls and the per-kernel timings are comparable in both runs, with the quantized model doing only slightly better than the FP32 model. In such a scenario, where do the advantages of the quantized model actually kick in?
Where are the quantized kernels being called, and why is the number of MKL SGEMM calls the same in both cases? As far as I know, MKL SGEMM only supports FP32; please correct me if I'm wrong.
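One scenario that would be consistent with what we see (purely an assumption on my part, not something we've confirmed in the oneAPI code path): if the backend dequantizes the INT8 weights back to FP32 before the matmul, the GEMM itself would still route to MKL SGEMM in both runs, and the savings would come mainly from smaller weight reads. A minimal sketch of that pattern:

```python
# Hypothetical illustration, NOT the actual oneAPI/MKL code path:
# dequantize INT8 weights to FP32, then do an FP32 GEMM, which
# (with an MKL-linked NumPy) dispatches to SGEMM.
import numpy as np

M, K, N = 64, 128, 32
x = np.random.rand(M, K).astype(np.float32)                 # FP32 activations
w_q = np.random.randint(-127, 128, (K, N), dtype=np.int8)   # INT8 weights
scale = np.float32(0.05)                                    # per-tensor scale (assumed)

w_fp32 = w_q.astype(np.float32) * scale                     # dequantize to FP32
y = x @ w_fp32                                              # FP32 GEMM -> SGEMM
print(y.shape)  # (64, 32)
```

If this is what happens, the SGEMM call counts and shapes would match between the two runs, which lines up with our traces.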
Are there any resources or comments that discuss quantization support and its merits/demerits from a CPU perspective?
Is llama-bench an ideal way to benchmark LLMs for throughput? What scripts do you suggest for measuring first-token latency, total latency for 60 tokens, and next-token latency?
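For concreteness, this is the kind of measurement we're after; a rough sketch using the llama-cpp-python bindings (model path and prompt are placeholders, and I'm not claiming this is the canonical method):

```python
# Sketch: time first token, total for 60 tokens, and average next-token
# latency by streaming tokens from llama-cpp-python. Placeholder model/prompt.
import time
from llama_cpp import Llama

llm = Llama(model_path="model-q8_0.gguf")  # placeholder path

start = time.perf_counter()
first_token_time = None
n_tokens = 0
for chunk in llm("Once upon a time", max_tokens=60, stream=True):
    now = time.perf_counter()
    if first_token_time is None:
        first_token_time = now - start              # first-token latency
    n_tokens += 1
total = time.perf_counter() - start                 # total latency for 60 tokens
next_token = (total - first_token_time) / max(n_tokens - 1, 1)
print(f"first token: {first_token_time:.3f}s, total: {total:.3f}s, "
      f"next-token avg: {next_token * 1000:.1f}ms")
```

If there are better-suited scripts in the repo for these three numbers, pointers would be appreciated.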
PS: Logs attached for reference!
fp32_mkl_logs.txt
q8_mkl_logs.txt
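In case it helps reproduce our numbers, here is a rough sketch of how the SGEMM counts and timings could be tallied from such logs. This assumes MKL_VERBOSE=1-style lines (e.g. `MKL_VERBOSE SGEMM(...) 28.20ms ...`); adjust the regex if the actual format differs:

```python
# Aggregate per-kernel call counts and total time from an MKL_VERBOSE log.
# Assumes lines roughly like: MKL_VERBOSE SGEMM(...) 28.20ms ...
import re
import sys
from collections import defaultdict

UNIT = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
pat = re.compile(r"MKL_VERBOSE\s+(\w+)\(.*?\)\s+([\d.]+)(ns|us|ms|s)\b")

totals, counts = defaultdict(float), defaultdict(int)
with open(sys.argv[1]) as f:
    for line in f:
        m = pat.search(line)
        if m:
            kernel, t, unit = m.group(1), float(m.group(2)), m.group(3)
            totals[kernel] += t * UNIT[unit]
            counts[kernel] += 1

for kernel in sorted(totals, key=totals.get, reverse=True):
    print(f"{kernel}: {counts[kernel]} calls, {totals[kernel]:.3f}s total")
```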