perplexity: add BF16 vs. FP16 results

ggerganov · May 9, 2024 · aa0296e
1 parent 83330d8
Showing 1 changed file with 41 additions and 1 deletion.
diff --git a/examples/perplexity/README.md b/examples/perplexity/README.md
@@ -32,7 +32,7 @@ In addition to the KL divergence the following statistics are calculated with `-
 
 ## LLaMA 3 8b Scoreboard
 
-Results are sorted by Kullback-Leibler divergence relative to FP16.
+Results were generated using the CUDA backend and are sorted by Kullback-Leibler divergence relative to FP16.
 The "WT" importance matrices were created using varying numbers of Wikitext tokens and can be found [here](https://huggingface.co/JohannesGaessler/llama.cpp_importance_matrices/blob/main/imatrix-llama_3-8b-f16-2.7m_tokens.dat).
 
 | Quantization | imatrix | Model size [GiB] | PPL | ΔPPL | KLD | Mean Δp | RMS Δp |
@@ -89,6 +89,8 @@ K-quants score better on mean Δp than the legacy quants than e.g. KL divergence
 
 ## LLaMA 2 vs. LLaMA 3 Quantization comparison
 
+Results were generated using the CUDA backend.
+
 | Metric | L2 7b q2_K | L3 8b q2_K | L2 7b q4_K_M | L3 8b q4_K_M | L2 7b q6_K | L3 8b q6_K | L2 7b q8_0 | L3 8b q8_0 |
 |-----------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
 | Mean PPL | 5.794552 ± 0.032298 | 9.751568 ± 0.063312 | 5.877078 ± 0.032781 | 6.407115 ± 0.039119 | 5.808494 ± 0.032425 | 6.253382 ± 0.038078 | 5.798542 ± 0.032366 | 6.234284 ± 0.037878 |
@@ -107,6 +109,44 @@ K-quants score better on mean Δp than the legacy quants than e.g. KL divergence
 | RMS Δp | 9.762 ± 0.053 % | 21.421 ± 0.079 % | 3.252 ± 0.024 % | 5.519 ± 0.050 % | 1.339 ± 0.010 % | 2.295 ± 0.019 % | 0.618 ± 0.011 % | 1.198 ± 0.007 % |
 | Same top p | 85.584 ± 0.086 % | 71.138 ± 0.119 % | 94.665 ± 0.055 % | 91.901 ± 0.072 % | 97.520 ± 0.038 % | 96.031 ± 0.051 % | 98.846 ± 0.026 % | 97.674 ± 0.040 % |
 
+## LLaMA 3 BF16 vs. FP16 comparison
+
+Results were generated using the CPU backend with LLaMA 3 8b BF16 as `--kl-divergence-base` and LLaMA 3 8b FP16 as the `--model` for comparison.
+
+| Metric | Value |
+|--------------------------------|--------------------------|
+| Mean PPL(Q) | 6.227711 ± 0.037833 |
+| Mean PPL(base) | 6.225194 ± 0.037771 |
+| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.990% |
+| Mean ln(PPL(Q)/PPL(base)) | 0.000404 ± 0.000086 |
+| Mean PPL(Q)/PPL(base) | 1.000404 ± 0.000086 |
+| Mean PPL(Q)-PPL(base) | 0.002517 ± 0.000536 |
+| Mean KLD | 0.00002515 ± 0.00000020 |
+| Maximum KLD | 0.012206 |
+| 99.9% KLD | 0.000799 |
+| 99.0% KLD | 0.000222 |
+| 99.0% KLD | 0.000222 |
+| Median KLD | 0.000013 |
+| 10.0% KLD | -0.000002 |
+| 5.0% KLD | -0.000008 |
+| 1.0% KLD | -0.000023 |
+| Minimum KLD | -0.000059 |
+| Mean Δp | -0.0000745 ± 0.0003952 % |
+| Maximum Δp | 4.186% |
+| 99.9% Δp | 1.049% |
+| 99.0% Δp | 0.439% |
+| 95.0% Δp | 0.207% |
+| 90.0% Δp | 0.125% |
+| 75.0% Δp | 0.029% |
+| Median Δp | 0.000% |
+| 25.0% Δp | -0.030% |
+| 10.0% Δp | -0.126% |
+| 5.0% Δp | -0.207% |
+| 1.0% Δp | -0.434% |
+| 0.1% Δp | -1.016% |
+| Minimum Δp | -4.672% |
+| RMS Δp | 0.150 ± 0.001 % |
+| Same top p | 99.739 ± 0.013 % |
 
 ## Old Numbers
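
Note on reading the new BF16 vs. FP16 table: the following is a minimal sketch, not the code used by the `perplexity` example, of how the per-token statistics could be computed from two sets of logits. It assumes the standard definitions: KLD is the Kullback-Leibler divergence of the compared model's distribution from the base distribution, Δp is the change in probability assigned to the correct token, and "Same top p" counts positions where both models rank the same token highest. The helper names `log_softmax` and `token_stats` and the toy logits are hypothetical. The last lines recompute two PPL rows from the reported means as a consistency check.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_stats(base_logits, q_logits, target):
    """Statistics for a single token position (assumed definitions).

    kld      = sum_i p_base[i] * (ln p_base[i] - ln p_q[i])
    delta_p  = p_q[target] - p_base[target]
    same_top = both models give the highest probability to the same token
    """
    lp_base = log_softmax(base_logits)
    lp_q = log_softmax(q_logits)
    kld = sum(math.exp(b) * (b - q) for b, q in zip(lp_base, lp_q))
    delta_p = math.exp(lp_q[target]) - math.exp(lp_base[target])
    same_top = lp_base.index(max(lp_base)) == lp_q.index(max(lp_q))
    return kld, delta_p, same_top


# Toy logits for one position; made-up values, not model output.
base = [2.0, 1.0, 0.1]
quant = [1.9, 1.1, 0.1]
kld, delta_p, same_top = token_stats(base, quant, target=0)
print(f"KLD={kld:.6f}  delta_p={100 * delta_p:.3f}%  same_top={same_top}")

# Consistency check using the mean PPL values reported in the table.
ppl_q, ppl_base = 6.227711, 6.225194
print(f"PPL(Q)-PPL(base)     = {ppl_q - ppl_base:.6f}")
print(f"ln(PPL(Q)/PPL(base)) = {math.log(ppl_q / ppl_base):.6f}")
```

The table's summary rows would then be means and percentiles of these per-token values over the whole evaluation text. An exact KL divergence is never negative, so the small negative KLD percentiles in the table are presumably numerical rounding effects rather than a genuine property of the two models.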