
perplexity: add BF16 vs. FP16 results #7150

Merged

Conversation

JohannesGaessler
Collaborator

@JohannesGaessler commented May 8, 2024

This PR adds perplexity results for BF16 vs. FP16 precision for LLaMA 3. Findings so far:

  • There is no measurable difference in Mean Δp, i.e. the average probability of the model correctly predicting the next token at temperature 1.0. The issue was that the prints only had a precision of 0.0001%; after re-running the calculation with a patched print, Mean Δp is -0.000075 ± 0.000395 %.
  • The change in token probabilities between FP16 and BF16 is ~10x smaller than the change between FP16 and q8_0.
  • The increase in perplexity going from BF16 to FP16 is very small but statistically significant at almost $5 \sigma$ if you neglect systematic uncertainties between BF16 and FP16 (I think at this level of precision they will be extremely difficult to quantify); see the short calculation after the table.
| Metric | Value |
| --- | --- |
| Mean PPL(Q) | 6.227711 ± 0.037833 |
| Mean PPL(base) | 6.225194 ± 0.037771 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.990% |
| Mean ln(PPL(Q)/PPL(base)) | 0.000404 ± 0.000086 |
| Mean PPL(Q)/PPL(base) | 1.000404 ± 0.000086 |
| Mean PPL(Q)-PPL(base) | 0.002517 ± 0.000536 |
| Mean KLD | 0.00002515 ± 0.00000020 |
| Maximum KLD | 0.012206 |
| 99.9% KLD | 0.000799 |
| 99.0% KLD | 0.000222 |
| Median KLD | 0.000013 |
| 10.0% KLD | -0.000002 |
| 5.0% KLD | -0.000008 |
| 1.0% KLD | -0.000023 |
| Minimum KLD | -0.000059 |
| Mean Δp | -0.0000745 ± 0.0003952 % |
| Maximum Δp | 4.186% |
| 99.9% Δp | 1.049% |
| 99.0% Δp | 0.439% |
| 95.0% Δp | 0.207% |
| 90.0% Δp | 0.125% |
| 75.0% Δp | 0.029% |
| Median Δp | 0.000% |
| 25.0% Δp | -0.030% |
| 10.0% Δp | -0.126% |
| 5.0% Δp | -0.207% |
| 1.0% Δp | -0.434% |
| 0.1% Δp | -1.016% |
| Minimum Δp | -4.672% |
| RMS Δp | 0.150 ± 0.001 % |
| Same top p | 99.739 ± 0.013 % |
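
The "almost $5 \sigma$" figure above follows directly from the table: it is the mean log-perplexity ratio divided by its statistical uncertainty. A minimal sketch of that calculation (values copied from the table; the script is illustrative only and not part of the perplexity tool):

```python
# Rough significance of the BF16 -> FP16 perplexity shift, using values from the table.
# Systematic uncertainties are neglected, as noted in the findings above.

mean_ln_ppl_ratio = 0.000404  # Mean ln(PPL(Q)/PPL(base)) from the table
uncertainty = 0.000086        # statistical uncertainty on that mean

sigma = mean_ln_ppl_ratio / uncertainty
print(f"significance: {sigma:.1f} sigma")  # ~4.7, i.e. "almost 5 sigma"
```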

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

Was this tested on a Threadripper? That was @jarts primary motivation for the PR: to leverage the bfloat16 feature set available in the Threadripper CPU.

@JohannesGaessler
Collaborator Author

No, I tested this using an Epyc 7742.

@teleprint-me
Contributor

teleprint-me commented May 8, 2024

> No, I tested this using an Epyc 7742.

I think this is interesting regardless. It's good to know either way.

@JohannesGaessler
Collaborator Author

I re-ran the calculation with patched prints and updated the tables. One way to look at the difference: if you were to generate the entire ~300k tokens of the Wikitext test set with FP16 instead of BF16 at temperature 1.0, you would on average generate 0.2 ± 1.1 more incorrect tokens. So I think it's safe to just use FP16 even if the original weights are BF16.
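
For reference, that figure is just Mean Δp from the table above scaled by the number of evaluated tokens; a minimal sketch of the arithmetic (illustrative only, and the 300k token count is approximate):

```python
# Back-of-the-envelope check of the "0.2 +- 1.1 more incorrect tokens" figure:
# scale Mean Δp (and its uncertainty) from the table by the number of evaluated tokens.

n_tokens = 300_000
mean_dp = -0.0000745 / 100    # Mean Δp, converted from percent to a fraction
mean_dp_err = 0.0003952 / 100

extra_incorrect = -mean_dp * n_tokens   # ~0.22 more incorrect tokens with FP16
uncertainty = mean_dp_err * n_tokens    # ~1.19
print(f"{extra_incorrect:.1f} +- {uncertainty:.1f} tokens")
```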

@JohannesGaessler
Collaborator Author

I consider this PR low-priority to actually merge so I'll keep it open a little longer in case people want to comment on the results or the way they're presented.

@JohannesGaessler
Collaborator Author

@jart, notifying you of this PR in case you want to add context or comment on the results or methodology.
(The mention in a previous post was of an unrelated "jarts".)

@mofosyne added the documentation and review complexity : low labels on May 9, 2024
@jart
Contributor

jart commented May 9, 2024

In order for me to feel comfortable discussing expanding the perplexity tool documentation, I would first want to know how to reproduce the original numbers that were posted in that README. There needs to be specific information on (1) which commands were run, (2) which files were used, (3) which revision of the software, and (4) which microprocessor model it ran on. For example, if I run llama-2 70b instruct on a Threadripper PRO 7995WX using https://cosmo.zip/pub/datasets/wikitext-2-raw/:

[screenshot of perplexity output]

I get 4.1870 for llamafile with bf16 (on my v0.8.2 working area), and 4.6686 for llama.cpp on f98eb31 with f16. Since lower perplexity is supposed to be better, that's not anywhere close to what the readme says llama.cpp did last year, which is 3.4313.

So to answer the question that was asked of me earlier, I'll share an apples-to-apples comparison of yesterday's llama.cpp on znver4 for bf16 vs. fp16, freshly and fully re-quantized. Except I made a mistake. I accidentally ran the same command twice, and surprisingly got different results (4.6692 vs. 4.6696), even though I set temperature to zero.

[screenshot of the two runs with differing results]

Now here are the results for BF16 vs. F16 with yesterday's llama.cpp and Meta LLaMA 3.

[screenshot of BF16 vs. F16 perplexity results]

I think using perplexity is kind of like hiring an assistant based on how well they recite the pledge of allegiance. Using it to measure the difference between F16 and BF16 in llama.cpp would be analogous to builders in the middle ages debating the effectiveness of rain gutters installed on the Duomo in Florence before the roof has been constructed. The only empirical measurement I understand is ULP and we know for certain that BF16 does it better when the weights were given to us as BF16. However BF16 is just one small piece of the puzzle. For example, I just mailed out this yesterday:

I'm sure there are additional opportunities for improvement this project can spot to get the numbers stable enough that llama.cpp is in a good position to use objective metrics to gauge the subtleties of bf16 vs. f16.

What's been guiding my work has been human evaluation on the output of larger models, e.g. Mixtral 8x22b, and I know that things like Kahan summation, BF16, and expf() make it better. I also know that ULP measurements of float32 values and decades of computer science tell me that it's empirically better too. However perplexity does not appear to measure these differences I notice, and in many cases I see it get worse the better the algorithms are.

In any case, I'm confident this project is on the right path. I believe the right decisions are being made and that the right things are valued. I'm looking forward to seeing more of these high-performance higher-quality components being put in place.

@ggerganov
Owner

> Except I made a mistake. I accidentally ran the same command twice, and surprisingly got different results (4.6692 vs. 4.6696), even though I set temperature to zero.

This should never happen - the results from perplexity should be deterministic given the same parameters, backend, etc. (the temperature is irrelevant for this tool because there is no sampling during the ppl computation). If it's not the case, this is a bug and it should be fixed.

@JohannesGaessler
Collaborator Author

> There needs to be specific information on (1) which commands were run, (2) which files were used, (3) which revision of the software, and (4) which microprocessor model it ran on.

Definitely reasonable; I'll add this information.

> I get 4.1870 for llamafile with bf16 (on my v0.8.2 working area), and 4.6686 for llama.cpp on f98eb31 with f16. Since lower perplexity is supposed to be better, that's not anywhere close to what the readme says llama.cpp did last year, which is 3.4313.

Perplexity depends heavily on the dataset, so --chunks 32 strongly changes the result compared to using all 655 chunks.

Also notice that for 32 chunks the statistical uncertainty on the individual values is more than 300 times larger than the difference between FP16 and BF16. Due to the extremely high correlation, the uncertainty on the difference is going to be much smaller, but without knowing the exact value of the correlation it is not possible to make conclusive statements. My recommendation is to run once with --kl-divergence-base and then once more with --kl-divergence. This will calculate the difference in perplexity while taking the covariance into account, so you get a usable uncertainty on the result.
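
To illustrate why the correlation matters so much, here is a minimal sketch of standard error propagation for the difference of two correlated measurements, using the numbers from the table in the opening post (the quoted correlation is for ln(PPL) and is used here as an approximation for PPL itself; this is not the tool's implementation):

```python
# Error propagation for the difference of two highly correlated measurements:
#   sigma_diff^2 = sigma_Q^2 + sigma_base^2 - 2 * rho * sigma_Q * sigma_base

ppl_q, sigma_q = 6.227711, 0.037833        # FP16
ppl_base, sigma_base = 6.225194, 0.037771  # BF16
rho = 0.9999                               # Cor(ln(PPL(Q)), ln(PPL(base)))

diff = ppl_q - ppl_base
sigma_diff = (sigma_q**2 + sigma_base**2 - 2 * rho * sigma_q * sigma_base) ** 0.5
print(f"PPL(Q) - PPL(base) = {diff:.6f} +- {sigma_diff:.6f}")
# ~0.0025 +- ~0.0005, far below the ~0.038 uncertainty on either perplexity alone
```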

> I think using perplexity is kind of like hiring an assistant based on how well they recite the pledge of allegiance. Using it to measure the difference between F16 and BF16 in llama.cpp would be analogous to builders in the middle ages debating the effectiveness of rain gutters installed on the Duomo in Florence before the roof has been constructed. The only empirical measurement I understand is ULP and we know for certain that BF16 does it better when the weights were given to us as BF16. However BF16 is just one small piece of the puzzle.

I definitely agree that perplexity is a suboptimal metric. In this particular case I think it makes more sense to just look at how the token probabilities change. Given the ~300k input tokens of the Wikitext-2 test set, there is no statistically significant evidence that either the FP16 or the BF16 version of LLaMA 3 8b is better at correctly predicting the next token than the other.

I do not know what you mean by ULP.

> What's been guiding my work has been human evaluation on the output of larger models, e.g. Mixtral 8x22b, and I know that things like Kahan summation, BF16, and expf() make it better. I also know that ULP measurements of float32 values and decades of computer science tell me that it's empirically better too. However perplexity does not appear to measure these differences I notice, and in many cases I see it get worse the better the algorithms are.

Did you check the statistical significance of your results? Intuitively I would think the differences between FP16 and BF16 are too small to make collection of enough human data feasible.

> This should never happen - the results from perplexity should be deterministic given the same parameters, backend, etc. (the temperature is irrelevant for this tool because there is no sampling during the ppl computation). If it's not the case, this is a bug and it should be fixed.

Are there instances where threads combine their results in an undefined order? If so, that creates a race condition in terms of floating-point rounding error.
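
As a minimal illustration of why the combine order would matter (plain floating-point behavior, not llama.cpp code):

```python
# Floating-point addition is not associative, so combining the same per-thread partial
# sums in a different order can change the total slightly.
vals = [1e16, 1.0, -1e16, 1.0]

left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # the lone 1.0 is absorbed by 1e16
pairwise = (vals[0] + vals[2]) + (vals[1] + vals[3])       # cancellation happens first

print(left_to_right, pairwise)  # 1.0 2.0
```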

@teleprint-me
Contributor

> I do not know what you mean by ULP.

Unit in the last place?

@mofosyne added the need feedback label on May 10, 2024
@ggerganov
Owner

> Are there instances where threads combine their results in an undefined order? If so, that creates a race condition in terms of floating-point rounding error.

There are no such instances. I wasn't able to reproduce the non-determinism, though.

@JohannesGaessler merged commit 1c570d8 into ggerganov:master on May 13, 2024
21 checks passed
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 17, 2024