
convert-hf : support bfloat16 conversion #7158

Merged
merged 7 commits into master on May 11, 2024

Conversation

@compilade (Collaborator) commented May 9, 2024

As a follow-up to #7075 and #6412, this introduces proper lazy bfloat16 conversion in convert-hf-to-gguf.py with Numpy.

Numpy does not yet support bfloat16, but this is still possible.

This implementation, like the one in ggml-impl.h, makes nan quiet, flushes subnormals to zero, and rounds to nearest even value.
This means bf16 tensor data made with convert-hf-to-gguf.py should match exactly what ./quantize produces from f32 models.

Summary of changes

  • Unify how lazy tensors work for PyTorch and Numpy
    • I've added gguf-py/gguf/lazy.py for this, which defines the LazyMeta metaclass and the LazyBase base class which is used by both LazyNumpyTensor and LazyTorchTensor.
      • LazyTorchTensor is still defined in convert-hf-to-gguf.py to avoid torch dependency in gguf-py.
    • Lazy Numpy tensors can now support arbitrary expression splits, where one tensor is used in more than one calculation.
    • No more risk of deep recursion: evaluation walks a deque of lazy tensors per expression graph (see the sketch after this list)
  • bfloat16 conversion support
    • Add LlamaFileType in gguf-py/gguf/constants.py to get the correct ftype values as in llama_ftype from llama.h.
      • Not called GGMLFileType because it's probably best to reserve this name for an enum analogous to ggml_ftype
      • Still open to name suggestions :)
  • --outtype auto to choose the highest-fidelity 16-bit floating point type according to the type of the first loaded tensor.
    • Uses f16 if the first tensor has dtype torch.float16, and uses bf16 otherwise, so that torch.float32 and torch.bfloat16 tensors keep their range.
  • --outfile name templating
    • allows running python3 convert-hf-to-gguf.py --outfile path/to/llama-3-8b-instruct-{ftype}.gguf --outtype auto ./path/to/Meta-Llama-3-8B-Instruct/ and still getting the automatically-chosen output type in the name.
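
For illustration, here is a hypothetical, heavily simplified model of the deque-based lazy evaluation mentioned in the summary above (the real LazyBase / LazyNumpyTensor in gguf-py/gguf/lazy.py is more involved); it only shows how deferred operations can be recorded and then evaluated iteratively with a collections.deque instead of recursion:

    from collections import deque

    class LazySketch:
        # Hypothetical stand-in for a lazy tensor: each node records the
        # function and arguments that produce it; nothing runs until
        # evaluate() walks the expression graph with a deque.
        def __init__(self, func=None, args=(), data=None):
            self.func, self.args, self.data = func, args, data

        @classmethod
        def wrap(cls, data):
            return cls(data=data)

        def then(self, func, *others):
            # Record a deferred operation instead of computing it now.
            return LazySketch(func=func, args=(self, *others))

        def evaluate(self):
            todo = deque([self])
            while todo:
                node = todo[-1]
                if node.data is not None:
                    todo.pop()
                    continue
                pending = [a for a in node.args if a.data is None]
                if pending:
                    # Defer this node until its inputs are materialized.
                    todo.extend(pending)
                else:
                    node.data = node.func(*(a.data for a in node.args))
                    todo.pop()
            return self.data

    a = LazySketch.wrap(2.0)
    b = a.then(lambda x: x * 3)
    c = b.then(lambda x, y: x + y, a)  # `a` feeds two expressions (a "split")
    print(c.evaluate())                # 8.0, computed without recursion

Shared inputs are materialized once and cached on the node, which is what lets one tensor feed several expressions without recomputation or deep call stacks.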

Testing

Note

The checksum of a model converted with

$ python3 convert-hf-to-gguf.py --outfile ./models/ggml-model-bf16.gguf --outtype bf16 ./path/to/model_dir/

and one converted with

$ python3 convert-hf-to-gguf.py --outfile ./models/ggml-model-f32.gguf --outtype f32 ./path/to/model_dir/
$ ./build/bin/quantize ./models/ggml-model-f32.gguf ./models/ggml-model-bf16.gguf bf16

SHOULD EXACTLY MATCH (as of 95930da)

(relevant for at least @jart, @teleprint-me, @bartowski1182)
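
Not part of the PR, but for anyone reproducing the check above, a minimal Python helper (hypothetical, any file names) to compare the two outputs by SHA-256:

    import hashlib
    import sys

    def sha256(path: str) -> str:
        # Stream the file in 1 MiB chunks so large GGUFs don't need to fit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    a, b = sys.argv[1], sys.argv[2]
    print("MATCH" if sha256(a) == sha256(b) else "DIFFER")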

@ggerganov (Owner)

Can try to add a small test to ci/run.sh that exercises the conversion. The test can download a small Mamba model and run a short main for example. It should run only on nodes that have GG_BUILD_EXTRA_TESTS_0 defined since the conversion is not hardware-dependent, so no need to run it on all the nodes in the ggml-ci fleet

@mofosyne added labels on May 9, 2024: performance (Speed related topics); review complexity : medium (Generally require more time to grok but manageable by beginner to medium expertise level)
@bartowski1182 (Contributor)

Based on your description, my assumption is that if the original weights are in bf16, you should do convert with outtype bf16 and then everything will just work a bit better. Otherwise no other changes to pipeline (from an end-user perspective) are needed. Is that correct? Or is there even some kind of auto-detection?

@compilade (Collaborator, Author)

Based on your description, my assumption is that if the original weights are in bf16, you should do convert with outtype bf16 and then everything will just work a bit better. Otherwise no other changes to pipeline (from an end-user perspective) are needed. Is that correct? Or is there even some kind of auto-detection?

Yes this is correct. Using --outtype bf16 should be enough. There is no auto-detection for now.

I just now found a way to make bit-exact identical bf16 model outputs from convert-hf-to-gguf.py compared to ./quantize from an f32 model. I'll push it soon (< 1hr). This will make testing for correctness easier (matching checksums should then be sufficient).

@@ -2417,8 +2372,8 @@ def parse_args() -> argparse.Namespace:
         help="path to write to; default: based on input",
     )
     parser.add_argument(
-        "--outtype", type=str, choices=["f32", "f16"], default="f16",
-        help="output format - use f32 for float32, f16 for float16",
+        "--outtype", type=str, choices=["f32", "f16", "bf16"], default="f16",
Collaborator

Given most models come in bf16, wouldn't it make sense to set it as the default?

@compilade (Collaborator, Author) commented May 9, 2024

The conversion to bf16 is slightly slower and uses a bit more RAM than f16 conversion, due to the lack of native Numpy support, so I didn't change the default.

I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as default if it's widely used.

My concern with bf16 as default is that f16 -> bf16 is more lossy than bf16 -> f16, since 3 bits of the mantissa are always lost in f16 -> bf16, while bf16 -> f16 only turns some very-close-to-zero values into zero, and big values get turned to inf (but such big values are usually not in model weights, see #7075 (comment)).
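
A quick NumPy illustration of that asymmetry (an editorial sketch, not from the PR; the bf16 halves are described in comments since NumPy has no native bf16 type):

    import numpy as np

    # f16 keeps 10 mantissa bits; bf16 keeps only 7 (with f32's 8-bit exponent).
    x = np.float32(1.0009765625)            # 1 + 2**-10, exactly representable in f16
    print(np.float32(np.float16(x)) == x)   # True: f16 round-trips it
    # bf16 would round it to 1.0, since 2**-10 is below bf16's mantissa step at 1.0
    # (2**-7); these are the 3 always-lost mantissa bits mentioned above.

    # In the other direction, bf16 values are only hurt at the extremes of f16's range:
    y = np.float32(1e-10)                   # representable in bf16, below f16's range
    print(np.float16(y))                    # 0.0
    z = np.float32(1e5)                     # representable in bf16, above f16's max (~65504)
    print(np.float16(z))                    # inf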

@ggerganov (Owner)

We only have rudimentary CPU-only bf16 support in ggml, so f16 is better for now

Contributor

(left comment on wrong account)

I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as default if it's widely used.

@compilade could we not attempt to read from config.json? it should have a torch_dtype in it

@ggerganov, when you say CPU-only, I assume you're referring to inference, since all conversion and quantization is currently CPU-only?

@compilade (Collaborator, Author) commented May 9, 2024

could we not attempt to read from config.json? it should have a torch_dtype in it

@bartowski1182 yes, but not all models define that field, so I think a second guess based on the type of the first tensor in the model will sometimes be necessary.

⚠️ And also some models say one thing in config.json and use another type in the model files. For example, https://huggingface.co/pansophic/rocket-3B has F16 tensors, but defines torch_dtype as bfloat16 in config.json

Would you still like some kind of --outtype auto-f16 based on the model content even if f16 is kept as the default --outtype otherwise? (due to (slightly) faster conversion, and more complete backend support)

@compilade (Collaborator, Author) commented May 10, 2024

Can you explain why bf16 is only used for CPU? Will there be GPU support in the future?

@htdung167 This PR is only about conversion, and the convert script always has been CPU-only.
Inference (text generation) with bf16 was worked on by @jart in #6412. A relevant comment from there regarding future GPU support would be #6412 (comment).

I think '--outtype auto' might be fine, since at its core that's what it's doing

@bartowski1182 I agree, I've thought about this more, and I'll change this to auto instead of auto-f16. I don't think there will be auto-anything-else anyway.

is it possible to figure out the auto-chosen version earlier and use it for naming the outfile?

Yes, this is already possible by checking the logs. It would also be possible to do automatically, but not everyone has the same naming conventions, so maybe the right way to do this would be with a .format() pattern? For example, --outfile llama-3-8b-instruct-{ftype}.gguf, --outfile llama-3-8b-instruct-{outtype}.gguf, or --outfile llama-3-8b-instruct-{}.gguf. Not sure which to support (all?), but it should be clearer than %s. It would also be possible to allow using {FTYPE} or {OUTTYPE} for upper-cased type names.

I see you've used fp16 and fp32 in the past, but this will use f16 and f32, respectively, for these type names.

(EDIT: this is now implemented in e0af2df)
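
A minimal sketch of how such an --outfile template could be expanded (the helper name and the exact placeholder set are illustrative, not necessarily what e0af2df implements):

    def fill_templated_filename(filename: str, output_type: str) -> str:
        # Substitute {}, {ftype}/{outtype} and their upper-case variants
        # with the chosen output type name.
        lower = output_type.lower()
        upper = output_type.upper()
        return filename.format(lower,
                               ftype=lower, outtype=lower,
                               FTYPE=upper, OUTTYPE=upper)

    # fill_templated_filename("llama-3-8b-instruct-{ftype}.gguf", "bf16")
    # -> "llama-3-8b-instruct-bf16.gguf"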

Contributor

Yeah, I've made the transition to f16/f32; naming them 'fp' was an oversight on my part.

A format option would be amazing.

@compilade (Collaborator, Author) commented May 10, 2024

Should auto simply try to be as lossless as possible? Like, if the model is originally in f32, make the output f32? Or should it always select a 16-bit type? (currently bf16 is selected for f32 models)

Contributor

My vote would be on compressing even if it was originally f32; if the person converting wants f32 they'll specify it, otherwise presumably they're converting with the intention of quantizing, where it won't matter.

Contributor

"Do no harm" should be the default.

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32
A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.
# same as ggml_compute_fp32_to_bf16 in ggml-impl.h
def np_fp32_to_bf16(n: np.ndarray):
    # force nan to quiet
    n = np.where((n & 0x7fffffff) > 0x7f800000, (n & 0xffff0000) | (64 << 16), n)
Contributor

Everything looks good. This is the only line that bugs me. Not sure about the (64 << 16) in the bitwise OR. I'm knee-deep in a dataset, so my mental state is not 100% there right now. Could be nothing.

@compilade (Collaborator, Author)

It's doing the equivalent of these lines in ggml-impl.h:

llama.cpp/ggml-impl.h

Lines 82 to 85 in befddd0

if ((u.i & 0x7fffffff) > 0x7f800000) { /* nan */
    h.bits = (u.i >> 16) | 64; /* force to quiet */
    return h;
}

(64 << 16) came from adapting this while postponing the right shift until after rounding.
(n & 0xffff0000) is there to avoid rounding up the NaNs, by setting the low bits to zero.

It's a reflex from when programming in C/C++, I guess.
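
Putting the pieces of this thread together, here is a self-contained NumPy sketch of the same semantics (quiet NaNs, flush subnormals to signed zero, round to nearest even, shift last). It is an illustration under the assumptions above, not the PR's actual helper:

    import numpy as np

    def fp32_to_bf16_bits(arr: np.ndarray) -> np.ndarray:
        # Return bf16 bit patterns as uint16 for a float32 array.
        n = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
        # NaN: keep the top 16 bits and set the quiet bit (equivalent to
        # `(u.i >> 16) | 64` once the final shift is applied).
        nan_bits = (n & np.uint32(0xffff0000)) | np.uint32(64 << 16)
        is_nan = (n & np.uint32(0x7fffffff)) > np.uint32(0x7f800000)
        # Subnormal (all exponent bits zero): flush to signed zero.
        sub_bits = n & np.uint32(0x80000000)
        is_sub = (n & np.uint32(0x7f800000)) == 0
        # Everything else: add half a bf16 ulp, with ties going to even.
        rounded = n + (np.uint32(0x7fff) + ((n >> 16) & np.uint32(1)))
        bits = np.where(is_nan, nan_bits, np.where(is_sub, sub_bits, rounded))
        return (bits >> 16).astype(np.uint16)

    x = np.array([1.0, 3.140625, float("nan")], dtype=np.float32)
    print([hex(v) for v in fp32_to_bf16_bits(x)])
    # typically ['0x3f80', '0x4049', '0x7fc0']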
* convert-hf : rename --outtype auto-f16 to --outtype auto
@bartowski1182 (Contributor)

This is looking really good. What's the next steps to get this merged? I can do some testing if that's what is needed

@compilade (Collaborator, Author) commented May 11, 2024

What's the next steps to get this merged?

Honestly, it's pretty much ready. But the newly-added --outfile templates may need to be reviewed.
From my own manual tests the bf16 conversion works quite well and outputs exactly the same files (with matching checksums) as ./quantize produces from an f32 model.
If you can find a counter-example I'd be very curious about that.

--outtype auto seems to behave well too, automatically choosing either f16 or bf16. The only thing that could still be changed is the default outtype when --outtype is not specified on the command line, but I've left it to f16 for now because it's supported by more backends in ggml (see #7158 (comment)).

Conversion performance (speed-wise) for bf16 isn't that good (around 40-60 MB/s on my machine), but this is inherent to the lack of native support in Numpy. It can be improved later; at least it is usable and produces the correct bits. I have some ideas for making it faster, even before trying to multi-thread it, like using something other than np.vectorize. I also have some performance improvements ready (bringing bf16 conversion to 104 MB/s), but they are tied to other changes that add Q8_0 support to convert-hf-to-gguf.py (giving identical results to the reference implementation in ggml-quants.c, unlike the current Q8_0 conversion in convert.py), which will be more appropriate in another PR.

If there is no objection, I would like to merge this at 2024-05-11 15:00 UTC.

@bartowski1182 (Contributor)

I assume any slowdown in converting to bf16 is made up for by the speed of quantizing bf16 instead of f32.

Actually, on that subject: since we can't run inference on bf16 with a GPU, can we make an imatrix for bf16 with a GPU?

Looking forward to it! I may pull it to try out in the interim.

@jart (Contributor) commented May 11, 2024

Actually on that subject, since we can't inference bf16 with GPU

Yet.

@bartowski1182 (Contributor)

@jart love the hint, haha. Yeah, I figured it's coming, but in the meantime I'm curious how it works: is GPU inference support required for imatrix on GPU?

@compilade merged commit 5a41992 into master on May 11, 2024
25 checks passed