
convert-hf : support bfloat16 conversion #7158

Merged
merged 7 commits into master on May 11, 2024

Conversation

@compilade (Collaborator) commented May 9, 2024

As a follow-up to #7075 and #6412, this introduces proper lazy bfloat16 conversion in convert-hf-to-gguf.py with Numpy.

Numpy does not yet support bfloat16, but this is still possible.

This implementation, like the one in ggml-impl.h, makes nan quiet, flushes subnormals to zero, and rounds to nearest even value.
This means bf16 tensor data made with convert-hf-to-gguf.py should match exactly what ./quantize produces from f32 models.

Summary of changes

  • Unify how lazy tensors work for PyTorch and Numpy
    • I've added gguf-py/gguf/lazy.py for this, which defines the LazyMeta metaclass and the LazyBase base class which is used by both LazyNumpyTensor and LazyTorchTensor.
      • LazyTorchTensor is still defined in convert-hf-to-gguf.py to avoid torch dependency in gguf-py.
    • Lazy Numpy tensors can now support arbitrary expression splits, where one tensor is used in more than one calculation.
    • No more risk of deep recursion: evaluation walks a deque of lazy tensors per expression graph (see the sketch after this list)
  • bfloat16 conversion support
    • Add LlamaFileType in gguf-py/gguf/constants.py to get the correct ftype values as in llama_ftype from llama.h.
      • Not called GGMLFileType because it's probably best to reserve this name for an enum analogous to ggml_ftype
      • Still open to name suggestions :)
  • --outtype auto to choose the highest-fidelity 16-bit floating point type according to the type of the first loaded tensor.
    • Uses f16 if the first tensor has dtype torch.float16, and uses bf16 otherwise, so that torch.float32 and torch.bfloat16 tensors keep their range.
  • --outfile name templating
    • allows running python3 convert-hf-to-gguf.py --outfile path/to/llama-3-8b-instruct-{ftype}.gguf --outtype auto ./path/to/Meta-Llama-3-8B-Instruct/ and still getting the automatically-chosen output type in the name.
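
For illustration, here is a hypothetical, heavily simplified model of the deque-based lazy evaluation mentioned in the summary above (the real LazyBase / LazyNumpyTensor in gguf-py/gguf/lazy.py is more involved); it only shows how deferred operations can be recorded and then evaluated iteratively with a collections.deque instead of recursion:

    from collections import deque

    class LazySketch:
        # Hypothetical stand-in for a lazy tensor: each node records the
        # function and arguments that produce it; nothing runs until
        # evaluate() walks the expression graph with a deque.
        def __init__(self, func=None, args=(), data=None):
            self.func, self.args, self.data = func, args, data

        @classmethod
        def wrap(cls, data):
            return cls(data=data)

        def then(self, func, *others):
            # Record a deferred operation instead of computing it now.
            return LazySketch(func=func, args=(self, *others))

        def evaluate(self):
            todo = deque([self])
            while todo:
                node = todo[-1]
                if node.data is not None:
                    todo.pop()
                    continue
                pending = [a for a in node.args if a.data is None]
                if pending:
                    # Defer this node until its inputs are materialized.
                    todo.extend(pending)
                else:
                    node.data = node.func(*(a.data for a in node.args))
                    todo.pop()
            return self.data

    a = LazySketch.wrap(2.0)
    b = a.then(lambda x: x * 3)
    c = b.then(lambda x, y: x + y, a)  # `a` feeds two expressions (a "split")
    print(c.evaluate())                # 8.0, computed without recursion

Shared inputs are materialized once and cached on the node, which is what lets one tensor feed several expressions without recomputation or deep call stacks.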

Testing

Note

The checksum of a model converted with

$ python3 convert-hf-to-gguf.py --outfile ./models/ggml-model-bf16.gguf --outtype bf16 ./path/to/model_dir/

and one converted with

$ python3 convert-hf-to-gguf.py --outfile ./models/ggml-model-f32.gguf --outtype f32 ./path/to/model_dir/
$ ./build/bin/quantize ./models/ggml-model-f32.gguf ./models/ggml-model-bf16.gguf bf16

SHOULD EXACTLY MATCH (as of 95930da)

(relevant for at least @jart, @teleprint-me, @bartowski1182)
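
Not part of the PR, but for anyone reproducing the check above, a minimal Python helper (hypothetical, any file names) to compare the two outputs by SHA-256:

    import hashlib
    import sys

    def sha256(path: str) -> str:
        # Stream the file in 1 MiB chunks so large GGUFs don't need to fit in RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    a, b = sys.argv[1], sys.argv[2]
    print("MATCH" if sha256(a) == sha256(b) else "DIFFER")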

@ggerganov (Owner)

Can try to add a small test to ci/run.sh that exercises the conversion. The test can download a small Mamba model and run a short main for example. It should run only on nodes that have GG_BUILD_EXTRA_TESTS_0 defined since the conversion is not hardware-dependent, so no need to run it on all the nodes in the ggml-ci fleet

@mofosyne added labels on May 9, 2024: performance (Speed related topics); review complexity : medium (Generally require more time to grok but manageable by beginner to medium expertise level)
@bartowski1182 (Contributor)

Based on your description, my assumption is that if the original weights are in bf16, you should do convert with outtype bf16 and then everything will just work a bit better. Otherwise no other changes to pipeline (from an end-user perspective) are needed. Is that correct? Or is there even some kind of auto-detection?

@compilade (Collaborator, Author)

Based on your description, my assumption is that if the original weights are in bf16, you should do convert with outtype bf16 and then everything will just work a bit better. Otherwise no other changes to pipeline (from an end-user perspective) are needed. Is that correct? Or is there even some kind of auto-detection?

Yes this is correct. Using --outtype bf16 should be enough. There is no auto-detection for now.

I just now found a way to make bit-exact identical bf16 model outputs from convert-hf-to-gguf.py compared to ./quantize from an f32 model. I'll push it soon (< 1hr). This will make testing for correctness easier (matching checksums should then be sufficient).

@@ -2417,8 +2372,8 @@ def parse_args() -> argparse.Namespace:
         help="path to write to; default: based on input",
     )
     parser.add_argument(
-        "--outtype", type=str, choices=["f32", "f16"], default="f16",
-        help="output format - use f32 for float32, f16 for float16",
+        "--outtype", type=str, choices=["f32", "f16", "bf16"], default="f16",
Collaborator

Given most models come in bf16, wouldn't it make sense to set it as the default?

@compilade (Collaborator, Author) commented May 9, 2024

The conversion to bf16 is slightly slower and uses a bit more RAM than f16 conversion, due to the lack of native Numpy support, so I didn't change the default.

I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as default if it's widely used.

My concern with bf16 as default is that f16 -> bf16 is more lossy than bf16 -> f16, since 3 bits of the mantissa are always lost in f16 -> bf16, while bf16 -> f16 only turns some very-close-to-zero values into zero, and big values get turned to inf (but such big values are usually not in model weights, see #7075 (comment)).
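
A quick NumPy illustration of that asymmetry (an editorial sketch, not from the PR; the bf16 halves are described in comments since NumPy has no native bf16 type):

    import numpy as np

    # f16 keeps 10 mantissa bits; bf16 keeps only 7 (with f32's 8-bit exponent).
    x = np.float32(1.0009765625)            # 1 + 2**-10, exactly representable in f16
    print(np.float32(np.float16(x)) == x)   # True: f16 round-trips it
    # bf16 would round it to 1.0, since 2**-10 is below bf16's mantissa step at 1.0
    # (2**-7); these are the 3 always-lost mantissa bits mentioned above.

    # In the other direction, bf16 values are only hurt at the extremes of f16's range:
    y = np.float32(1e-10)                   # representable in bf16, below f16's range
    print(np.float16(y))                    # 0.0
    z = np.float32(1e5)                     # representable in bf16, above f16's max (~65504)
    print(np.float16(z))                    # inf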

@ggerganov (Owner)

We only have rudimentary CPU-only bf16 support in ggml, so f16 is better for now

Contributor

(left comment on wrong account)

I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as default if it's widely used.

@compilade could we not attempt to read from config.json? it should have a torch_dtype in it

@ggerganov, when you say CPU-only, I assume you're referring to inference, since all conversion and quantization is currently CPU-only?

@compilade (Collaborator, Author) commented May 9, 2024

could we not attempt to read from config.json? it should have a torch_dtype in it

@bartowski1182 yes, but not all models define that field, so I think a second guess based on the type of the first tensor in the model will sometimes be necessary.

⚠️ And also some models say one thing in config.json and use another type in the model files. For example, https://huggingface.co/pansophic/rocket-3B has F16 tensors, but defines torch_dtype as bfloat16 in config.json

Would you still like some kind of --outtype auto-f16 based on the model content even if f16 is kept as the default --outtype otherwise? (due to (slightly) faster conversion, and more complete backend support)

@compilade (Collaborator, Author) commented May 10, 2024

Can you explain why bf16 is only used for CPU? Will there be GPU support in the future?

@htdung167 This PR is only about conversion, and the convert script always has been CPU-only.
Inference (text generation) with bf16 was worked on by @jart in #6412. A relevant comment from there regarding future GPU support would be #6412 (comment).

I think '--outtype auto' might be fine, since at its core that's what it's doing

@bartowski1182 I agree, I've thought about this more, and I'll change this to auto instead of auto-f16. I don't think there will be auto-anything-else anyway.

is it possible to figure out the auto-chosen version earlier and use it for naming the outfile?

Yes, this is already possible by checking the logs. It would also be possible to do automatically, but not everyone has the same naming conventions, so maybe the right way to do this would be with a .format() pattern? For example, --outfile llama-3-8b-instruct-{ftype}.gguf, --outfile llama-3-8b-instruct-{outtype}.gguf, or --outfile llama-3-8b-instruct-{}.gguf. Not sure which to support (all?), but it should be clearer than %s. It would also be possible to allow using {FTYPE} or {OUTTYPE} for upper-cased type names.

I see you've used fp16 and fp32 in the past, but this will use f16 and f32, respectively, for these type names.

(EDIT: this is now implemented in e0af2df)
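
A minimal sketch of how such an --outfile template could be expanded (the helper name and the exact placeholder set are illustrative, not necessarily what e0af2df implements):

    def fill_templated_filename(filename: str, output_type: str) -> str:
        # Substitute {}, {ftype}/{outtype} and their upper-case variants
        # with the chosen output type name.
        lower = output_type.lower()
        upper = output_type.upper()
        return filename.format(lower,
                               ftype=lower, outtype=lower,
                               FTYPE=upper, OUTTYPE=upper)

    # fill_templated_filename("llama-3-8b-instruct-{ftype}.gguf", "bf16")
    # -> "llama-3-8b-instruct-bf16.gguf"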

Contributor

Yeah, I've made the transition to f16/f32; naming them 'fp' was an oversight on my part.

A format option would be amazing.

@compilade (Collaborator, Author) commented May 10, 2024

Should auto simply try to be as lossless as possible? Like, if the model is originally in f32, make the output f32? Or should it always select a 16-bit type? (currently bf16 is selected for f32 models)

Contributor

My vote would be on compressing even if it was originally f32; if the person converting wants f32 they'll specify it, otherwise presumably they're converting with the intention of quantizing, where it won't matter.

Contributor

"Do no harm" should be the default.

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32
A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.
# same as ggml_compute_fp32_to_bf16 in ggml-impl.h
def np_fp32_to_bf16(n: np.ndarray):
    # force nan to quiet
    n = np.where((n & 0x7fffffff) > 0x7f800000, (n & 0xffff0000) | (64 << 16), n)
Contributor

Everything looks good. This is the only line that bugs me. Not sure about the (64 << 16) in the bitwise OR. I'm knee-deep in a dataset, so my mental state is not 100% there right now. Could be nothing.

@compilade (Collaborator, Author)

It's doing the equivalent of these lines in ggml-impl.h:

llama.cpp/ggml-impl.h

Lines 82 to 85 in befddd0

if ((u.i & 0x7fffffff) > 0x7f800000) { /* nan */
    h.bits = (u.i >> 16) | 64; /* force to quiet */
    return h;
}

(64 << 16) came from adapting this while postponing the right shift until after rounding.
(n & 0xffff0000) is there to avoid rounding up the NaNs, by setting the low bits to zero.

It's a reflex from when programming in C/C++, I guess.
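
Putting the pieces of this thread together, here is a self-contained NumPy sketch of the same semantics (quiet NaNs, flush subnormals to signed zero, round to nearest even, shift last). It is an illustration under the assumptions above, not the PR's actual helper:

    import numpy as np

    def fp32_to_bf16_bits(arr: np.ndarray) -> np.ndarray:
        # Return bf16 bit patterns as uint16 for a float32 array.
        n = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
        # NaN: keep the top 16 bits and set the quiet bit (equivalent to
        # `(u.i >> 16) | 64` once the final shift is applied).
        nan_bits = (n & np.uint32(0xffff0000)) | np.uint32(64 << 16)
        is_nan = (n & np.uint32(0x7fffffff)) > np.uint32(0x7f800000)
        # Subnormal (all exponent bits zero): flush to signed zero.
        sub_bits = n & np.uint32(0x80000000)
        is_sub = (n & np.uint32(0x7f800000)) == 0
        # Everything else: add half a bf16 ulp, with ties going to even.
        rounded = n + (np.uint32(0x7fff) + ((n >> 16) & np.uint32(1)))
        bits = np.where(is_nan, nan_bits, np.where(is_sub, sub_bits, rounded))
        return (bits >> 16).astype(np.uint16)

    x = np.array([1.0, 3.140625, float("nan")], dtype=np.float32)
    print([hex(v) for v in fp32_to_bf16_bits(x)])
    # typically ['0x3f80', '0x4049', '0x7fc0']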
* convert-hf : rename --outtype auto-f16 to --outtype auto
@bartowski1182 (Contributor)

This is looking really good. What's the next steps to get this merged? I can do some testing if that's what is needed

@compilade (Collaborator, Author) commented May 11, 2024

What's the next steps to get this merged?

Honestly, it's pretty much ready. But the newly-added --outfile templates may need to be reviewed.
From my own manual tests the bf16 conversion works quite well and outputs exactly the same files (with matching checksums) as ./quantize produces from an f32 model.
If you can find a counter-example I'd be very curious about that.

--outtype auto seems to behave well too, automatically choosing either f16 or bf16. The only thing that could still be changed is the default outtype when --outtype is not specified on the command line, but I've left it to f16 for now because it's supported by more backends in ggml (see #7158 (comment)).

Conversion performance (speed-wise) for bf16 isn't that good (around 40-60 MB/s on my machine), but this is inherent to the lack of native support in Numpy. It can be improved later; at least it is usable and produces the correct bits. I have some ideas for making it faster, even before trying to multi-thread it, like using something other than np.vectorize. I also have some performance improvements ready (bringing bf16 conversion to 104 MB/s), but they are tied to other changes that add Q8_0 support to convert-hf-to-gguf.py (giving identical results to the reference implementation in ggml-quants.c, unlike the current Q8_0 conversion in convert.py), which will be more appropriate in another PR.

If there is no objection, I would like to merge this at 2024-05-11 15:00 UTC.

@bartowski1182 (Contributor)

I assume any slowdown in converting to bf16 is made up for by the speed of quantizing bf16 instead of f32.

Actually, on that subject: since we can't run inference on bf16 with a GPU, can we make an imatrix for bf16 with a GPU?

Looking forward to it! I may pull it to try out in the interim.

@jart (Contributor) commented May 11, 2024

Actually on that subject, since we can't inference bf16 with GPU

Yet.

@bartowski1182 (Contributor)

@jart love the hint, haha. Yeah, I figured it's coming, but in the meantime I'm curious how it works: is GPU inference support required for imatrix on GPU?

@compilade merged commit 5a41992 into master on May 11, 2024
25 checks passed