gguf-hash: model wide and per tensor hashing using xxhash and sha1 #8048
There is also the option to use existing hash utilities to hash the GGUF data. For example, something like:

    # skip the GGUF header
    dd bs=1 skip=$(gguf-dump --data-offset model.gguf) if=model.gguf | sha256sum

Would that work?
@ggerganov I gave your approach a shot in #8054 (PR to add --data-offset and --data-alignment) and it does work, but your initial suggestion of setting bs=1 and using skip=X was very slow. It turns out you should set bs=X and skip=1 instead:

    $:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha1sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 4.32916 s, 527 MB/s
32ea6e22a0c63beef6ce2ba15471689b8144b39c -
real 0m7.200s
user 0m6.797s
sys 0m1.326s
$:~/Documents/LLMmodel/gguf$ time dd bs=$(~/gitextern/llama.cpp/gguf-py/scripts/gguf-dump.py --data-offset phi-2.Q6_K.gguf) skip=1 if=phi-2.Q6_K.gguf | sha256sum
1264+1 records in
1264+1 records out
2283253760 bytes (2.3 GB, 2.1 GiB) copied, 9.95004 s, 229 MB/s
8b5eea25e2946b05e345dc0e1dea191968bd2ebc6a15cb321085391dc89d9692 -
real 0m13.016s
user 0m12.744s
sys 0m1.509s

I think GG's approach is valid, and it will be faster as long as this assumption holds, so we could use it for internal CI tests (it would be obvious if it breaks due to GGUF file format evolution). However, you may still want to keep this PR if you want to support per-tensor hash checks. I would also like to develop a consistent way to identify GGUF models by their model tensors (even if the KV metadata changes).
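The same skip-the-header hashing can be sketched in Python without shelling out to dd. This is a minimal sketch, assuming you already have the data offset (e.g. from gguf-dump --data-offset); the helper name hash_from_offset is hypothetical, not part of the PR:

```python
import hashlib

def hash_from_offset(path, offset, algo="sha256", chunk_size=1 << 20):
    """Stream a file through a hash starting at `offset` (e.g. the GGUF
    tensor-data offset), equivalent to the dd | sha256sum pipeline above."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            h.update(buf)
    return h.hexdigest()
```

Reading in large chunks mirrors the bs=X trick: the cost of the bs=1 variant was one syscall per byte, which the chunked read avoids.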
I attempted to add SHA-256 to gguf-hash.c, but for some reason it just doesn't want to work, so I abandoned that approach. Anyway, I've added UUIDv5 model ID generation to the C implementation (using uuid.uuid5(uuid.NAMESPACE_URL, 'en.wikipedia.org/wiki/Llama.cpp') --> "ef001206-dadc-5f6d-a15f-3359e577d4e5" as the UUIDv5 namespace) and made sure it matches the Python implementation. This was relatively easy, as I already had SHA-1 working in gguf-hash.c. So now we have a consistent way of generating a UUIDv5 based on the GGUF tensor content, if we choose to use it. Below is how I checked that both implementations generate the same UUIDv5.
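The namespace derivation quoted above can be reproduced directly with Python's uuid module. The sketch below assumes the model UUID is derived by feeding a SHA-1 digest of the tensor data into uuid5 — the exact name string used by the PR's C side is not shown in this thread, so treat model_uuid as illustrative:

```python
import hashlib
import uuid

# Namespace derivation quoted in the comment above.
LLAMA_CPP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL,
                                 'en.wikipedia.org/wiki/Llama.cpp')

def model_uuid(tensor_blobs):
    """Hypothetical sketch: derive a deterministic UUIDv5 for a model from
    its raw tensor bytes, hashed in file order."""
    h = hashlib.sha1()
    for blob in tensor_blobs:
        h.update(blob)
    return uuid.uuid5(LLAMA_CPP_NAMESPACE, h.hexdigest())
```

Because UUIDv5 is itself SHA-1 based, the same tensor content always yields the same model ID regardless of which implementation (C or Python) computes it.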
Anyway, this PR is now considered operational.
Force-pushed from 0dbd834 to 029a963.
Unsure what the issue with the makefile is in the Windows context...
@mofosyne The problem is with https://github.com/ggerganov/llama.cpp/actions/runs/9632516256/job/26565799805?pr=8048#step:7:80
@compilade That's pretty strange... so basically Visual Studio doesn't support all C11 features? These are the checks in xxhash.h:

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) /* >= C11 */
# include <stdalign.h>
# define XXH_ALIGN(n) alignas(n)
#elif defined(__cplusplus) && (__cplusplus >= 201103L) /* >= C++11 */
/* In C++ alignas() is a keyword */
# define XXH_ALIGN(n) alignas(n)
#elif defined(__GNUC__)
# define XXH_ALIGN(n) __attribute__ ((aligned(n)))
#elif defined(_MSC_VER)
# define XXH_ALIGN(n) __declspec(align(n))
#else
# define XXH_ALIGN(n) /* disabled */
#endif

edit: It turns out that Windows C11 support, at least on the windows-2019 GitHub runner (unsure if fixed on a newer runner image), lies about its support for the C11 standard, as explained in google-deepmind/mujoco#862. They had to do a workaround in google-deepmind/mujoco@ac6663f. It would be interesting to see if a newer Windows build works better... should we update the GitHub runner to the latest Windows image? (Pushing a commit to test the idea.)
This is a WIP PR proposing per-tensor hashing and model-wide hashing of a GGUF model.
I previously experimented with making the hashing process independent of quantisation, but that turned out to have too many technical issues, and the use case for such a feature is uncertain.
This PR, on the other hand, focuses only on hashing each tensor as an opaque data area, without attempting to decode its content.
The intended application is in CI flows: instead of storing test output files, you can store just the expected hash output and use it to check for regressions. For this reason I added xxHash, as it is much faster than SHA-1, but I left SHA-1 in because it is more widely supported (e.g. built into Python).
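The CI use case described above can be sketched as a small regression check. This is illustrative only — check_tensor_hashes is a hypothetical helper, and the tensor-name-to-bytes mapping is an assumed input shape, not the PR's actual interface:

```python
import hashlib

def check_tensor_hashes(tensors, expected):
    """CI-style regression check: compare each tensor's SHA-1 digest
    against a stored expected value, returning the names that mismatch.
    `tensors` maps tensor name -> raw bytes; `expected` maps tensor
    name -> previously recorded hex digest."""
    return [name for name, data in sorted(tensors.items())
            if expected.get(name) != hashlib.sha1(data).hexdigest()]
```

In a real CI job the expected digests would be recorded once from a known-good conversion and committed alongside the tests, so only the small digest file needs to be stored, not the model outputs themselves.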
For the Python hash implementation I also added a UUIDv5 generator, which I plan to add to the C side if it makes sense.
The idea is that every model would have a unique UUID based on its content. I would be happy to hear feedback on this, as I plan to include it in the model conversion process.
Note that for the global model-wide hash, I simply hash every tensor in the order it was dumped from the GGUF file, so if the tensor order in the file is swapped, the hash will likely change.
(For this PR, I decided that hashing the KV store is out of scope.)
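The order-dependence of the model-wide hash can be demonstrated with a few lines of Python. This is a minimal sketch of the scheme described above (one SHA-1 fed all tensor bytes in file order), with a hypothetical function name:

```python
import hashlib

def model_hash(tensor_blobs):
    """Model-wide hash sketch: feed every tensor's raw bytes into a single
    SHA-1 in file order. Swapping tensor order changes the digest, because
    SHA-1 over a stream is sensitive to byte order."""
    h = hashlib.sha1()
    for blob in tensor_blobs:
        h.update(blob)
    return h.hexdigest()
```

This is why the PR description notes the hash will likely change if the file's tensor order is swapped: the chained updates are equivalent to hashing the concatenation of all tensor data, and concatenation is order-sensitive.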
Example of SHA-1 output for phi-2.Q6_K.gguf:
Example of xxHash output for phi-2.Q6_K.gguf: