
Script to convert Grok-1 weights from raw JAX pickle files. #7058

Open

wants to merge 13 commits into master
Conversation

@heiner commented May 3, 2024

This adds a script to convert the raw weights in the pickle files to GGUF format. This allows using @arki05's work in #6204 directly from the Grok-1 torrent.

Code is based on @foldl's conversion script in chatllm.cpp, which in turn is based on @chu-tianxiang's gist.

Main ideas to avoid excessive memory:

  • Parse pickle files using mmap.
  • Use PyTorch "meta" tensors to simulate shape and dtype results without having to do all conversions beforehand.

Note that I couldn't run the full model due to RAM constraints, so it's possible I mixed up some tensor names.
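For illustration, the meta-tensor idea looks roughly like this (a minimal sketch with stand-in shapes and names, not the script's actual code):

```python
# Hypothetical sketch of the "meta" tensor trick: plan the shape/dtype of a
# converted weight without allocating any memory for it.
import torch

# Stand-in for one merged Grok-1 expert weight; on the "meta" device no data is stored.
w = torch.empty(8, 32768, 6144, dtype=torch.bfloat16, device="meta")

# Simulate the conversion steps; shapes and dtypes propagate, data does not.
out = w.to(torch.float32).transpose(1, 2)
print(out.shape, out.dtype)  # torch.Size([8, 6144, 32768]) torch.float32
```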

@slaren (Collaborator) commented May 3, 2024

Does this merge the experts into a single tensor?

@heiner (Author) commented May 3, 2024

Does this merge the experts into a single tensor?

It does the opposite -- in the raw data, the 8 experts are part of the same tensor. This splits them, which is also what the chatllm.cpp script does.

If there is a way to keep them within one tensor I'm happy to make that change.

@slaren (Collaborator) commented May 3, 2024

The preferred way to export the expert tensors is as a single 3D tensor for all the experts. It is still possible to use one tensor per expert for backwards compatibility, but it forces the model weights to be copied to a buffer while loading, rather than using them directly from the memory mapped file. For large models like grok, I think it is especially important to be able to avoid this copy and use mmap.
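A rough sketch of what the merged layout looks like (toy shapes, not the script's actual code; the real Grok-1 FFN weights are 32768 x 6144 per expert):

```python
# Hypothetical sketch: pack per-expert 2D weights into one 3D tensor
# ([n_expert, rows, cols]) so llama.cpp can mmap it directly.
import numpy as np

n_expert, rows, cols = 8, 4, 6  # toy sizes for illustration
experts = [np.zeros((rows, cols), dtype=np.float32) for _ in range(n_expert)]  # stand-in data
ffn_gate_exps = np.stack(experts, axis=0)
print(ffn_gate_exps.shape)  # (8, 4, 6)
```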

@heiner (Author) commented May 3, 2024

Understood. That will actually make the script simpler. Would you happen to know the tensor names I should use in this case? Currently, with splitting, they are:

| blk.{layer}.ffn_gate.{expert}.weight        | torch.Size([32768, 6144])  | Q4_0    |
| blk.{layer}.ffn_down.{expert}.weight        | torch.Size([6144, 32768])  | Q4_0    |
| blk.{layer}.ffn_up.{expert}.weight          | torch.Size([32768, 6144])  | Q4_0    |

@slaren (Collaborator) commented May 3, 2024

The tensor names are defined in gguf-py:

MODEL_TENSOR.FFN_GATE_EXP: "blk.{bid}.ffn_gate_exps",
MODEL_TENSOR.FFN_DOWN_EXP: "blk.{bid}.ffn_down_exps",
MODEL_TENSOR.FFN_UP_EXP: "blk.{bid}.ffn_up_exps",

It would be good to use these constants rather than hardcoding the names.
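For illustration, looking the names up could go roughly like this (assumed usage; the exact key construction in the script may differ):

```python
# Hypothetical sketch: derive the merged-expert tensor names from gguf-py's
# constants instead of hardcoding the strings.
import gguf

bid = 0  # block (layer) index
name = gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.FFN_GATE_EXP].format(bid=bid)
print(name + ".weight")  # blk.0.ffn_gate_exps.weight
```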

heiner added a commit to heiner/llama.cpp that referenced this pull request May 3, 2024
As per ggerganov#7058 (comment).
This helps avoid a memcopy when running.
@heiner (Author) commented May 3, 2024

Thanks!

I have updated the branch to no longer split the MoE weights into separate tensors. That simplifies the script, as it's now one weight per file. The original script permuted the order in which these weights are written for some reason; I stopped doing that, so there is now only one list of weight names.

I also moved to the values in the gguf.TENSOR_NAMES dict, as per your suggestion. I'm not sure that's a clear improvement ("Explicit is better than implicit."), especially in view of code like name.endswith("attn_k") and name.endswith("_exps"), but it's also not much worse.

PTAL.

@foldl (Contributor) commented May 3, 2024

@heiner, the name of my project is ChatLLM.cpp, not ChatLLM.ccp 😄

@mofosyne added the "python" (python script changes) label May 9, 2024
ggerganov previously approved these changes May 9, 2024

@ggerganov (Owner) left a comment

We can merge after lint fixes

@mofosyne added the "review complexity: medium" (generally requires more time to grok but manageable by beginner-to-medium expertise level) label May 9, 2024
@ggerganov (Owner) commented:

Hm, I tested Q4_0 conversion and it does not seem to work:

python3 convert_grok.py -i ~/Data/huggingface/grok-1/ckpt/ --vocab_dir ../grok-1 -o x.gguf -t q4_0

make -j && ./main -m x.gguf -p "I believe the meaning of life is" -ngl 0

...

[BOS] I believe the meaning of life is it:000000000000000 of0000000000000000000000000000000000000000000000

Might need more work

@ggerganov marked this pull request as draft May 9, 2024 14:03
@ggerganov dismissed their stale review May 9, 2024 14:03

Did not work on my machine

@heiner (Author) commented May 9, 2024

[BOS] I believe the meaning of life is it:000000000000000 of0000000000000000000000000000000000000000000000

My apologies. As I said above, I couldn't actually test running the full model on my setup. I will address @foldl's suggestions.

Would you happen to have something like the SHA-1 of each tensor of a checkpoint based on the HF weights? Otherwise, I can download those and run that conversion for comparison.

@mofosyne added the "enhancement" (new feature or request) label May 9, 2024
@heiner (Author) commented May 9, 2024

Thanks @foldl for the hints. It's quite possible I mixed something else up as well, e.g. swapped two tensors with the same shape and dtype. Would you happen to have a tensor-name-to-hash table for a correct conversion?

@foldl (Contributor) commented May 9, 2024

@heiner You need to compare the result against #6204, although I don't think the SHA-1s would match.

chatllm.cpp permutes the k_proj/q_proj weights, so the SHA-1s would not match either.

@heiner (Author) commented May 10, 2024

Thanks. I have removed the multiplication by embedding_multiplier_scale again and converted the full model with all 8 experts. The output is not great, but also not as bad as before (gist with full output):

./build/bin/main -m grok.bin -p "I believe the meaning of life is" -s 2 -n 10 -ngl 0
(...)
[BOS] I believe the meaning of life is important it could possibly the general of the general I

It's likely something else is wrong, but I'm unsure what, and the multiple-hour iteration time makes it infeasible to just try things at random.

heiner and others added 7 commits May 22, 2024 14:26
As per ggerganov#7058 (comment).
This helps avoid a memcopy when running.
This saves weights in the order in which they are in the Grok-1 files.
Since we operate weight-by-weight now, we no longer need caches and
name2key translations.

Per reviewer request, I also moved to using keys in gguf.TENSOR_NAMES.
@heiner (Author) commented May 23, 2024

I added two more fixes.

I then compared the output of this PR with Arki05/Grok-1-GGUF/Q8_0 from HF via this script.

All tensors are exactly the same now.

(I have to np.stack the expert tensors from this download on axis 0, as they are split there.)
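The comparison could be done with something along these lines (a hedged sketch with assumed file names, not the linked script):

```python
# Hypothetical sketch: compare tensors between two GGUF files by name.
# Split per-expert tensors would first need to be np.stack'ed on axis 0.
import numpy as np
from gguf import GGUFReader

a = GGUFReader("grok-this-pr.gguf")       # assumed file names
b = GGUFReader("grok-arki05-q8_0.gguf")

tensors_b = {t.name: t.data for t in b.tensors}
for t in a.tensors:
    if t.name in tensors_b:
        same = np.array_equal(t.data, tensors_b[t.name])
        print(f"{t.name}: {'OK' if same else 'MISMATCH'}")
```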

The changes:

  1. The Grok-1 pickle files use lexicographic order (0 < 1 < 10 < 11 ... < 2 < 20 < ...), so the layers were incorrectly ordered (see the sketches after this list).

  2. PyTorch's rounding mode is not away-from-zero in halfway cases, unlike roundf(3). This made for a difference of 1 in ~0.1% of the int8 entries compared to Arki05/Grok-1-GGUF/Q8_0. Using the new gguf.quantize_q8_0 fixes this for Q8_0 (at the cost of increased conversion time).
    Edit: Thanks to @compilade, I have now added a PyTorch version of quantize_q8_0 (PyTorch is useful here since it lets us figure out shapes and dtypes via meta tensors).
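A small illustration of the ordering issue (hypothetical names, not the actual checkpoint file names):

```python
# Hypothetical illustration: lexicographic vs. numeric ordering of layer indices.
layers = [f"layer_{i}" for i in (0, 1, 2, 10, 11, 20)]
print(sorted(layers))
# ['layer_0', 'layer_1', 'layer_10', 'layer_11', 'layer_2', 'layer_20']  <- wrong order
print(sorted(layers, key=lambda name: int(name.split("_")[1])))
# ['layer_0', 'layer_1', 'layer_2', 'layer_10', 'layer_11', 'layer_20']
```

And a quick demonstration of the rounding difference on halfway cases (not the PR's quantization code):

```python
# Hypothetical illustration: PyTorch (and NumPy) round halves to even,
# while C's roundf(3) rounds halves away from zero.
import torch

x = torch.tensor([0.5, 1.5, 2.5, -2.5])
print(torch.round(x))  # tensor([ 0.,  2.,  2., -2.])  (half to even)
# roundf(3) would give 1, 2, 3, -3 (away from zero), hence the ~0.1% of
# int8 entries that differed by 1 before switching to gguf.quantize_q8_0.
```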

Unfortunately, I cannot run the Arki05/Grok-1-GGUF/Q8_0 weights on my MacBook as it OOMs. I can run a two-expert version of this PR (very slowly, several minutes per token), but the output is not great:

$ ./build/bin/main -m grok.bin -p "The answer to life the universe and everything is" -s 1 -n 4 -ngl 2
...
[BOS] The answer to life the universe and everything is gifted for the of

Could someone with the right hardware run Arki05/Grok-1-GGUF/Q8_0 and see if it's any better? If it is, perhaps I missed some header setting (I didn't see any difference that seemed relevant). Otherwise, I believe this conversion is as good as the quantization supports?

@heiner marked this pull request as ready for review May 23, 2024 13:42
This is equivalent to gguf.quantize_q8_0 but doesn't round-trip to
Numpy.
@foldl (Contributor) commented May 24, 2024

The Grok-1 pickle files use lexicographic order ...

Nice catch 😄
