Option to split during conversion #6942
Conversation
I've added support for […]. The counterpoint I can see to doing this is that […]
This is already a good start. Could you add an end-to-end usage example in the summary?
Sure thing (I assume you mean examples of usage and expected outputs). I also plan to rework the implementation by consolidating code into a new `GGUFManager` class.
I'll need to implement for […]. Anyway, […]
Got it - will only implement for […].
You can modify the gguf package in the […].
That's what I've been doing so far; I'll check out the contributing instructions, thanks!
Testing on Mistral 7B Instruct, this branch's […]
Running tests on my side for all […]
Will keep track of tests here as I go. Picking one model from each architecture in […]. It also seems like the current […]
Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.
Noting time to convert baichuan-inc/Baichuan2-7B-Chat:

- New branch, split: […]
- New branch, no split: […]
- master: […]

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.
Merge attempted. Some lines were ambiguous, so @christianazinn should look this over to make sure the intent is still correct.
I'll check in a few hours and fix conflicts.
The new […]
Finals are over and I can return to testing this branch. Will attempt a merge from master shortly, then retest each model; if those pass without concern, this branch is ready for final(ish) review. In retrospect, this should not cause any major increase in memory or time usage over the current implementation, because 1) there isn't a whole lot of added overhead and 2) Python passes everything by reference anyway, as far as I'm aware. Do people actually use temp files when converting to GGUF? That's more or less the only thing that has to be reimplemented. I'd mostly like a review on best practices (for instance, all the file-writing logic has to go in a singular […]).
@teleprint-me Hold off on that for a few days - I've just found out merging […]
@christianazinn No worries. I only merge the master branch in. It's better if you take your time. I'm just thinking ahead. I think this PR is useful and has value. Just do your thing.
I've narrowed the memory leak down to […]. @compilade, could I bother you to have a gander at why every time a tensor is written here, the overall RAM usage increases? It essentially just runs every tensor through an intermediate […]. Frankly, this implementation is a month old at its core, so would it be better to just bin it and restart entirely? It wouldn't be that difficult. @mofosyne, I'd like your opinion too.
@christianazinn Thanks for letting me know about this. Before I dig deeper into it, my hypothesis is that something holds onto references to tensors that were made "eager". If there's a list of all tensors, even if they are lazy at first, then when they are written to a file (one of the things that can make them eager), the old references to the LazyNumpyTensors now also point to the calculated data in their […]
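To make that failure mode concrete, here is a minimal, self-contained sketch of the retention pattern described above; `LazyTensor` and `to_eager` are illustrative stand-ins, not the actual gguf-py classes:

```python
import numpy as np

class LazyTensor:
    """Illustrative stand-in for a lazily evaluated tensor."""
    def __init__(self, shape):
        self.shape = shape
        self.data = None  # nothing is allocated while the tensor stays lazy

    def to_eager(self):
        # Materializing allocates the full array and caches it on self.
        if self.data is None:
            self.data = np.zeros(self.shape, dtype=np.float32)
        return self.data

tensors = [LazyTensor((4096, 4096)) for _ in range(8)]  # cheap while lazy

for t in tensors:
    buf = t.to_eager()  # ~64 MiB materialized here
    # ... write buf to the output file ...
    # `tensors` still references t, and t.data still references buf, so
    # every materialized array stays reachable; RAM grows with each write
    # instead of being freed once the tensor is on disk.
```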
Thanks @compilade, and you seem to be right - after making sure every reference is deleted when no longer used, the RAM usage normalizes. Pushing a fix that, among other things, adds an explicit […]. Also removing all changes to […].
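Continuing the hypothetical sketch above, the fix amounts to dropping every reference to the materialized data as soon as the write completes, so the garbage collector can reclaim it:

```python
for i, t in enumerate(tensors):
    buf = t.to_eager()
    # ... write buf to the output file ...
    del buf            # drop the local reference
    t.data = None      # drop the eager data cached on the tensor
    tensors[i] = None  # drop the list's reference once it's no longer needed
# Peak RAM now stays near the size of one tensor instead of their sum.
```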
Fixed a memory leak caused by unexpected reference retention of eager tensors. Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.
Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.
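A rough sketch of what that refactor looks like; the constructor signature and shard contents here are invented for illustration, not the PR's actual code:

```python
from collections import deque

class SplitStrategy(deque):
    """A deque whose elements are shards (lists of tensors)."""
    def __init__(self, tensors, max_tensors_per_shard):
        super().__init__()
        # Partition the tensors into shards of at most max_tensors_per_shard.
        for i in range(0, len(tensors), max_tensors_per_shard):
            self.append(tensors[i:i + max_tensors_per_shard])

shards = SplitStrategy(list(range(100)), 20)
print(len(shards))              # 5
first_shard = shards.popleft()  # deque operations work on the strategy itself
```

Subclassing deque removes a layer of indirection: callers iterate and pop shards off the strategy object directly instead of reaching into a `data` field.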
This PR introduces additional options to `convert.py` that allow users to split a model into shards while converting, rather than having to do it after conversion, including a default small first shard as outlined in #6463. Other functionality we ought to have includes `--split-max-size` (so far it's just `--split-max-tensors`), displaying estimated shard sizes, dry running, and adding sharding for the other `convert-*-to-*.py` scripts. This will be considered a draft until those are worked out. It also needs considerable testing, but luckily, since this deals with the Python scripts, it can be tested easily.

Usage

(examples use zephyr-smol_llama-100m-sft-full)

Example, `--split-max-size`:

```
python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M
```

Output: equal to what's printed to stdout from `master`, then […]

With `--split-max-size 200M` (or any number greater than the total resultant size), it gives: […]

Example, `--split-max-tensors` with `--dry-run` and `--large-first-shard`:

```
python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard
```

Output: equal to what's printed to stdout from `master`, then […]

With `--split-max-tensors 64` (or any number greater than the total tensor count), it gives: […]
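For readers skimming the options above, here is a hypothetical sketch of the partitioning behavior; `plan_shards`, its signature, and the empty-list representation of the metadata-only shard are assumptions for illustration, not the PR's actual code:

```python
def plan_shards(tensor_names, max_tensors, large_first_shard=False):
    # By default the first shard carries no tensor data (the small first
    # shard outlined in #6463); --large-first-shard would disable that.
    shards = [] if large_first_shard else [[]]
    for i in range(0, len(tensor_names), max_tensors):
        shards.append(tensor_names[i:i + max_tensors])
    return shards

names = [f"blk.{i}.attn_q.weight" for i in range(45)]
plan = plan_shards(names, 20)
for n, shard in enumerate(plan, start=1):
    print(f"shard {n} of {len(plan)}: {len(shard)} tensors")
# shard 1 of 4: 0 tensors   (metadata-only first shard)
# shard 2 of 4: 20 tensors
# shard 3 of 4: 20 tensors
# shard 4 of 4: 5 tensors
```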
References

- gguf-split
- add a default option to not include tensors data in first shard #6463