Option to split during conversion #6942
Conversation
I've added support for […]. The counterpoint I can see to doing this is that […]
This is already a good start. Could you add an end-to-end usage example in the summary?
Sure thing (I assume you mean examples of usage and expected outputs). I also plan to rework the implementation by consolidating code into a new `GGUFManager` class.
I'll need to implement for […]. Anyway, […]
Got it - will only implement for […].
You can modify the gguf package in the […].
That's what I've been doing so far; I'll check out the contributing instructions, thanks!
Testing on Mistral 7B Instruct, this branch's […]
Running tests on my side for all […]
Will keep track of tests here as I go. Picking one model from each architecture in […]. It also seems like the current […]
Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.
Noting time to convert baichuan-inc/Baichuan2-7B-Chat:

- New branch, split: […]
- New branch, no split: […]
- master: […]

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.
Merge attempted. Some lines were ambiguous, so @christianazinn should look this over to make sure the intent is still correct.
I'll check in a few hours and fix conflicts.
The new […]
Finals are over and I can return to testing this branch. Will attempt a merge from master shortly, then retest each model; if those pass without concern, this branch is ready for final(ish) review. In retrospect, this should not cause any major increase in memory or time usage over the current implementation, because 1) there isn't a whole lot of added overhead and 2) Python passes everything by reference anyway, as far as I'm aware. Do people actually use temp files when converting to GGUF? That's more or less the only thing that has to be reimplemented. I'd mostly like a review on best practices (for instance, all the file-writing logic has to go in a singular […]).
@teleprint-me Hold off on that for a few days - I've just found out merging […]
@christianazinn No worries. I only merge the master branch in. It's better if you take your time. I'm just thinking ahead. I think this PR is useful and has value. Just do your thing.
I've narrowed the memory leak down to […]. @compilade, could I bother you to have a gander at why every time a tensor is written here, the overall RAM usage increases? It essentially just runs every tensor through an intermediate […]. Frankly, this implementation is a month old at its core, so would it be better to just bin it and restart entirely? It wouldn't be that difficult. @mofosyne, I'd like your opinion too.
@christianazinn Thanks for letting me know about this. Before I dig deeper into it, my hypothesis is that something holds onto references to tensors that were made "eager". If there's a list of all tensors, even if they are lazy at first, then when they are written to a file (one of the things that can make them eager), the old references to the LazyNumpyTensors now also point to the calculated data in their […]
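To make that failure mode concrete, here is a minimal, self-contained sketch of the retention pattern described above; `LazyTensor` and `to_eager` are illustrative stand-ins, not the actual gguf-py classes:

```python
import numpy as np

class LazyTensor:
    """Illustrative stand-in for a lazily evaluated tensor."""
    def __init__(self, shape):
        self.shape = shape
        self.data = None  # nothing is allocated while the tensor stays lazy

    def to_eager(self):
        # Materializing allocates the full array and caches it on self.
        if self.data is None:
            self.data = np.zeros(self.shape, dtype=np.float32)
        return self.data

tensors = [LazyTensor((4096, 4096)) for _ in range(8)]  # cheap while lazy

for t in tensors:
    buf = t.to_eager()  # ~64 MiB materialized here
    # ... write buf to the output file ...
    # `tensors` still references t, and t.data still references buf, so
    # every materialized array stays reachable; RAM grows with each write
    # instead of being freed once the tensor is on disk.
```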
Thanks @compilade, and you seem to be right - after making sure every reference is deleted when no longer used, the RAM usage normalizes. Pushing a fix that, among other things, adds an explicit […]. Also removing all changes to […].
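Continuing the hypothetical sketch above, the fix amounts to dropping every reference to the materialized data as soon as the write completes, so the garbage collector can reclaim it:

```python
for i, t in enumerate(tensors):
    buf = t.to_eager()
    # ... write buf to the output file ...
    del buf            # drop the local reference
    t.data = None      # drop the eager data cached on the tensor
    tensors[i] = None  # drop the list's reference once it's no longer needed
# Peak RAM now stays near the size of one tensor instead of their sum.
```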
Fixed a memory leak caused by unexpected reference retention of eager tensors. Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.
Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.
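A rough sketch of what that refactor looks like; the constructor signature and shard contents here are invented for illustration, not the PR's actual code:

```python
from collections import deque

class SplitStrategy(deque):
    """A deque whose elements are shards (lists of tensors)."""
    def __init__(self, tensors, max_tensors_per_shard):
        super().__init__()
        # Partition the tensors into shards of at most max_tensors_per_shard.
        for i in range(0, len(tensors), max_tensors_per_shard):
            self.append(tensors[i:i + max_tensors_per_shard])

shards = SplitStrategy(list(range(100)), 20)
print(len(shards))              # 5
first_shard = shards.popleft()  # deque operations work on the strategy itself
```

Subclassing deque removes a layer of indirection: callers iterate and pop shards off the strategy object directly instead of reaching into a `data` field.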
This PR introduces additional options to `convert.py` that allow users to split a model into shards while converting, rather than having to do it after conversion, including a default small first shard as outlined in #6463. Other functionality we ought to have includes `--split-max-size` (so far it's just `--split-max-tensors`), displaying estimated shard sizes, dry running, and adding sharding for the other `convert-*-to-*.py` scripts. This will be considered a draft until those are worked out. It also needs considerable testing, but luckily, since this deals with the Python scripts, it can be tested easily.

Usage

(examples use zephyr-smol_llama-100m-sft-full)

Example, `--split-max-size`:

```
python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M
```

Output: equal to what's printed to stdout from `master`, then […]

With `--split-max-size 200M` (or any number greater than the total resultant size), it gives: […]

Example, `--split-max-tensors` with `--dry-run` and `--large-first-shard`:

```
python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard
```

Output: equal to what's printed to stdout from `master`, then […]

With `--split-max-tensors 64` (or any number greater than the total tensor count), it gives: […]
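For readers skimming the options above, here is a hypothetical sketch of the partitioning behavior; `plan_shards`, its signature, and the empty-list representation of the metadata-only shard are assumptions for illustration, not the PR's actual code:

```python
def plan_shards(tensor_names, max_tensors, large_first_shard=False):
    # By default the first shard carries no tensor data (the small first
    # shard outlined in #6463); --large-first-shard would disable that.
    shards = [] if large_first_shard else [[]]
    for i in range(0, len(tensor_names), max_tensors):
        shards.append(tensor_names[i:i + max_tensors])
    return shards

names = [f"blk.{i}.attn_q.weight" for i in range(45)]
plan = plan_shards(names, 20)
for n, shard in enumerate(plan, start=1):
    print(f"shard {n} of {len(plan)}: {len(shard)} tensors")
# shard 1 of 4: 0 tensors   (metadata-only first shard)
# shard 2 of 4: 20 tensors
# shard 3 of 4: 20 tensors
# shard 4 of 4: 5 tensors
```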
References

- gguf-split
- add a default option to not include tensors data in first shard #6463