FEAT / Trainer: Experimental feature - add GGUF conversion when pushing the model to Hub #30928
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
def __init__(
    self,
    quantization_method: str,
    space_name: Optional[str] = "ggml-org/gguf-my-repo",
```

Suggested change:

```diff
- space_name: Optional[str] = "ggml-org/gguf-my-repo",
+ space_name: Optional[str] = "ggml-org/gguf-my-repo-trainer",
```
Would love to get a first round of review before adding tests and documentation! cc @amyeroberts @muellerzr

Example generated GGUF repo: https://huggingface.co/ybelkada/test-gguf-trainer-Q8_0-GGUF
Overall this makes sense from an "it works" perspective, but is there a reason why we are dedicating a Space to this rather than using something like a Hub util? (Or baking this in here, as an offline util instead of an online one.)
E.g. I'd personally rather this be applicable locally than on the Hub... (ran into this same thought process with safetensors the other day.)
Going transformers -> gradio -> hf_hub seems a bit middle-man to me.
This is possible but would require quite a lot of work: the community-maintained conversion script (https://github.com/ggerganov/llama.cpp/blob/master/convert.py) evolves quickly, which would make that feature hard to maintain. On the other hand, the official Space uses Docker and is built on that conversion script: https://huggingface.co/spaces/ggml-org/gguf-my-repo/blob/main/app.py#L102

cc @LysandreJik wdyt?
This sounds like a good approach to me too! With quantisation we have seen that the script sometimes evolves way too quickly, from day to day.
That's why the initial recommendation was for it to be used via the Space! That said, for offline use-cases perhaps we just use the convert script and update it periodically via a cron job - maybe every day?
This way the only script we'd need to keep updated would be this: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py
Then we use this to convert to fp16, and the quantisation code remains the same.
Thanks for opening this!
```python
space_name (`str`, *optional*, defaults to `"ggml-org/gguf-my-repo"`):
    Hub ID of the conversion space to use, modify it at your own risk!
private (`bool`, *optional*, defaults to `False`):
    Whether to push the final GGUF file on a private repository.
duplicate_space (`bool`, *optional*, defaults to `False`):
    Whether to duplicate the conversion space or not. This is useful to avoid queue issues if one uses the official
    Space.
```
I really don't think either of these arguments makes sense as part of a "quantization config" - they're not configuring quantization, they're configuring a Space. It also creates a clash of arguments with the current Trainer: if I have `private=True` in this config and pass `hub_private_repo=False` in the training arguments, which one should be respected?
I would move these somewhere else (how configurable do we need them to be?)
Hmm, yes, makes sense - let me think about how best we could address that.
```python
if not isinstance(gguf_config, GgufConfig):
    raise ValueError("Please pass an instance of `GgufConfig`")

from gradio_client import Client
```
We need some protection around whether `gradio_client` is available before importing it.
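A minimal sketch of such a guard, assuming no existing transformers availability util covers `gradio_client` (the helper names here are hypothetical, not from the PR):

```python
import importlib.util


def is_gradio_client_available() -> bool:
    # True if the optional gradio_client package can be imported
    return importlib.util.find_spec("gradio_client") is not None


def require_gradio_client() -> None:
    # Raise a clear, actionable error up front instead of an opaque
    # ImportError deep inside push_to_hub()
    if not is_gradio_client_available():
        raise ImportError(
            "GGUF conversion requires the `gradio_client` package. "
            "Install it with `pip install gradio_client`."
        )
```

This mirrors the `is_*_available` pattern transformers already uses for other optional dependencies.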
```python
if gguf_config.duplicate_space:
    client = Client.duplicate(gguf_config.space_name, hf_token=get_token_to_send(token))
else:
    client = Client(gguf_config.space_name)

job = client.submit(
    model_id=self.hub_model_id,
    q_method=gguf_config.quantization_method,
    private_repo=gguf_config.private,
    token=get_token_to_send(token),
    api_name="/predict",
)

logger.info(
    f"GGUF conversion Space is running with the status {job.status().code} - please monitor your HF profile to see if the converted model has been pushed"
)
```
Hmmmm... I'm really not sure about this; my gut feeling is that piggy-backing on gradio here is pretty dirty.
I'd like this general logic to be moved out into a function rather than being injected directly into a core Trainer method.
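One way to pull that logic out of the Trainer into a standalone helper, sketched from the diff above (the function name `submit_gguf_conversion` is hypothetical, and the `gradio_client` import is kept local so the Trainer still works without it installed):

```python
from typing import Optional


def submit_gguf_conversion(
    hub_model_id: str,
    gguf_config,
    token: Optional[str] = None,
):
    """Submit a GGUF conversion job for `hub_model_id` to the configured Space.

    Fire-and-forget: the Space itself pushes the converted model to the Hub.
    """
    # Local import keeps gradio_client an optional dependency
    from gradio_client import Client

    if gguf_config.duplicate_space:
        # Duplicating the Space avoids queueing behind other users
        client = Client.duplicate(gguf_config.space_name, hf_token=token)
    else:
        client = Client(gguf_config.space_name)

    return client.submit(
        model_id=hub_model_id,
        q_method=gguf_config.quantization_method,
        private_repo=gguf_config.private,
        token=token,
        api_name="/predict",
    )
```

The Trainer method would then reduce to a single call plus the log line, keeping the gradio dependency out of core trainer code.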
Makes sense, will move that out into a separate method!
What does this PR do?
Introduces a new `quantization_config` that is intended to be used only for `trainer.push_to_hub()`. It calls a GGUF conversion Space under the hood (for now: https://huggingface.co/spaces/ybelkada/gguf-my-repo, which we will move to a more "official" org). The API looks like the following:
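The API snippet itself is not preserved in this page, but based on the `GgufConfig` fields shown in the diffs above, usage would look roughly like this (a sketch; the exact `push_to_hub` kwarg name is an assumption):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GgufConfig:
    # Fields mirror the docstring shown in the review diff above
    quantization_method: str
    space_name: Optional[str] = "ggml-org/gguf-my-repo"
    private: bool = False
    duplicate_space: bool = False


gguf_config = GgufConfig(quantization_method="Q8_0")
# trainer.push_to_hub(gguf_config=gguf_config)  # hypothetical kwarg name
```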
cc @amyeroberts @Vaibhavs10