
FEAT / Trainer: Experimental feature - add GGUF conversion when pushing the model to Hub #30928

Draft · wants to merge 10 commits into main
Conversation

@younesbelkada (Contributor) commented on May 21, 2024:

What does this PR do?

Introduces a new quantization config intended to be used only with `trainer.push_to_hub()`. Under the hood it calls a GGUF conversion Space (for now https://huggingface.co/spaces/ybelkada/gguf-my-repo, which we will move to a more "official" org).

The API looks like the following:

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import GgufConfig

dataset = load_dataset("imdb", split="train")

args = SFTConfig(output_dir="test-gguf-trainer")

trainer = SFTTrainer(
    "ybelkada/tiny-random-llama",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=args
)

gguf_config = GgufConfig(
    quantization_method="Q8_0",
    private=True,
    space_name="ybelkada/gguf-my-repo"
)

trainer.push_to_hub(gguf_config=gguf_config)

cc @amyeroberts @Vaibhavs10

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada younesbelkada changed the title FEAT / Trainer: Add GGUF conversion when pushing the model to Hub FEAT / Trainer: Experimental feature - add GGUF conversion when pushing the model to Hub May 28, 2024
Review comment on the diff:

def __init__(
    self,
    quantization_method: str,
    space_name: Optional[str] = "ggml-org/gguf-my-repo",

@younesbelkada (Contributor, Author) suggested change:

-    space_name: Optional[str] = "ggml-org/gguf-my-repo",
+    space_name: Optional[str] = "ggml-org/gguf-my-repo-trainer",

@younesbelkada (Contributor, Author) commented:

Would love to get a first round of review before adding tests and documentation! cc @amyeroberts @muellerzr

@younesbelkada younesbelkada requested review from amyeroberts and muellerzr and removed request for amyeroberts May 28, 2024 16:27
@younesbelkada (Contributor, Author) commented:

Example generated GGUF repo: https://huggingface.co/ybelkada/test-gguf-trainer-Q8_0-GGUF

@muellerzr (Contributor) left a comment:

Overall this makes sense from an "it works" perspective, but is there a reason why we are dedicating a Space to this rather than using something like a Hub util? (Or baking this in here as an offline util instead of an online one.)

Personally, I'd rather this be something that can be applied locally rather than on the Hub... (I ran into the same thought process with safetensors the other day.)

Going transformers -> gradio -> hf_hub seems a bit middle-man to me.

@younesbelkada (Contributor, Author) commented:

This is possible but would require quite a lot of work: the conversion script maintained by the community (https://github.com/ggerganov/llama.cpp/blob/master/convert.py) evolves quite quickly, which would make maintaining that feature quite hard. On the other hand, the official Space uses Docker and is built on that conversion script: https://huggingface.co/spaces/ggml-org/gguf-my-repo/blob/main/app.py#L102
One direction could be to build a separate Python package, or a submodule within transformers, to convert transformers models to GGUF locally, with the caveat of the maintenance burden of keeping our submodule in sync with the official llama.cpp conversion script. What do you think? cc @Vaibhavs10 as well

@younesbelkada (Contributor, Author) commented:

cc @LysandreJik wdyt?

@Vaibhavs10 (Member) left a comment:

This sounds like a good approach to me too! With quantisation we have seen that the script sometimes evolves way too quickly, from day to day.

That's why the initial recommendation was to use it via the Space! That said, for offline use cases perhaps we just use the convert script and update it periodically via a cron job, maybe every day?

This way the only script we'd need to update would be this: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py

Then we use this to convert to fp16 and the quantisation code remains the same.
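
For reference, here is a rough sketch of what that offline path could look like, assuming a local checkout of llama.cpp. The script name, flags, and quantize binary change between llama.cpp versions, so this is illustrative only, not the proposed implementation:

import subprocess
from pathlib import Path

def convert_to_gguf_locally(model_dir: str, llama_cpp_dir: str, quant_method: str = "Q8_0") -> Path:
    # Hypothetical helper: run the llama.cpp conversion script to get an fp16 GGUF,
    # then quantize it with the llama.cpp quantize binary.
    llama_cpp = Path(llama_cpp_dir)
    fp16_path = Path(model_dir) / "model-f16.gguf"
    quant_path = Path(model_dir) / f"model-{quant_method}.gguf"

    # Step 1: transformers checkpoint -> fp16 GGUF (flags may differ per llama.cpp version).
    subprocess.run(
        ["python", str(llama_cpp / "convert-hf-to-gguf.py"), model_dir, "--outtype", "f16", "--outfile", str(fp16_path)],
        check=True,
    )

    # Step 2: fp16 GGUF -> quantized GGUF (binary name/location depends on the llama.cpp build).
    subprocess.run(
        [str(llama_cpp / "quantize"), str(fp16_path), str(quant_path), quant_method],
        check=True,
    )
    return quant_path

With something like this, the only thing a cron job would need to refresh is the vendored copy of convert-hf-to-gguf.py.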

@amyeroberts (Collaborator) left a comment:

Thanks for opening this!

Comment on lines +1052 to +1058
space_name (`str`, *optional*, defaults to `"ggml-org/gguf-my-repo"`):
    Hub ID of the conversion Space to use, modify it at your own risk!
private (`bool`, *optional*, defaults to `False`):
    Whether to push the final GGUF file to a private repository.
duplicate_space (`bool`, *optional*, defaults to `False`):
    Whether to duplicate the conversion Space or not. This is useful to avoid queue issues if one uses
    the official Space.
@amyeroberts (Collaborator) commented:
I really don't think these arguments make sense to include as part of a "quantization config" - they're not configuring quantization, they're configuring a Space. It also creates a clash of arguments with the current Trainer: if I have private=True in this config and hub_private_repo=False in the training arguments, which one should be respected?

I would move these somewhere else (how configurable do we need them to be?)

@younesbelkada (Contributor, Author) replied:
Hmm yes makes sense, let me think how best we could address that
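
One illustrative direction for that (the names below are hypothetical, not the PR's API): keep only quantization settings in the GGUF config and move the Space plumbing into its own options object, so repo visibility keeps a single source of truth in the training arguments:

from dataclasses import dataclass

@dataclass
class GgufConfig:
    # Quantization-only settings.
    quantization_method: str = "Q8_0"

@dataclass
class GgufSpaceOptions:
    # Space/Hub plumbing, kept out of the quantization config.
    space_name: str = "ggml-org/gguf-my-repo"
    duplicate_space: bool = False
    # `private` intentionally omitted: reuse TrainingArguments.hub_private_repo instead,
    # so there is no clash between the two configs.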

if not isinstance(gguf_config, GgufConfig):
    raise ValueError("Please pass an instance of `GgufConfig`")

from gradio_client import Client
@amyeroberts (Collaborator) commented:
We need some protection around whether gradio_client is available
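
A minimal sketch of such a guard (the helper name is hypothetical; the actual implementation would presumably follow transformers' existing optional-dependency pattern):

import importlib.util

def is_gradio_client_available() -> bool:
    # Check whether the optional gradio_client dependency can be imported.
    return importlib.util.find_spec("gradio_client") is not None

if not is_gradio_client_available():
    raise ImportError(
        "GGUF conversion via a Space requires `gradio_client`. "
        "Install it with `pip install gradio_client`."
    )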

Comment on lines +4303 to +4318
if gguf_config.duplicate_space:
    client = Client.duplicate(gguf_config.space_name, hf_token=get_token_to_send(token))
else:
    client = Client(gguf_config.space_name)

job = client.submit(
    model_id=self.hub_model_id,
    q_method=gguf_config.quantization_method,
    private_repo=gguf_config.private,
    token=get_token_to_send(token),
    api_name="/predict",
)

logger.info(
    f"GGUF conversion Space is running with the status {job.status().code} - please monitor your HF profile to see if the converted model has been pushed"
)
@amyeroberts (Collaborator) commented:
Hmmm... I'm really not sure about this; my gut feeling is that piggy-backing on gradio here is pretty dirty.

I'd like this general logic to be moved out to a function rather than being injected directly into a core Trainer method.

@younesbelkada (Contributor, Author) replied:
Makes sense, will move that out into a separate method!
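
A sketch of what that extraction could look like (the function name is illustrative; the body mirrors the Space-submission code quoted above):

def submit_gguf_conversion(hub_model_id: str, gguf_config: "GgufConfig", token: str):
    # Lazy import, guarded upstream by the gradio_client availability check.
    from gradio_client import Client

    if gguf_config.duplicate_space:
        client = Client.duplicate(gguf_config.space_name, hf_token=token)
    else:
        client = Client(gguf_config.space_name)

    # Kick off the conversion job on the Space and return the handle so the
    # caller (e.g. Trainer.push_to_hub) can log its status.
    return client.submit(
        model_id=hub_model_id,
        q_method=gguf_config.quantization_method,
        private_repo=gguf_config.private,
        token=token,
        api_name="/predict",
    )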
