Llama-3 Inference and Uploading to Huggingface #931

Open
fabriceyhc opened this issue May 3, 2024 · 18 comments
@fabriceyhc

I'm trying to fine-tune Llama-3-8B and 70B with LoRA on a custom drug detection dataset and upload them to Hugging Face so that they fit nicely into an existing zero-shot evaluation pipeline. My current challenge lies in converting the different checkpoints - 8B uses FullModelMetaCheckpointer, which outputs meta_model_{i}.pt, whereas 70B uses FullModelHFCheckpointer, which outputs hf_model_{idx}_{i}.pt, where i is the epoch index (over 5 epochs).

Question 1: Is it safe to assume that we only need the checkpoint with the highest i index and can delete the intermediate ones? If so, it would be handy to have a config option to conserve disk space by keeping only the most recent checkpoint.

As discussed in #832, we need to convert the 8B model from the Llama (Meta) format to HF format. To do this, I've had to move a lot of the contents from the original folder into the output_dir (e.g. tokenizer.model, etc.) and then run a script from transformers (here).

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir <checkpoint_dir> \
--llama_version 3 \
--model_size 8B \
--output_dir <hf_staging_dir> 

This script specifically looks for model weights in a file called consolidated.00.pth (L161), which is the original, untrained 8B Llama-3. It's not clear to me how to make it use the LoRA-merged meta_model_5.pt instead. When I tried to follow the instructions from the e2e example and just upload <checkpoint_dir> directly, HF errors out saying it can't find a file in a suitable format (e.g. pytorch_model.bin, etc.).
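
One possible workaround (an untested sketch, assuming the merged checkpoint is a single Meta-format state dict and that params.json and tokenizer.model have already been copied into the checkpoint dir as described above) is to stage the merged checkpoint under the consolidated.00.pth name the converter expects and point the script at that directory:

# Untested sketch: stage the LoRA-merged Meta-format checkpoint under the
# filename convert_llama_weights_to_hf.py expects. All paths are placeholders.
import shutil
from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoint_dir")   # torchtune output dir
staging_dir = Path("/path/to/staging_dir")         # fed to --input_dir
staging_dir.mkdir(parents=True, exist_ok=True)

# The converter reads consolidated.00.pth, params.json, and tokenizer.model.
shutil.copy(checkpoint_dir / "meta_model_5.pt", staging_dir / "consolidated.00.pth")
for name in ("params.json", "tokenizer.model"):
    shutil.copy(checkpoint_dir / name, staging_dir / name)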

Question 2: How can we convert both the 8B and 70B versions of the LoRA fine-tuned Llama-3 models so that they are suitable for inference via HF?

@monk1337

monk1337 commented May 6, 2024

@fabriceyhc You have to change the file name in the script to

loaded = [
    torch.load(os.path.join(input_base_path, f"hf_model_{i:04d}_0.pt"), map_location="cpu")
    for i in range(1, num_shards + 1)  # Start from 1 and go up to num_shards
]

Starting from 1 is needed because the shard naming convention starts at 0001 for 70B.

Hi @joecummings, could you please take a look at the script I've been working on? I made some changes to the names, but now I'm encountering new errors when I try to convert to HF weights for 70B. However, the conversion seems to work fine for 8B. At the moment, the issue appears to be specific to torchtune. Once we fine-tune the model, it would be great if there were a command line option available to convert the .pt weights to HF format and upload them to Hugging Face. From there, we can proceed with other tasks, since the entire ecosystem is built around Hugging Face.

https://gist.github.com/monk1337/925a5a44c431ed1f1d3927141f31b6d2
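
In the meantime, a manual upload of an already-converted checkpoint directory can be done with huggingface_hub (a rough sketch; the repo id and local path below are placeholders):

# Rough sketch: push a converted checkpoint directory to the Hugging Face Hub.
# Requires `huggingface-cli login` (or HF_TOKEN) beforehand; repo id and path
# below are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/llama3-70b-finetuned", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="/path/to/hf_staging_dir",
    repo_id="your-username/llama3-70b-finetuned",
    repo_type="model",
)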

@optimass

Once we fine-tune the model, it would be great if there were a command line option available to convert the .pt weights to HF format and upload them to Hugging Face. From there, we can proceed with other tasks, since the entire ecosystem is built around Hugging Face.

totally agree that we need this!

@optimass

optimass commented May 10, 2024

https://gist.github.com/monk1337/925a5a44c431ed1f1d3927141f31b6d2
I tried this with Llama-3-8B and got the following error:

File "/home/toolkit/ui-copilot/finetuning/utils/convert_llama_weights_to_hf2.py", line 447, in main
    write_model(
  File "/home/toolkit/ui-copilot/finetuning/utils/convert_llama_weights_to_hf2.py", line 195, in write_model
    f"model.layers.{layer_i}.input_layernorm.weight": loaded[0][
    KeyError: 'layers.0.attention_norm.weight'

@kartikayk
Contributor

Hey! Sorry you're running into issues here. I didn't realize there are differences in how the 8B and 70B models are converted. Let me look into this in a bit.

@kartikayk
Contributor

We have this function in the checkpointer, but it seems like it isn't getting the job done, so we need to figure out why that is.

@kartikayk
Contributor

@fabriceyhc Some thoughts on the questions you asked above:

My current challenge lies in converting the different checkpoints - 8B uses FullModelMetaCheckpointer, which outputs meta_model_{i}.pt, whereas 70B uses FullModelHFCheckpointer, which outputs hf_model_{idx}_{i}.pt, where i is the epoch index (over 5 epochs)

The checkpointer used depends on the input checkpoint format. The 8B config makes use of the consolidated.00.pth file, which is in Meta format, but you can update it to use the safetensors checkpoints and the HFCheckpointer instead. This should address the discrepancy in configs between 8B and 70B.

Is it safe to assume that we only need the checkpoint with the highest i index and can delete the intermediate ones? If so, it would be handy to have a config option to conserve disk space by keeping only the most recent checkpoint.

Yes, this is the right understanding. Adding this flag has been on our TODO list for a while now. If you'd be open to contributing it as a PR, I'd be happy to collaborate with you on the review.
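
Until that flag exists, a small cleanup script along these lines (a rough sketch, assuming the hf_model_{shard:04d}_{epoch}.pt naming described above) can keep only the last epoch's shards:

# Rough sketch: keep only the shards from the highest epoch index, assuming the
# checkpoints are named hf_model_{shard:04d}_{epoch}.pt as described above.
import re
from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoint_dir")
pattern = re.compile(r"hf_model_\d+_(\d+)\.pt")

shards_by_epoch = {}
for ckpt in checkpoint_dir.glob("hf_model_*_*.pt"):
    match = pattern.fullmatch(ckpt.name)
    if match:
        shards_by_epoch.setdefault(int(match.group(1)), []).append(ckpt)

if shards_by_epoch:
    latest_epoch = max(shards_by_epoch)
    for epoch, shards in shards_by_epoch.items():
        if epoch != latest_epoch:
            for shard in shards:
                shard.unlink()  # delete intermediate-epoch shards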

How can we convert both the 8B and 70B versions of the LoRA fine-tuned Llama-3 models so that they are suitable for inference via HF?

As I commented above, the FullModelCheckpointer does this conversion for you. But it seems like you're still running into issues?

@optimass

As I commented above, the FullModelCheckpointer does this conversion for you. But it seems like you're still running into issues?

Yes, it's still unclear to me how to use the FullModelHFCheckpointer's outputs with HF's APIs, in particular Text Generation Inference (TGI).

@SoshyHayami

@kartikayk
I'm having this same issue, but with a full fine-tuned checkpoint. I can't go back and re-train the model with a new checkpointer (I used Meta's checkpointer, as it was the default in the configs); it would be very costly for me. Now I have a .pt file I can't seem to use. Any suggestion on what I should do here?

I need it in the safetensors format. I used the converter provided by Hugging Face, but it only supports .pth, so I changed that part in the script. It converts, but the performance seems to be a lot worse than using the .pt directly with 'tune run generate'.

Anyway, I need your help here. Thanks.

@ebsmothers
Contributor

Thanks @SoshyHayami for providing additional details here; @joecummings is gonna take a look at this.

@fabriceyhc
Author

fabriceyhc commented Jun 10, 2024

Hey @SoshyHayami, I figured out a workaround based on #878. Inside your training folder, you should have a file called model.safetensors.index.json. All you need to do is edit the weight_map values to point to your .pt files. Hugging Face will still be able to read them. I've confirmed the upload to HF and can download the model onto other servers just like any other, without issues.

Here's what my file looks like after manually editing it: model.safetensors.index.json

Since I fine-tuned my model for 16 epochs, my epoch index is 15 (e.g. hf_model_0030_15.pt). You'll just have to change yours to whatever the index is on your end.
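
For anyone who prefers to script the edit, something along these lines should work (a rough sketch, assuming HF-style shard names in the index and the hf_model_{shard:04d}_{epoch}.pt naming described above):

# Rough sketch: point every weight_map entry in model.safetensors.index.json at
# the matching torchtune shard from the final epoch. Shard/epoch naming follows
# the hf_model_{shard:04d}_{epoch}.pt convention described above.
import json
from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoint_dir")
final_epoch = 15  # change to your last epoch index

index_path = checkpoint_dir / "model.safetensors.index.json"
index = json.loads(index_path.read_text())

for key, shard_name in index["weight_map"].items():
    # e.g. "model-00030-of-00030.safetensors" -> "hf_model_0030_15.pt"
    shard_id = int(shard_name.split("-")[1])
    index["weight_map"][key] = f"hf_model_{shard_id:04d}_{final_epoch}.pt"

index_path.write_text(json.dumps(index, indent=2))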

@SoshyHayami

@fabriceyhc
Thanks for the tip. torchtune uses the original Llama checkpoint to initiate a training session, and I didn't use the HF checkpointer to save the model, so I'm not sure if it'll work.

@joecummings
Contributor

@kartikayk I'm having this same issue, but with a full fine-tuned checkpoint. I can't go back and re-train the model with a new checkpointer (I used Meta's checkpointer, as it was the default in the configs); it would be very costly for me. Now I have a .pt file I can't seem to use. Any suggestion on what I should do here?

I need it in the safetensors format. I used the converter provided by Hugging Face, but it only supports .pth, so I changed that part in the script. It converts, but the performance seems to be a lot worse than using the .pt directly with 'tune run generate'.

Anyway, I need your help here. Thanks.

Can you confirm that you have a single output file for your fine-tuned model, named something like meta_model_X.pt?

Also, which conversion script exactly did you use to convert to safetensors? I see this one, but it would need to be modified to use a local file instead of pulling from the HF Hub.

Lastly, what do you mean by "performance seems to be a lot worse"? What are you using to evaluate performance in this case?
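
On the conversion question, a direct local .pt-to-.safetensors re-serialization (a minimal sketch; it only changes the file format, so the state dict keys must already be in HF format and the usual config.json/tokenizer files must be present for from_pretrained to load the result) could look like this:

# Minimal sketch: re-serialize a local .pt state dict as .safetensors. This only
# changes the file format; the keys must already be in HF format (and
# config.json / tokenizer files must be present) for from_pretrained to work.
import torch
from safetensors.torch import save_file

state_dict = torch.load("/path/to/hf_model_0001_0.pt", map_location="cpu")
# safetensors requires contiguous tensors and rejects shared storage.
state_dict = {k: v.contiguous() for k, v in state_dict.items()}
save_file(state_dict, "/path/to/model.safetensors")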

@SoshyHayami

@kartikayk I'm having this same issue, but with a full fine-tuned checkpoint. I can't go back and re-train the model with a new checkpointer (I used Meta's checkpointer, as it was the default in the configs); it would be very costly for me. Now I have a .pt file I can't seem to use. Any suggestion on what I should do here?
I need it in the safetensors format. I used the converter provided by Hugging Face, but it only supports .pth, so I changed that part in the script. It converts, but the performance seems to be a lot worse than using the .pt directly with 'tune run generate'.
Anyway, I need your help here. Thanks.

Can you confirm that you have a single output file for your fine-tuned model, named something like meta_model_X.pt?

Also, which conversion script exactly did you use to convert to safetensors? I see this one, but it would need to be modified to use a local file instead of pulling from the HF Hub.

Lastly, what do you mean by "performance seems to be a lot worse"? What are you using to evaluate performance in this case?

1- Yes, meta_model_1.pt.

2- No, I used the one from the transformers repo, the one the OP mentioned using in his first post. It supports Llama 3 conversion, but only with the .pth format. I just hard-coded the line of code where the model is loaded to use my own checkpoint's absolute path.

3- Just eyeballing it, I'm not particularly sure about that, but it does seem so. There's a lot of repetition, the model hallucinates really badly even in English, and Llama's prompt template can be seen in the output. There's less of that when using tune generate directly. But regardless of whether I'm right or wrong on this third point, I need a sure way to convert this .pt to .safetensors.

@JonasQN

JonasQN commented Jun 10, 2024

3- Just eyeballing it, I'm not particularly sure about that, but it does seem so. There's a lot of repetition, the model hallucinates really badly even in English, and Llama's prompt template can be seen in the output. There's less of that when using tune generate directly. But regardless of whether I'm right or wrong on this third point, I need a sure way to convert this .pt to .safetensors.

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

@SoshyHayami

SoshyHayami commented Jun 10, 2024

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

It's either the conversion, or there's something wrong with the training process (not necessarily with torchtune, but rather the data or template, using the instruct model instead of the base model, etc.).
I no longer have the time or the compute to test it, but if you can, try axolotl to see if that gives you better results; if it does, I'd appreciate it if you let me know.

@joecummings
Contributor

3- Just eyeballing it, I'm not particularly sure about that, but it does seem so. There's a lot of repetition, the model hallucinates really badly even in English, and Llama's prompt template can be seen in the output. There's less of that when using tune generate directly. But regardless of whether I'm right or wrong on this third point, I need a sure way to convert this .pt to .safetensors.

I have a similar experience with this issue: I'm using the exact same prompt, but the output from the .safetensors conversion is really off.

Can you expand on "really off"? Also, while I'm investigating this issue, I'm using the following code snippet to check the converted safetensors files on the HF from_pretrained side:

from transformers import LlamaForCausalLM, PreTrainedTokenizerFast
from torchtune.utils import set_seed
import torch

set_seed(1234)

with torch.device("cuda"):
    model = LlamaForCausalLM.from_pretrained("/path/to/converted/model")
    tokenizer = PreTrainedTokenizerFast.from_pretrained("/path/to/converted/model")

    tokens = tokenizer("Tell me a joke", return_tensors="pt")
    outputs = model.generate(**tokens, top_k=300, max_new_tokens=300, do_sample=True, temperature=0.6)

    tokenizer.batch_decode(outputs, skip_special_tokens=True)

If this does not resemble your workflow, please let me know.

@JonasQN

JonasQN commented Jun 11, 2024

Can you expand on "really off"? Also, while I'm investigating this issue, I'm using the code snippet above to check the converted safetensors files on the HF from_pretrained side. If this does not resemble your workflow, please let me know.

I'm using similar code; the result I got for my prompt was just my prompt repeated, followed by output like this (I don't have any Russian data in my fine-tuning dataset): прикладприкладприкладприкладприкладприкладприкладприкладприкладприкладприкладприкладприкладприкладприкладпр... Whenever I use torchtune's generation script, it works completely fine.

@JonasQN

JonasQN commented Jun 11, 2024

It's either the conversion, or there's something wrong with the training process (not necessarily with torchtune, but rather the data or template, using the instruct model instead of the base model, etc.). I no longer have the time or the compute to test it, but if you can, try axolotl to see if that gives you better results; if it does, I'd appreciate it if you let me know.

I was able to generate outputs for my test data by modifying generate.py from the recipes folder; maybe you could try that too.
