
Support for LLaMA-2 70B with Grouped-Query Attention #91

Open · kaiwang13 opened this issue Jul 21, 2023 · 18 comments
Labels: bug (Something isn't working)

kaiwang13 commented Jul 21, 2023

Due to the Grouped-Query Attention introduced in LLaMA-2 70B (see the llama issue), the checkpoint cannot be loaded into the collie implementation of LLaMA. I hope LLaMA-2 70B can be supported in collie. Thanks.

Traceback (most recent call last):
  File "/nvme1/gptdata/share1/projects/collie/examples/download.py", line 49, in <module>
    model = LlamaForCausalLM.from_pretrained(model_name, config=config)
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/base.py", line 306, in from_pretrained
    state_dict = cls.load_parallel_state_dict(
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/llama/model.py", line 414, in load_parallel_state_dict
    part_state_dict[key] = rearrange(
RuntimeError: shape '[8192, 8192]' is invalid for input of size 8388608
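
For reference, a minimal sketch of why the reshape fails under GQA, assuming the published LLaMA-2 70B config values (hidden_size 8192, 64 query heads, 8 key/value heads, head_dim 128):

```python
hidden_size = 8192
num_attention_heads = 64
num_key_value_heads = 8                          # GQA: 8 KV heads shared by 64 query heads
head_dim = hidden_size // num_attention_heads    # 128

# The query projection is still square, but the key/value projections shrink under GQA:
wq_numel = hidden_size * hidden_size                     # 67,108,864 elements
wk_numel = num_key_value_heads * head_dim * hidden_size  # 8,388,608 -> the size in the error

# So a view/rearrange of wk or wv to [8192, 8192] cannot work:
print(wk_numel)               # 8388608
print(wk_numel == wq_numel)   # False
```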
kaiwang13 changed the title from "LLaMA-2 70B GQA support" to "Support for LLaMA-2 70B with Grouped-Query Attention" on Jul 21, 2023

dittops commented Jul 24, 2023

I got the same error for LLaMA-2 70B.


dittops commented Jul 25, 2023

@kaiwang13 Could you please share how you resolved the issue?

kaiwang13 (Author):

> @kaiwang13 Could you please share how you resolved the issue?

Just uninstall the old version and install the latest one from source code.


dittops commented Jul 25, 2023

I just cloned the repo and installed it from the main branch. But I'm still facing the error. Do I need to install it from any specific branch?

kaiwang13 reopened this on Jul 25, 2023
kaiwang13 (Author):

> I just cloned the repo and installed it from the main branch. But I'm still facing the error. Do I need to install it from any specific branch?

Remove the code at https://github.com/OpenLMLab/collie/blob/c9cc0055a52b96d156450b5734a0a1d0dbde4562/collie/models/llama/model.py#L425C1-L432C64
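
Judging from the traceback, the rearrange there assumes every attention projection is a square [hidden_size, hidden_size] matrix. A GQA-aware version would need to pick the head count per projection; a hypothetical sketch (not the actual collie code):

```python
# Hypothetical sketch: choose the head count per projection so that wk/wv,
# which have GQA shapes, can still be split head-wise for tensor parallelism.
import torch
from einops import rearrange

hidden_size, n_heads, n_kv_heads = 8192, 64, 8
head_dim = hidden_size // n_heads                      # 128

wq = torch.empty(n_heads * head_dim, hidden_size)      # [8192, 8192]
wk = torch.empty(n_kv_heads * head_dim, hidden_size)   # [1024, 8192] under GQA

wq_by_head = rearrange(wq, "(h d) m -> h d m", h=n_heads)     # [64, 128, 8192]
wk_by_head = rearrange(wk, "(h d) m -> h d m", h=n_kv_heads)  # [8, 128, 8192]
```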


dittops commented Jul 25, 2023

This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?

kaiwang13 (Author):

> This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?

What I did doesn't actually solve the problem: without that code, the pretrained state dict cannot be loaded for training.


dittops commented Jul 29, 2023

Okay, so is there any suggestion for solving the problem?

00INDEX added the "bug" (Something isn't working) label on Aug 1, 2023
x54-729 (Contributor) commented Aug 2, 2023

> > This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?
>
> What I did doesn't actually solve the problem: without that code, the pretrained state dict cannot be loaded for training.

Do you mean that the error occurs when loading the pretrained state dict? Could you please show the error log?
Sorry for the late reply.


dittops commented Aug 2, 2023

I was testing this with the main branch. While loading the state dict, it uses around 550 GB of CPU memory. Yesterday I tried it on a larger instance with 900 GB of CPU memory, and it hit a shape error at the start of training, as @kaiwang13 mentioned. I don't have access to that machine right now to share the log.

However, yesterday I also tested the dev branch. On the dev branch, CPU usage was only around 150 GB, but I got an OOM while saving the checkpoint after the first epoch. See issue #98 about this.

Let me know if this info is enough for you to proceed further.

x54-729 (Contributor) commented Aug 2, 2023

> I was testing this with the main branch. While loading the state dict, it uses around 550 GB of CPU memory. Yesterday I tried it on a larger instance with 900 GB of CPU memory, and it hit a shape error at the start of training, as @kaiwang13 mentioned. I don't have access to that machine right now to share the log.
>
> However, yesterday I also tested the dev branch. On the dev branch, CPU usage was only around 150 GB, but I got an OOM while saving the checkpoint after the first epoch. See issue #98 about this.
>
> Let me know if this info is enough for you to proceed further.

Thanks for the information! The latest LLaMA-2 support has not been merged into the main branch yet, so errors on main are expected.
We will test the checkpoint-saving process later. Does the OOM occur when saving a checkpoint with ZeRO-3 and LOMO? And is it a GPU OOM or a CPU OOM?


dittops commented Aug 2, 2023

Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.

x54-729 (Contributor) commented Aug 2, 2023

> Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.

Thanks a lot! We will try to fix it

kaiwang13 (Author):

> > Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.
>
> Thanks a lot! We will try to fix it

@dittops @x54-729 Additionally, I tried training LLaMA-1 33B with a sequence length of 2048 and a batch size of 1, using AdamW with ZeRO-3 on 8×A100 80 GB. Training went fine, but I encountered an OOM when attempting to save the model.

KaiLv69 (Collaborator) commented Aug 6, 2023

We've found that the OOM problem comes from the parameter-gathering step that uses DeepSpeed's API, and we plan to fix it by gathering the parameters one by one.
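
For readers hitting the same issue, a minimal sketch of the one-by-one gathering idea using DeepSpeed's public `deepspeed.zero.GatheredParameters` context manager (an illustration only, not necessarily the actual collie fix):

```python
import torch
import deepspeed


def gather_state_dict_one_by_one(model: torch.nn.Module, rank: int) -> dict:
    """Collect a full state dict under ZeRO-3 by gathering one parameter at a time."""
    state_dict = {}
    for name, param in model.named_parameters():
        # Materialize only this single parameter from its ZeRO-3 partitions,
        # instead of gathering the whole model at once (which can OOM for 70B).
        with deepspeed.zero.GatheredParameters([param]):
            if rank == 0:
                state_dict[name] = param.detach().cpu().clone()
    return state_dict
```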


dittops commented Aug 18, 2023

@x54-729 Please let me know if you have pushed any updates on this to the dev branch. I can try it out.

KaiLv69 (Collaborator) commented Aug 23, 2023

> @x54-729 Please let me know if you have pushed any updates on this to the dev branch. I can try it out.

Hi, the bug is fixed in the dev branch; maybe you can give it a try.

FYI: 82869ee ac6eed4


dittops commented Sep 4, 2023

I have tested the code. I was able to train and save the model.

I tested by training on a small dataset that contains the model's identity (in English). But at inference time, the model started generating Chinese instead of English when producing identity-related text.

I was using 70B + LOMO + Stage 3 + transformers 4.32.1.

I have tried encoding and decoding the training data with the tokenizer, and that looks fine. Any thoughts on what could be the issue here?
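
A sketch of that kind of tokenizer round-trip check, assuming a Hugging Face tokenizer (the model path and sample string below are placeholders):

```python
from transformers import AutoTokenizer

# Placeholders: use the tokenizer that matches the fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
sample = "Example identity sentence from the training data."

ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True)

print(ids)
print(decoded)  # expected to match `sample` up to whitespace normalization
```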
