
Support for LLaMA-2 70B with Grouped-Query Attention #91

Open · kaiwang13 opened this issue Jul 21, 2023 · 18 comments
Labels: bug (Something isn't working)

kaiwang13 commented Jul 21, 2023

Due to the Grouped-Query Attention introduced in LLaMA-2 70B (see the llama issue), the checkpoint cannot be loaded into the collie implementation of LLaMA. I hope LLaMA-2 70B can be supported in collie. Thanks.

Traceback (most recent call last):
  File "/nvme1/gptdata/share1/projects/collie/examples/download.py", line 49, in <module>
    model = LlamaForCausalLM.from_pretrained(model_name, config=config)
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/base.py", line 306, in from_pretrained
    state_dict = cls.load_parallel_state_dict(
  File "/nvme1/gptdata/share1/app/mambaforge/envs/collie/lib/python3.9/site-packages/collie/models/llama/model.py", line 414, in load_parallel_state_dict
    part_state_dict[key] = rearrange(
RuntimeError: shape '[8192, 8192]' is invalid for input of size 8388608
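
For reference, a minimal sketch of why the reshape fails under GQA, assuming the published LLaMA-2 70B config values (hidden_size 8192, 64 query heads, 8 key/value heads, head_dim 128):

```python
hidden_size = 8192
num_attention_heads = 64
num_key_value_heads = 8                          # GQA: 8 KV heads shared by 64 query heads
head_dim = hidden_size // num_attention_heads    # 128

# The query projection is still square, but the key/value projections shrink under GQA:
wq_numel = hidden_size * hidden_size                     # 67,108,864 elements
wk_numel = num_key_value_heads * head_dim * hidden_size  # 8,388,608 -> the size in the error

# So a view/rearrange of wk or wv to [8192, 8192] cannot work:
print(wk_numel)               # 8388608
print(wk_numel == wq_numel)   # False
```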
kaiwang13 changed the title from "LLaMA-2 70B GQA support" to "Support for LLaMA-2 70B with Grouped-Query Attention" on Jul 21, 2023

dittops commented Jul 24, 2023

I got the same error for LLaMA-2 70B.


dittops commented Jul 25, 2023

@kaiwang13 Could you please share how you resolved the issue?

kaiwang13 (Author):

> @kaiwang13 Could you please share how you resolved the issue?

Just uninstall the old version and install the latest one from source code.


dittops commented Jul 25, 2023

I just cloned the repo and installed it from the main branch. But I'm still facing the error. Do I need to install it from any specific branch?

kaiwang13 reopened this on Jul 25, 2023
kaiwang13 (Author):

> I just cloned the repo and installed it from the main branch. But I'm still facing the error. Do I need to install it from any specific branch?

Remove the code at https://github.com/OpenLMLab/collie/blob/c9cc0055a52b96d156450b5734a0a1d0dbde4562/collie/models/llama/model.py#L425C1-L432C64
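
Judging from the traceback, the rearrange there assumes every attention projection is a square [hidden_size, hidden_size] matrix. A GQA-aware version would need to pick the head count per projection; a hypothetical sketch (not the actual collie code):

```python
# Hypothetical sketch: choose the head count per projection so that wk/wv,
# which have GQA shapes, can still be split head-wise for tensor parallelism.
import torch
from einops import rearrange

hidden_size, n_heads, n_kv_heads = 8192, 64, 8
head_dim = hidden_size // n_heads                      # 128

wq = torch.empty(n_heads * head_dim, hidden_size)      # [8192, 8192]
wk = torch.empty(n_kv_heads * head_dim, hidden_size)   # [1024, 8192] under GQA

wq_by_head = rearrange(wq, "(h d) m -> h d m", h=n_heads)     # [64, 128, 8192]
wk_by_head = rearrange(wk, "(h d) m -> h d m", h=n_kv_heads)  # [8, 128, 8192]
```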


dittops commented Jul 25, 2023

This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?

kaiwang13 (Author):

> This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?

What I did doesn't actually solve the problem: without that code, the pretrained state dict cannot be loaded for training.


dittops commented Jul 29, 2023

Okay, so is there any suggestion for solving the problem?

00INDEX added the "bug" (Something isn't working) label on Aug 1, 2023
x54-729 (Contributor) commented Aug 2, 2023

> > This resolved the shape error. But now it seems to be offloading to CPU memory, and the process gets killed because it runs out of CPU memory. I have 450 GB of CPU memory and 4×A100 80 GB. I'm using the LOMO optimizer. Is this expected for LOMO?
>
> What I did doesn't actually solve the problem: without that code, the pretrained state dict cannot be loaded for training.

Do you mean that the error occurs when loading the pretrained state dict? Could you please show the error log?
Sorry for the late reply.


dittops commented Aug 2, 2023

I was testing this with the main branch. While loading the state dict, it uses around 550 GB of CPU memory. Yesterday I tried it on a larger instance with 900 GB of CPU memory, and it hit a shape error at the start of training, as @kaiwang13 mentioned. I don't have access to that machine right now to share the log.

However, yesterday I also tested the dev branch. On the dev branch, CPU usage was only around 150 GB, but I got an OOM while saving the checkpoint after the first epoch. See issue #98 about this.

Let me know if this info is enough for you to proceed further.

x54-729 (Contributor) commented Aug 2, 2023

> I was testing this with the main branch. While loading the state dict, it uses around 550 GB of CPU memory. Yesterday I tried it on a larger instance with 900 GB of CPU memory, and it hit a shape error at the start of training, as @kaiwang13 mentioned. I don't have access to that machine right now to share the log.
>
> However, yesterday I also tested the dev branch. On the dev branch, CPU usage was only around 150 GB, but I got an OOM while saving the checkpoint after the first epoch. See issue #98 about this.
>
> Let me know if this info is enough for you to proceed further.

Thanks for the information! The latest LLaMA-2 support has not been merged into the main branch yet, so errors on main are expected.
We will test the checkpoint-saving process later. Does the OOM occur when saving a checkpoint with ZeRO-3 and LOMO? And is it a GPU OOM or a CPU OOM?


dittops commented Aug 2, 2023

Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.

x54-729 (Contributor) commented Aug 2, 2023

> Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.

Thanks a lot! We will try to fix it

kaiwang13 (Author):

> > Yes, I was using ZeRO-3 and LOMO, and I was getting a GPU OOM while saving.
>
> Thanks a lot! We will try to fix it

@dittops @x54-729 Additionally, I tried training LLaMA-1 33B with a sequence length of 2048 and a batch size of 1, using AdamW with ZeRO-3 on 8×A100 80 GB. Training went fine, but I encountered an OOM when attempting to save the model.

KaiLv69 (Collaborator) commented Aug 6, 2023

We've found that the OOM problem comes from the parameter-gathering step that uses DeepSpeed's API, and we plan to fix it by gathering the parameters one by one.
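
For readers hitting the same issue, a minimal sketch of the one-by-one gathering idea using DeepSpeed's public `deepspeed.zero.GatheredParameters` context manager (an illustration only, not necessarily the actual collie fix):

```python
import torch
import deepspeed


def gather_state_dict_one_by_one(model: torch.nn.Module, rank: int) -> dict:
    """Collect a full state dict under ZeRO-3 by gathering one parameter at a time."""
    state_dict = {}
    for name, param in model.named_parameters():
        # Materialize only this single parameter from its ZeRO-3 partitions,
        # instead of gathering the whole model at once (which can OOM for 70B).
        with deepspeed.zero.GatheredParameters([param]):
            if rank == 0:
                state_dict[name] = param.detach().cpu().clone()
    return state_dict
```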


dittops commented Aug 18, 2023

@x54-729 Please let me know if you have pushed any updates on this to the dev branch. I can try it out.

KaiLv69 (Collaborator) commented Aug 23, 2023

> @x54-729 Please let me know if you have pushed any updates on this to the dev branch. I can try it out.

Hi, the bug is fixed in the dev branch; maybe you can give it a try.

FYI: 82869ee ac6eed4


dittops commented Sep 4, 2023

I have tested the code. I was able to train and save the model.

I tested by training on a small dataset that contains the model's identity (in English). But at inference time, the model started generating Chinese instead of English when producing identity-related text.

I was using 70B + LOMO + Stage 3 + transformers 4.32.1.

I have tried encoding and decoding the training data with the tokenizer, and that looks fine. Any thoughts on what could be the issue here?
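
A sketch of that kind of tokenizer round-trip check, assuming a Hugging Face tokenizer (the model path and sample string below are placeholders):

```python
from transformers import AutoTokenizer

# Placeholders: use the tokenizer that matches the fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
sample = "Example identity sentence from the training data."

ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
decoded = tokenizer.decode(ids, skip_special_tokens=True)

print(ids)
print(decoded)  # expected to match `sample` up to whitespace normalization
```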
