Error: llama2 70B LlamaForCausalLM.from_pretrained with ZeRO-3 enabled consumes a large amount of host memory and causes OOM #98

Open
xiaopqr opened this issue Jul 31, 2023 · 5 comments
Labels
bug Something isn't working

Comments

xiaopqr commented Jul 31, 2023

With 8 V100 GPUs, ZeRO-3 enabled, TP=1, PP=1, DP=8, calling LlamaForCausalLM.from_pretrained on the Llama 70B model runs out of memory (host RAM, not GPU memory). The machine has 512 GB of physical memory.
The cause is at line 304 of base.py in the dev branch:

    state_dict = {}
    if (
        not is_zero3_enabled(config)
        or env.dp_rank == 0
        or config.low_cpu_mem_usage
        or config.quantization_config.load_in_8bit
        or getattr(config.quantization_config, "load_in_4bit", False)
    ):
        state_dict = cls.load_parallel_state_dict(
            path=model_path_or_name,
            config=config,
            process_exclusion=process_exclusion,
            **kwargs,
        )

This makes all 8 processes load the state_dict once each, which consumes a great deal of memory and causes the OOM.
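
As a rough sanity check on the numbers in this report, the sketch below estimates host memory use if every process holds its own full copy of the weights. The 2 bytes per parameter (an fp16/bf16 checkpoint) and the round 70e9 parameter count are assumptions, not figures from this report, and decimal gigabytes are used throughout.

```python
# Back-of-the-envelope host RAM estimate; all constants marked "assumption"
# are illustrative and do not come from the issue itself.

NUM_PARAMS = 70e9        # assumption: nominal "70B" parameter count
BYTES_PER_PARAM = 2      # assumption: fp16/bf16 weights
NUM_PROCESSES = 8        # from the report: DP=8, one process per GPU
HOST_RAM_GB = 512        # from the report: physical memory


def state_dict_gb(num_params: float, bytes_per_param: int) -> float:
    """Size of one full copy of the weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# One copy of the weights, as a single process would hold it.
per_process_gb = state_dict_gb(NUM_PARAMS, BYTES_PER_PARAM)

# Every process loading its own copy multiplies the footprint.
all_processes_gb = per_process_gb * NUM_PROCESSES

print(f"one copy:   {per_process_gb:.0f} GB")
print(f"{NUM_PROCESSES} copies:   {all_processes_gb:.0f} GB")
print(f"fits in RAM with one copy:  {per_process_gb < HOST_RAM_GB}")
print(f"fits in RAM with {NUM_PROCESSES} copies: {all_processes_gb < HOST_RAM_GB}")
```

Under these assumptions a single copy fits comfortably in 512 GB, while eight copies exceed it by roughly a factor of two, which is consistent with the host-side OOM described here.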

xiaopqr changed the title from "llama2 70B" to "Error: llama2 70B LlamaForCausalLM.from_pretrained with ZeRO-3 enabled consumes a large amount of host memory and causes OOM" Jul 31, 2023
Author

xiaopqr commented Aug 1, 2023

@KaiLv69 would you mind taking a look?

KaiLv69 assigned KaiLv69 and 00INDEX and unassigned KaiLv69 Aug 1, 2023
Collaborator

00INDEX commented Aug 1, 2023

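@xiaopqr Hello, and sorry for the inconvenience. This issue has been fixed in 1871bcb. Please use the dev branch, or wait until the fix is merged into the main branch in the next release.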

00INDEX added the bug (Something isn't working) label Aug 1, 2023

dittops commented Aug 1, 2023

I have tested this version of the branch in 4 * A100 80GB. The training is happening, but I'm getting OOM while saving the checkpoint.

Collaborator

KaiLv69 commented Aug 23, 2023

> I have tested this version of the branch in 4 * A100 80GB. The training is happening, but I'm getting OOM while saving the checkpoint.

Hi, the bug is fixed in the dev branch; please give it a try.

FYI: 82869ee ac6eed4


0three commented Sep 2, 2023

Could you please share the script for training the 70B model?

5 participants