Timeout Error in all_gather during evaluate_ppo() on 2 H100 GPUs with miniLLM and ZeRO #127

Open
Ispanicus opened this issue Dec 12, 2023 · 2 comments

Comments

@Ispanicus

Hi, I'm using ZeRO with optimizer and parameter offload to run MiniLLM on 2 H100 GPUs on a single node. After the generation evaluation finishes, I get a timeout during the all_gather step.

Generation Evaluation: 100%|█████████▉| 497/499 [18:29:58<05:20, 160.10s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
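
One way to read the log above is to pull the fields out of the two watchdog lines and compare them per rank. The sketch below does that with the standard library only; the helper names `parse_timeouts` and `ranks_desynchronized` are hypothetical, not part of MiniLLM or PyTorch. In this log, rank 0 timed out on SeqNum 22134629 (`_ALLGATHER_BASE`, 65,536,000 input elements) while rank 1 timed out on SeqNum 22134630 (`ALLGATHER`, 499 input elements, matching the 499 evaluation steps). That would be consistent with the two ranks having fallen out of step rather than with one slow transfer, but it is a reading of the log, not a confirmed diagnosis.

```python
import re

# The first watchdog line for each rank, copied from the log above.
LOG = """\
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
"""

PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def parse_timeouts(log_text):
    """Return {rank: fields} using the first watchdog timeout seen per rank."""
    timeouts = {}
    for match in PATTERN.finditer(log_text):
        rank = int(match.group("rank"))
        if rank in timeouts:
            continue  # the same timeout is repeated several times per rank
        timeouts[rank] = {
            "seq": int(match.group("seq")),
            "op": match.group("op"),
            "numel_in": int(match.group("numel_in")),
            "numel_out": int(match.group("numel_out")),
            "timeout_ms": int(match.group("timeout_ms")),
        }
    return timeouts


def ranks_desynchronized(timeouts):
    """True if the ranks were not all waiting on the same collective."""
    pending = {(entry["seq"], entry["op"]) for entry in timeouts.values()}
    return len(pending) > 1


timeouts = parse_timeouts(LOG)
for rank in sorted(timeouts):
    entry = timeouts[rank]
    hours = entry["timeout_ms"] / 3_600_000  # milliseconds per hour
    print(f"rank {rank}: seq={entry['seq']} op={entry['op']} "
          f"numel_in={entry['numel_in']} timeout={hours:g} h")
print("ranks waiting on different collectives:", ranks_desynchronized(timeouts))
```

The configured timeout of 18,000,000 ms is 5 hours, so both ranks sat in their respective collectives for the full 5 hours before the watchdog fired.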

I've tried increasing the timeout period without success. Are there any other configurations or steps I can take to resolve this timeout issue?

Thank you for your help!
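
A possible next step, not something confirmed in this thread, is to turn on more verbose logging before the process group is created. The sketch below sets two environment variables that NCCL and PyTorch read: `NCCL_DEBUG=INFO` and `TORCH_DISTRIBUTED_DEBUG=DETAIL`. The helper name `enable_distributed_debug` is hypothetical, and where to call it in the MiniLLM launch path is an assumption; it only has an effect if it runs before distributed initialization in every process.

```python
import os


def enable_distributed_debug(env=None):
    """Set verbose NCCL / torch.distributed logging unless already configured.

    `env` defaults to os.environ; passing a plain dict makes the helper easy
    to try without touching the real environment.
    """
    if env is None:
        env = os.environ
    # NCCL prints its own initialization and transport details at INFO level.
    env.setdefault("NCCL_DEBUG", "INFO")
    # PyTorch adds extra consistency checks and logging for collectives.
    env.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")
    return env


# Example on a scratch dict: an existing value is kept, a missing one is added.
print(enable_distributed_debug({"NCCL_DEBUG": "WARN"}))
```

The same variables can be exported in the shell that launches the job instead. Expect the extra checks to slow the run down, so this is better suited to a short reproduction than to the full 18-hour evaluation.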

@donglixp
Contributor

Have you tried A100s or V100s? I am unsure whether the above error only appears with H100s.

@Ispanicus
Author

Ispanicus commented Dec 12, 2023

Unfortunately, I only have access to 2 H100s. The hardware could be the issue, since H100s use CUDA compute capability sm_90, but I wouldn't know where to begin debugging that.
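
A minimal first check for the sm_90 question, assuming a CUDA build of PyTorch is installed, is to compare each GPU's compute capability with the architectures the installed PyTorch binary was compiled for. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls; the helpers `native_arch_name`, `has_native_kernels`, and `report` are hypothetical names for this sketch. It only tests for an exact `sm_XY` entry and ignores PTX forward compatibility, so a "no" here is a hint to look further, not proof of a problem.

```python
def native_arch_name(capability):
    """Map a (major, minor) compute capability to its sm_XY name."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_native_kernels(capability, arch_list):
    """True if arch_list contains the exact sm_XY entry for this capability."""
    return native_arch_name(capability) in arch_list


def report():
    """Print, per visible GPU, whether this PyTorch build targets it natively."""
    import torch  # imported here so the helpers above work without PyTorch

    arch_list = torch.cuda.get_arch_list()  # e.g. a list of 'sm_XY' strings
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        verdict = "yes" if has_native_kernels(capability, arch_list) else "no"
        print(f"GPU {index}: {native_arch_name(capability)} "
              f"native kernels in this build: {verdict}")


# Pure check with illustrative inputs (not taken from the reporter's machine).
print(has_native_kernels((9, 0), ["sm_80", "sm_90"]))
```

Calling `report()` on the H100 node would show whether `sm_90` is in the build's architecture list. If it is, the hardware target is less likely to be the cause and the rank desynchronization visible in the log is the more promising lead.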

@Ispanicus Ispanicus reopened this Dec 12, 2023