While continuing training of MoE models (loading an existing checkpoint), at some steps an assertion error occurred:
"found NaN in local grad norm in backward pass before data-parallel communication collective"
The assertion is raised from megatron/core/distributed/param_and_grad_buffer.py (line 115 at commit caf2007):
f'Rank {global_rank}: found NaN in local grad norm in '
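For context, that check computes the L2 norm of the local gradient shard and asserts it is finite before the data-parallel communication collective runs. A rough paraphrase of the logic (a sketch, not the exact Megatron source):

```python
import torch

def check_grad_buffer_for_nan(grad_data: torch.Tensor) -> None:
    """Rough paraphrase of the NaN check in param_and_grad_buffer.py;
    not the exact Megatron source code."""
    global_rank = torch.distributed.get_rank()
    norm = grad_data.norm(p=2)  # L2 norm of the whole local gradient shard
    assert not norm.isnan(), (
        f'Rank {global_rank}: found NaN in local grad norm in '
        f'backward pass before data-parallel communication collective'
    )
```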
Main Settings
tp=1, pp=8, ep=2
use_mcore=True
impl=transformer_engine
distributed_optimizer=True
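For concreteness, these settings map onto Megatron-LM command-line arguments roughly as follows (a sketch; flag names are as in recent Megatron-LM versions, and all model/data/optimizer arguments are omitted):

```python
# Hypothetical argument list matching the settings above;
# pass these to pretrain_gpt.py alongside the usual model/data args.
MEGATRON_ARGS = [
    "--tensor-model-parallel-size", "1",    # tp=1
    "--pipeline-model-parallel-size", "8",  # pp=8
    "--expert-model-parallel-size", "2",    # ep=2
    "--use-mcore-models",                   # use_mcore=True
    "--transformer-impl", "transformer_engine",
    "--use-distributed-optimizer",          # distributed_optimizer=True
]
```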
Questions
At step A, an assertion error occurred. However, after resuming training from the latest checkpoint, the assertion error did not happen again at step A (the sample sequence is fixed). Besides, during the resumed run, apart from the loss at the very first step, the losses at all subsequent steps showed tiny numeric differences compared with the original run. Could you explain the reasons?
How can I track down the above NaN error? Could you give me some advice on debugging? Thanks.
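One way to localize the NaN before it reaches the aggregated grad-norm check is to register per-parameter backward hooks that report the first non-finite gradient. A minimal sketch assuming a plain PyTorch module (in Megatron you would install this on the model right after it is built):

```python
import torch

def install_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Register per-parameter hooks that report which gradient
    first contains NaN/Inf values."""
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def make_hook(param_name: str):
            def hook(grad: torch.Tensor):
                if not torch.isfinite(grad).all():
                    rank = (torch.distributed.get_rank()
                            if torch.distributed.is_initialized() else 0)
                    print(f'[rank {rank}] non-finite grad in {param_name}: '
                          f'nan={torch.isnan(grad).sum().item()}, '
                          f'inf={torch.isinf(grad).sum().item()}')
                return grad
            return hook

        param.register_hook(make_hook(name))
```

torch.autograd.set_detect_anomaly(True) is a heavier alternative that traces back to the forward op which produced the NaN, at a substantial slowdown, so it is best reserved for a small reproduction.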
I got the same error when using Megatron to SFT a DeepSeek model. Does anybody know what the problem is?
@D1026 did you train a DeepSeek dense model or a DeepSeek-MoE model?
Often this error is caused by the data.
However, in my case the data seems OK, so I am not sure whether this issue is specific to MoE pretraining.
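To rule data out systematically rather than by inspection, a quick pass over the batches can catch the usual suspects (out-of-range token IDs, all-zero loss masks). This is a hypothetical helper; the batch keys below are assumptions and need adapting to your dataloader:

```python
import torch

def scan_batches_for_bad_data(dataloader, vocab_size: int,
                              max_batches: int = 1000) -> None:
    """Hypothetical sanity check: flag batches with out-of-range token IDs
    or empty loss masks, both of which can yield NaN losses/grads."""
    for step, batch in enumerate(dataloader):
        if step >= max_batches:
            break
        tokens, loss_mask = batch['tokens'], batch['loss_mask']  # assumed keys
        if tokens.min() < 0 or tokens.max() >= vocab_size:
            print(f'step {step}: token id out of range '
                  f'[{tokens.min().item()}, {tokens.max().item()}]')
        if loss_mask.sum() == 0:
            print(f'step {step}: loss mask is all zeros')
```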