While continuing training of MoE models (loading an existing checkpoint), at some steps an assertion error occurred:
"found NaN in local grad norm in backward pass before data-parallel communication collective"
The assertion is raised from megatron/core/distributed/param_and_grad_buffer.py (line 115 at commit caf2007):
f'Rank {global_rank}: found NaN in local grad norm in '
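For context, that check computes the L2 norm of the local gradient shard and asserts it is finite before the data-parallel communication collective runs. A rough paraphrase of the logic (a sketch, not the exact Megatron source):

```python
import torch

def check_grad_buffer_for_nan(grad_data: torch.Tensor) -> None:
    """Rough paraphrase of the NaN check in param_and_grad_buffer.py;
    not the exact Megatron source code."""
    global_rank = torch.distributed.get_rank()
    norm = grad_data.norm(p=2)  # L2 norm of the whole local gradient shard
    assert not norm.isnan(), (
        f'Rank {global_rank}: found NaN in local grad norm in '
        f'backward pass before data-parallel communication collective'
    )
```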
Main Settings
tp=1, pp=8, ep=2
use_mcore=True
impl=transformer_engine
distributed_optimizer=True
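For concreteness, these settings map onto Megatron-LM command-line arguments roughly as follows (a sketch; flag names are as in recent Megatron-LM versions, and all model/data/optimizer arguments are omitted):

```python
# Hypothetical argument list matching the settings above;
# pass these to pretrain_gpt.py alongside the usual model/data args.
MEGATRON_ARGS = [
    "--tensor-model-parallel-size", "1",    # tp=1
    "--pipeline-model-parallel-size", "8",  # pp=8
    "--expert-model-parallel-size", "2",    # ep=2
    "--use-mcore-models",                   # use_mcore=True
    "--transformer-impl", "transformer_engine",
    "--use-distributed-optimizer",          # distributed_optimizer=True
]
```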
Questions
At step A, an assertion error occurred. However, after resuming training from the latest checkpoint, the assertion error did not happen again at step A (the sample sequence is fixed). Besides, during the resumed run, apart from the loss at the very first step, the losses at all subsequent steps showed tiny numeric differences compared with the original run. Could you explain the reasons?
How can I track down the above NaN error? Could you give me some advice on debugging? Thanks.
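One way to localize the NaN before it reaches the aggregated grad-norm check is to register per-parameter backward hooks that report the first non-finite gradient. A minimal sketch assuming a plain PyTorch module (in Megatron you would install this on the model right after it is built):

```python
import torch

def install_nan_grad_hooks(model: torch.nn.Module) -> None:
    """Register per-parameter hooks that report which gradient
    first contains NaN/Inf values."""
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def make_hook(param_name: str):
            def hook(grad: torch.Tensor):
                if not torch.isfinite(grad).all():
                    rank = (torch.distributed.get_rank()
                            if torch.distributed.is_initialized() else 0)
                    print(f'[rank {rank}] non-finite grad in {param_name}: '
                          f'nan={torch.isnan(grad).sum().item()}, '
                          f'inf={torch.isinf(grad).sum().item()}')
                return grad
            return hook

        param.register_hook(make_hook(name))
```

torch.autograd.set_detect_anomaly(True) is a heavier alternative that traces back to the forward op which produced the NaN, at a substantial slowdown, so it is best reserved for a small reproduction.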
I got the same error when using Megatron to SFT a DeepSeek model. Does anybody know what the problem is?
@D1026 did you train a DeepSeek dense model or a DeepSeek-MoE model?
Often this error is caused by the data.
However, in my case the data seems OK, so I am not sure whether this issue is specific to MoE pretraining.
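To rule data out systematically rather than by inspection, a quick pass over the batches can catch the usual suspects (out-of-range token IDs, all-zero loss masks). This is a hypothetical helper; the batch keys below are assumptions and need adapting to your dataloader:

```python
import torch

def scan_batches_for_bad_data(dataloader, vocab_size: int,
                              max_batches: int = 1000) -> None:
    """Hypothetical sanity check: flag batches with out-of-range token IDs
    or empty loss masks, both of which can yield NaN losses/grads."""
    for step, batch in enumerate(dataloader):
        if step >= max_batches:
            break
        tokens, loss_mask = batch['tokens'], batch['loss_mask']  # assumed keys
        if tokens.min() < 0 or tokens.max() >= vocab_size:
            print(f'step {step}: token id out of range '
                  f'[{tokens.min().item()}, {tokens.max().item()}]')
        if loss_mask.sum() == 0:
            print(f'step {step}: loss mask is all zeros')
```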