
Nan in training #345

Open
zpphigh opened this issue Feb 7, 2021 · 2 comments

zpphigh commented Feb 7, 2021

Loss is NaN in training based on [openpose] + [VGG19] after 6800 iterations.
Loss is NaN in training based on [openpose] + [Resnet18] after 74300 iterations, as shown below:

Train iteration 74300 / 1000000: Learning rate 9.999999747378752e-05 total_loss:51.99763870239258, conf_loss:25.99869155883789, paf_loss:74.33314514160156, l2_loss 1.8317127227783203 stage_num:6 time:0.0002014636993408203
stage_0 conf_loss:27.358610153198242 paf_loss:78.49413299560547
stage_1 conf_loss:26.15711784362793 paf_loss:74.4624252319336
stage_2 conf_loss:25.707374572753906 paf_loss:73.61676788330078
stage_3 conf_loss:25.627124786376953 paf_loss:73.40571594238281
stage_4 conf_loss:25.598114013671875 paf_loss:73.17063903808594
stage_5 conf_loss:25.543825149536133 paf_loss:72.84919738769531
Train iteration 74400 / 1000000: Learning rate 9.999999747378752e-05 total_loss:1148700065792.0, conf_loss:1743442149376.0, paf_loss:553957654528.0, l2_loss 1.8390066623687744 stage_num:6 time:0.00020194053649902344
stage_0 conf_loss:57.1938591003418 paf_loss:109.35887145996094
stage_1 conf_loss:1848.776611328125 paf_loss:3432.841552734375
stage_2 conf_loss:17413386.0 paf_loss:1569305.875
stage_3 conf_loss:135925504.0 paf_loss:4067570176.0
stage_4 conf_loss:2292520058880.0 paf_loss:300717211648.0
stage_5 conf_loss:8167979745280.0 paf_loss:3018960142336.0
Train iteration 74500 / 1000000: Learning rate 9.999999747378752e-05 total_loss:nan, conf_loss:nan, paf_loss:nan, l2_loss nan stage_num:6 time:0.0002295970916748047
stage_0 conf_loss:nan paf_loss:nan
stage_1 conf_loss:13212.9375 paf_loss:nan
stage_2 conf_loss:40226508.0 paf_loss:nan
stage_3 conf_loss:2792057995264.0 paf_loss:nan
stage_4 conf_loss:nan paf_loss:nan
stage_5 conf_loss:nan paf_loss:2.488148857507021e+16
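
Side note: a minimal, framework-agnostic sketch of a divergence guard that would stop a run at the first bad step (e.g. iteration 74400 above) instead of letting it continue into NaN. The function and variable names are placeholders, not part of hyperpose:

```python
import math

def check_loss(step, total_loss, explode_threshold=1e6):
    """Abort as soon as the loss diverges, so the run stops at the first
    bad step instead of silently continuing into NaN."""
    if math.isnan(total_loss) or math.isinf(total_loss) or total_loss > explode_threshold:
        raise RuntimeError(f"Loss diverged at step {step}: total_loss={total_loss}")

# usage inside a training loop (names are placeholders):
# check_loss(step, float(total_loss))
```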

@lengyuner

I have the same problem:

Openpose + Resnet18
Loss becomes NaN after iteration 1400.

Gyx-One (Contributor) commented Nov 13, 2021

Hello! @zpphigh @lengyuner
The NaN in training is due to the model parameter initialization. A poor parameter initialization can lead to divergence during training.

For [openpose] + [VGG19]:
Use the pretrained VGG-19 backbone linked here and put it under save_dir/pretrain_backbone so that hyperpose can load the pretrained backbone successfully during training (a quick path check is sketched below).
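
A minimal sketch (not hyperpose's own code) of checking that the backbone file sits in the expected save_dir/pretrain_backbone location before launching training; the save_dir value and the vgg19.npz file name below are assumptions, so adjust them to your download and config:

```python
import os

# Assumed locations -- adjust to your setup; the file name is a guess,
# use whatever the downloaded checkpoint is actually called.
save_dir = "./save_dir/default_name"
backbone_dir = os.path.join(save_dir, "pretrain_backbone")
backbone_path = os.path.join(backbone_dir, "vgg19.npz")  # hypothetical file name

os.makedirs(backbone_dir, exist_ok=True)
if not os.path.isfile(backbone_path):
    print(f"Pretrained backbone not found at {backbone_path}; "
          "training will start from random init and may diverge (NaN).")
else:
    print(f"Found pretrained backbone: {backbone_path}")
```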

For [openpose] + [Resnet18]:
Sorry, I haven't tried this setting before, so I currently don't have a pretrained [Resnet18] backbone. If time permits, a pretrained [Resnet18] backbone will be released.
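
Until a pretrained [Resnet18] backbone is available, one generic mitigation is gradient clipping, so a single bad batch cannot blow up the weights. A minimal TensorFlow sketch, assuming a custom training loop; `model`, `compute_loss`, and `batch` are placeholders, not hyperpose internals:

```python
import tensorflow as tf

# Generic gradient-clipping sketch; not hyperpose's actual training step.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(model, batch, compute_loss):
    with tf.GradientTape() as tape:
        loss = compute_loss(model, batch)
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip by global norm so one exploding batch cannot destroy the weights.
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```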
