
Nan in training #345

Open
zpphigh opened this issue Feb 7, 2021 · 2 comments

zpphigh commented Feb 7, 2021

Loss is NaN in training based on [openpose] + [VGG19] after 6800 iterations.
Loss is NaN in training based on [openpose] + [Resnet18] after 74300 iterations, as shown below:

Train iteration 74300 / 1000000: Learning rate 9.999999747378752e-05 total_loss:51.99763870239258, conf_loss:25.99869155883789, paf_loss:74.33314514160156, l2_loss 1.8317127227783203 stage_num:6 time:0.0002014636993408203
stage_0 conf_loss:27.358610153198242 paf_loss:78.49413299560547
stage_1 conf_loss:26.15711784362793 paf_loss:74.4624252319336
stage_2 conf_loss:25.707374572753906 paf_loss:73.61676788330078
stage_3 conf_loss:25.627124786376953 paf_loss:73.40571594238281
stage_4 conf_loss:25.598114013671875 paf_loss:73.17063903808594
stage_5 conf_loss:25.543825149536133 paf_loss:72.84919738769531
Train iteration 74400 / 1000000: Learning rate 9.999999747378752e-05 total_loss:1148700065792.0, conf_loss:1743442149376.0, paf_loss:553957654528.0, l2_loss 1.8390066623687744 stage_num:6 time:0.00020194053649902344
stage_0 conf_loss:57.1938591003418 paf_loss:109.35887145996094
stage_1 conf_loss:1848.776611328125 paf_loss:3432.841552734375
stage_2 conf_loss:17413386.0 paf_loss:1569305.875
stage_3 conf_loss:135925504.0 paf_loss:4067570176.0
stage_4 conf_loss:2292520058880.0 paf_loss:300717211648.0
stage_5 conf_loss:8167979745280.0 paf_loss:3018960142336.0
Train iteration 74500 / 1000000: Learning rate 9.999999747378752e-05 total_loss:nan, conf_loss:nan, paf_loss:nan, l2_loss nan stage_num:6 time:0.0002295970916748047
stage_0 conf_loss:nan paf_loss:nan
stage_1 conf_loss:13212.9375 paf_loss:nan
stage_2 conf_loss:40226508.0 paf_loss:nan
stage_3 conf_loss:2792057995264.0 paf_loss:nan
stage_4 conf_loss:nan paf_loss:nan
stage_5 conf_loss:nan paf_loss:2.488148857507021e+16
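
Side note: a minimal, framework-agnostic sketch of a divergence guard that would stop a run at the first bad step (e.g. iteration 74400 above) instead of letting it continue into NaN. The function and variable names are placeholders, not part of hyperpose:

```python
import math

def check_loss(step, total_loss, explode_threshold=1e6):
    """Abort as soon as the loss diverges, so the run stops at the first
    bad step instead of silently continuing into NaN."""
    if math.isnan(total_loss) or math.isinf(total_loss) or total_loss > explode_threshold:
        raise RuntimeError(f"Loss diverged at step {step}: total_loss={total_loss}")

# usage inside a training loop (names are placeholders):
# check_loss(step, float(total_loss))
```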

@lengyuner

I have the same problem:

Openpose + Resnet18
Loss becomes NaN after iteration 1400.

Gyx-One (Contributor) commented Nov 13, 2021

Hello! @zpphigh @lengyuner
The NaN in training is due to the model parameter initialization. A poor parameter initialization can lead to divergence during training.

For [openpose] + [VGG19]:
Use the pretrained VGG-19 backbone linked here and put it under save_dir/pretrain_backbone so that hyperpose can load the pretrained backbone successfully during training (a quick path check is sketched below).
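
A minimal sketch (not hyperpose's own code) of checking that the backbone file sits in the expected save_dir/pretrain_backbone location before launching training; the save_dir value and the vgg19.npz file name below are assumptions, so adjust them to your download and config:

```python
import os

# Assumed locations -- adjust to your setup; the file name is a guess,
# use whatever the downloaded checkpoint is actually called.
save_dir = "./save_dir/default_name"
backbone_dir = os.path.join(save_dir, "pretrain_backbone")
backbone_path = os.path.join(backbone_dir, "vgg19.npz")  # hypothetical file name

os.makedirs(backbone_dir, exist_ok=True)
if not os.path.isfile(backbone_path):
    print(f"Pretrained backbone not found at {backbone_path}; "
          "training will start from random init and may diverge (NaN).")
else:
    print(f"Found pretrained backbone: {backbone_path}")
```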

For [openpose] + [Resnet18]:
Sorry, I haven't tried this setting before, so I currently don't have a pretrained [Resnet18] backbone. If time permits, a pretrained [Resnet18] backbone will be released.
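
Until a pretrained [Resnet18] backbone is available, one generic mitigation is gradient clipping, so a single bad batch cannot blow up the weights. A minimal TensorFlow sketch, assuming a custom training loop; `model`, `compute_loss`, and `batch` are placeholders, not hyperpose internals:

```python
import tensorflow as tf

# Generic gradient-clipping sketch; not hyperpose's actual training step.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(model, batch, compute_loss):
    with tf.GradientTape() as tape:
        loss = compute_loss(model, batch)
    grads = tape.gradient(loss, model.trainable_variables)
    # Clip by global norm so one exploding batch cannot destroy the weights.
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```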
