
training failed in SRT heatmaps? #80

Open
Dian-Yi opened this issue Jan 5, 2021 · 1 comment

Dian-Yi commented Jan 5, 2021

Which project are you using?

SRT

There are some problems when training heatmaps with the ProCPM model. I changed your model backbone and trained a new model without pretrained weights. The training loss log looks normal, but the test_300w NME is always 166.901.
When I visualize the batch_heatmaps, only the background map is learned; every predicted foreground map is always all zeros.
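For reference, this is roughly how I inspect the predicted maps (a minimal sketch, assuming batch_heatmaps is a torch tensor of shape (N, num_pts + 1, H, W) with the background as the last channel):

```python
import torch

def summarize_heatmaps(batch_heatmaps: torch.Tensor) -> None:
    # Print the peak and mean response of every channel; in my case only the
    # background channel shows a non-zero response.
    for c in range(batch_heatmaps.size(1)):
        channel = batch_heatmaps[:, c]
        print(f"channel {c:02d}: max={channel.max().item():.4f} "
              f"mean={channel.mean().item():.6f}")
```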

logs:
batch_size : 128
optimizer : sgd
LR : 0.0005
momentum : 0.9
Decay : 0.0005
nesterov : 1
criterion_ht : MSE-batch
epochs : 150
schedule : [60, 90, 120]
gamma : 0.1
pre_crop : 0.2
scale_min : 0.9
scale_max : 1.1
shear_max : 0.2
offset_max : 0.2
rotate_max : 30
cut_out : 0.1
sigma : 4
shape : [256, 256]
heatmap_type : gaussian
pixel_jitter_max : 20
downsample : 8
num_pts : 68

Training-data : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=31528, cutout=0.1, dataset=train)
Testing-data : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=689, cutout=0, dataset=test_300w)
Optimizer : SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.0005
lr: 0.0005
momentum: 0.9
nesterov: 1
weight_decay: 0.0005
)
MSE Loss with reduction=['MSE', 'batch']
=> do not find the last-info file : ../snopshots/last-info.pth

==>>[2021-01-04 10:47:29] [epoch-000-150], [[Time Left: 00:00:00]], LR : [0.00050 ~ 0.00050], Config : {'epochs': 150, 'num_pts': 68, 'sigma': 4, 'print_freq': 10, 'downsample': 8, 'shape': [256, 256]}
-->[train]: [epoch-000-150][000/247] Time 16.47 (16.47) Data 7.20 (7.20) Forward 14.50 (14.50) Loss_all 104426.6016 (104426.6016) [Time Left: 01:07:30] ht_loss=104426.6016 : L1=35482.0781 : L2=35589.6680 : L3=33354.8516
-->[train]: [epoch-000-150][010/247] Time 1.11 (2.38) Data 0.16 (0.67) Forward 0.20 (1.38) Loss_all 279.1572 (11552.4368) [Time Left: 00:09:21] ht_loss=279.1572 : L1=93.0946 : L2=93.0299 : L3=93.0327
-->[train]: [epoch-000-150][020/247] Time 4.60 (2.21) Data 3.63 (0.87) Forward 3.68 (1.26) Loss_all 279.8896 (6183.9105) [Time Left: 00:08:20] ht_loss=279.8896 : L1=93.3314 : L2=93.2782 : L3=93.2800
-->[train]: [epoch-000-150][030/247] Time 0.95 (2.04) Data 0.00 (0.82) Forward 0.05 (1.10) Loss_all 279.0271 (4278.9738) [Time Left: 00:07:20] ht_loss=279.0271 : L1=93.0395 : L2=92.9885 : L3=92.9991
-->[train]: [epoch-000-150][040/247] Time 4.48 (2.12) Data 3.52 (0.96) Forward 3.56 (1.19) Loss_all 279.3033 (3303.2060) [Time Left: 00:07:16] ht_loss=279.3033 : L1=93.0816 : L2=93.1021 : L3=93.1196
-->[train]: [epoch-000-150][050/247] Time 0.96 (2.03) Data 0.00 (0.92) Forward 0.05 (1.11) Loss_all 278.0408 (2710.1125) [Time Left: 00:06:38] ht_loss=278.0408 : L1=92.7466 : L2=92.6507 : L3=92.6435
-->[train]: [epoch-000-150][060/247] Time 4.94 (2.13) Data 3.96 (1.04) Forward 4.00 (1.21) Loss_all 279.6303 (2311.4327) [Time Left: 00:06:36] ht_loss=279.6303 : L1=93.1926 : L2=93.2191 : L3=93.2186
-->[train]: [epoch-000-150][070/247] Time 0.98 (2.08) Data 0.00 (1.00) Forward 0.05 (1.16) Loss_all 277.1473 (2025.1064) [Time Left: 00:06:05] ht_loss=277.1473 : L1=92.4281 : L2=92.3588 : L3=92.3605
-->[train]: [epoch-000-150][080/247] Time 4.86 (2.10) Data 3.87 (1.04) Forward 3.91 (1.18) Loss_all 278.3179 (1809.4039) [Time Left: 00:05:48] ht_loss=278.3179 : L1=92.7645 : L2=92.7693 : L3=92.7842
-->[train]: [epoch-000-150][090/247] Time 0.98 (2.08) Data 0.00 (1.03) Forward 0.05 (1.16) Loss_all 277.4024 (1641.1797) [Time Left: 00:05:24] ht_loss=277.4024 : L1=92.4464 : L2=92.4706 : L3=92.4854
-->[train]: [epoch-000-150][100/247] Time 6.01 (2.09) Data 5.07 (1.05) Forward 5.11 (1.17) Loss_all 277.7260 (1506.2978) [Time Left: 00:05:05] ht_loss=277.7260 : L1=92.5351 : L2=92.5909 : L3=92.6000
-->[train]: [epoch-000-150][110/247] Time 0.98 (2.08) Data 0.00 (1.04) Forward 0.05 (1.15) Loss_all 278.5226 (1395.6688) [Time Left: 00:04:42] ht_loss=278.5226 : L1=92.8076 : L2=92.8527 : L3=92.8623
-->[train]: [epoch-000-150][120/247] Time 4.90 (2.08) Data 3.86 (1.05) Forward 3.98 (1.16) Loss_all 277.3715 (1303.3518) [Time Left: 00:04:22] ht_loss=277.3715 : L1=92.4822 : L2=92.4495 : L3=92.4398
-->[train]: [epoch-000-150][130/247] Time 1.01 (2.06) Data 0.00 (1.03) Forward 0.05 (1.14) Loss_all 277.4960 (1225.1186) [Time Left: 00:03:59] ht_loss=277.4960 : L1=92.5309 : L2=92.4777 : L3=92.4875
-->[train]: [epoch-000-150][140/247] Time 4.89 (2.07) Data 3.85 (1.04) Forward 3.89 (1.14) Loss_all 277.4550 (1157.9637) [Time Left: 00:03:39] ht_loss=277.4550 : L1=92.4373 : L2=92.4946 : L3=92.5231
-->[train]: [epoch-000-150][150/247] Time 0.97 (2.07) Data 0.00 (1.04) Forward 0.05 (1.14) Loss_all 277.0612 (1099.6810) [Time Left: 00:03:18] ht_loss=277.0612 : L1=92.3476 : L2=92.3554 : L3=92.3583
-->[train]: [epoch-000-150][160/247] Time 4.87 (2.09) Data 3.86 (1.06) Forward 3.90 (1.16) Loss_all 278.3372 (1048.6711) [Time Left: 00:02:59] ht_loss=278.3372 : L1=92.7852 : L2=92.7728 : L3=92.7792
-->[train]: [epoch-000-150][170/247] Time 1.05 (2.07) Data 0.00 (1.05) Forward 0.05 (1.14) Loss_all 278.7277 (1003.6236) [Time Left: 00:02:37] ht_loss=278.7277 : L1=92.8838 : L2=92.9143 : L3=92.9296
-->[train]: [epoch-000-150][180/247] Time 5.15 (2.08) Data 4.04 (1.06) Forward 4.21 (1.15) Loss_all 278.2527 (963.5538) [Time Left: 00:02:17] ht_loss=278.2527 : L1=92.8155 : L2=92.7081 : L3=92.7291
-->[train]: [epoch-000-150][190/247] Time 1.03 (2.05) Data 0.00 (1.04) Forward 0.05 (1.12) Loss_all 277.6393 (927.6484) [Time Left: 00:01:55] ht_loss=277.6393 : L1=92.5422 : L2=92.5445 : L3=92.5525
-->[train]: [epoch-000-150][200/247] Time 4.82 (2.08) Data 3.81 (1.06) Forward 3.86 (1.15) Loss_all 278.6940 (895.3700) [Time Left: 00:01:35] ht_loss=278.6940 : L1=92.8379 : L2=92.9121 : L3=92.9440
-->[train]: [epoch-000-150][210/247] Time 1.01 (2.07) Data 0.00 (1.05) Forward 0.05 (1.14) Loss_all 278.6920 (866.1021) [Time Left: 00:01:14] ht_loss=278.6920 : L1=92.8799 : L2=92.8969 : L3=92.9152
-->[train]: [epoch-000-150][220/247] Time 4.31 (2.09) Data 3.27 (1.07) Forward 3.32 (1.15) Loss_all 277.6641 (839.5094) [Time Left: 00:00:54] ht_loss=277.6641 : L1=92.5900 : L2=92.5280 : L3=92.5461
-->[train]: [epoch-000-150][230/247] Time 0.98 (2.08) Data 0.00 (1.06) Forward 0.05 (1.14) Loss_all 277.3657 (815.2032) [Time Left: 00:00:33] ht_loss=277.3657 : L1=92.4130 : L2=92.4656 : L3=92.4871
-->[train]: [epoch-000-150][240/247] Time 5.31 (2.09) Data 4.35 (1.07) Forward 4.40 (1.16) Loss_all 276.9100 (792.9332) [Time Left: 00:00:12] ht_loss=276.9100 : L1=92.2631 : L2=92.3141 : L3=92.3328
-->[train]: [epoch-000-150][246/247] Time 2.29 (2.09) Data 0.00 (1.07) Forward 1.03 (1.15) Loss_all 278.2914 (781.8199) [Time Left: 00:00:00] ht_loss=278.2914 : L1=92.7296 : L2=92.7785 : L3=92.7834
Eval dataset length 31528, labeled data length 31528
Compute NME for 31528 images with 68 points :: [(nms): mean=164.630, std=33.857]
==>>[2021-01-04 10:56:13] Train [epoch-000-150] Average Loss = 781.819878, NME = 164.63
save checkpoint into ../snopshots/checkpoint/HEATMAP-epoch-000-150.pth
save checkpoint into ../snopshots/last-info.pth

==>>[2021-01-04 10:56:13] [epoch-001-150], [[Time Left: 21:42:18]], LR : [0.00050 ~ 0.00050], Config : {'epochs': 150, 'num_pts': 68, 'sigma': 4, 'print_freq': 10, 'downsample': 8, 'shape': [256, 256]}
-->[train]: [epoch-001-150][000/247] Time 8.17 (8.17) Data 7.04 (7.04) Forward 7.16 (7.16) Loss_all 278.1276 (278.1276) [Time Left: 00:33:30] ht_loss=278.1276 : L1=92.7402 : L2=92.6863 : L3=92.7011
-->[train]: [epoch-001-150][010/247] Time 1.02 (2.70) Data 0.00 (1.68) Forward 0.05 (1.74) Loss_all 277.0975 (278.3596) [Time Left: 00:10:37] ht_loss=277.0975 : L1=92.3605 : L2=92.3538 : L3=92.3832
-->[train]: [epoch-001-150][020/247] Time 4.71 (2.55) Data 3.67 (1.54) Forward 3.72 (1.59) Loss_all 277.3455 (278.1147) [Time Left: 00:09:37] ht_loss=277.3455 : L1=92.4743 : L2=92.4308 : L3=92.4404
-->[train]: [epoch-001-150][030/247] Time 1.08 (2.31) Data 0.00 (1.29) Forward 0.11 (1.34) Loss_all 278.7683 (278.1484) [Time Left: 00:08:18] ht_loss=278.7683 : L1=92.9498 : L2=92.9005 : L3=92.9180
-->[train]: [epoch-001-150][040/247] Time 7.17 (2.35) Data 6.17 (1.33) Forward 6.22 (1.38) Loss_all 278.2269 (278.1230) [Time Left: 00:08:04] ht_loss=278.2269 : L1=92.6964 : L2=92.7372 : L3=92.7933
-->[train]: [epoch-001-150][050/247] Time 1.03 (2.30) Data 0.00 (1.27) Forward 0.05 (1.32) Loss_all 276.7727 (278.0273) [Time Left: 00:07:29] ht_loss=276.7727 : L1=92.2674 : L2=92.2390 : L3=92.2664
-->[train]: [epoch-001-150][060/247] Time 5.33 (2.29) Data 4.27 (1.26) Forward 4.32 (1.32) Loss_all 277.9707 (277.9931) [Time Left: 00:07:06] ht_loss=277.9707 : L1=92.5334 : L2=92.6936 : L3=92.7436
-->[train]: [epoch-001-150][070/247] Time 1.08 (2.23) Data 0.00 (1.20) Forward 0.07 (1.25) Loss_all 276.6379 (277.9465) [Time Left: 00:06:32] ht_loss=276.6379 : L1=92.0694 : L2=92.2572 : L3=92.3113
-->[train]: [epoch-001-150][080/247] Time 5.25 (2.25) Data 4.22 (1.21) Forward 4.27 (1.26) Loss_all 278.9980 (277.9131) [Time Left: 00:06:12] ht_loss=278.9980 : L1=92.9393 : L2=93.0119 : L3=93.0468
-->[train]: [epoch-001-150][090/247] Time 1.11 (2.20) Data 0.00 (1.16) Forward 0.06 (1.22) Loss_all 278.4267 (277.9751) [Time Left: 00:05:43] ht_loss=278.4267 : L1=92.7253 : L2=92.8161 : L3=92.8853
-->[train]: [epoch-001-150][100/247] Time 3.97 (2.22) Data 2.93 (1.18) Forward 2.98 (1.23) Loss_all 277.6642 (277.8997) [Time Left: 00:05:23] ht_loss=277.6642 : L1=92.5115 : L2=92.5662 : L3=92.5865
-->[train]: [epoch-001-150][110/247] Time 1.00 (2.21) Data 0.00 (1.17) Forward 0.04 (1.22) Loss_all 278.4783 (277.8227) [Time Left: 00:05:00] ht_loss=278.4783 : L1=92.5632 : L2=92.6771 : L3=93.2380
-->[train]: [epoch-001-150][120/247] Time 1.53 (2.18) Data 0.52 (1.14) Forward 0.56 (1.20) Loss_all 278.9060 (277.7973) [Time Left: 00:04:34] ht_loss=278.9060 : L1=92.9879 : L2=92.9261 : L3=92.9920
-->[train]: [epoch-001-150][130/247] Time 1.04 (2.20) Data 0.00 (1.16) Forward 0.07 (1.22) Loss_all 275.1124 (277.7316) [Time Left: 00:04:15] ht_loss=275.1124 : L1=91.6005 : L2=91.7149 : L3=91.7970
-->[train]: [epoch-001-150][140/247] Time 1.36 (2.19) Data 0.35 (1.15) Forward 0.39 (1.21) Loss_all 277.0453 (277.6980) [Time Left: 00:03:52] ht_loss=277.0453 : L1=92.2058 : L2=92.3897 : L3=92.4498
-->[train]: [epoch-001-150][150/247] Time 1.05 (2.20) Data 0.00 (1.16) Forward 0.05 (1.22) Loss_all 277.4876 (277.6891) [Time Left: 00:03:31] ht_loss=277.4876 : L1=92.4398 : L2=92.4923 : L3=92.5554
-->[train]: [epoch-001-150][160/247] Time 1.01 (2.19) Data 0.00 (1.15) Forward 0.06 (1.20) Loss_all 276.7957 (277.6487) [Time Left: 00:03:08] ht_loss=276.7957 : L1=92.1740 : L2=92.2719 : L3=92.3497
-->[train]: [epoch-001-150][170/247] Time 1.07 (2.21) Data 0.00 (1.17) Forward 0.05 (1.23) Loss_all 274.4813 (277.5769) [Time Left: 00:02:48] ht_loss=274.4813 : L1=91.2472 : L2=91.5625 : L3=91.6715
-->[train]: [epoch-001-150][180/247] Time 4.76 (2.20) Data 3.73 (1.16) Forward 3.78 (1.22) Loss_all 277.1411 (277.5542) [Time Left: 00:02:25] ht_loss=277.1411 : L1=92.2720 : L2=92.3849 : L3=92.4842
-->[train]: [epoch-001-150][190/247] Time 1.03 (2.19) Data 0.00 (1.15) Forward 0.06 (1.20) Loss_all 276.4174 (277.5101) [Time Left: 00:02:02] ht_loss=276.4174 : L1=91.9825 : L2=92.1831 : L3=92.2517
-->[train]: [epoch-001-150][200/247] Time 1.08 (2.18) Data 0.00 (1.14) Forward 0.08 (1.20) Loss_all 276.7092 (277.4698) [Time Left: 00:01:40] ht_loss=276.7092 : L1=92.0072 : L2=92.3023 : L3=92.3998
-->[train]: [epoch-001-150][210/247] Time 1.01 (2.19) Data 0.00 (1.15) Forward 0.05 (1.21) Loss_all 277.3184 (277.4343) [Time Left: 00:01:18] ht_loss=277.3184 : L1=92.2769 : L2=92.4722 : L3=92.5693
-->[train]: [epoch-001-150][220/247] Time 2.94 (2.19) Data 1.92 (1.15) Forward 1.96 (1.20) Loss_all 275.9615 (277.3970) [Time Left: 00:00:56] ht_loss=275.9615 : L1=91.9907 : L2=91.9326 : L3=92.0382
-->[train]: [epoch-001-150][230/247] Time 1.04 (2.19) Data 0.00 (1.15) Forward 0.07 (1.20) Loss_all 276.7830 (277.3277) [Time Left: 00:00:35] ht_loss=276.7830 : L1=91.9892 : L2=92.3281 : L3=92.4656
-->[train]: [epoch-001-150][240/247] Time 4.19 (2.20) Data 3.19 (1.16) Forward 3.24 (1.22) Loss_all 275.8693 (277.2501) [Time Left: 00:00:13] ht_loss=275.8693 : L1=91.6268 : L2=92.0341 : L3=92.2084
-->[train]: [epoch-001-150][246/247] Time 0.37 (2.19) Data 0.00 (1.15) Forward 0.04 (1.21) Loss_all 277.2675 (277.2426) [Time Left: 00:00:00] ht_loss=277.2675 : L1=92.4372 : L2=92.3524 : L3=92.4779
Eval dataset length 31528, labeled data length 31528
Compute NME for 31528 images with 68 points :: [(nms): mean=165.022, std=33.734]
==>>[2021-01-04 11:05:24] Train [epoch-001-150] Average Loss = 277.242612, NME = 165.02
save checkpoint into ../snopshots/checkpoint/HEATMAP-epoch-001-150.pth
save checkpoint into ../snopshots/last-info.pth
Basic-Eval-All evaluates 1 dataset
==>>[2021-01-04 11:05:24], [epoch-001-150], evaluate the 0/1-th dataset [image] : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=689, cutout=0, dataset=test_300w)
-->[test]: [epoch-001-150][000/006] Time 6.60 (6.60) Data 6.19 (6.19) Forward 6.22 (6.22) Loss_all 280.0911 (280.0911) [Time Left: 00:00:33] ht_loss=280.0911 : L1=93.1269 : L2=93.3953 : L3=93.5689
-->[test]: [epoch-001-150][005/006] Time 1.23 (1.57) Data 0.00 (1.03) Forward 1.14 (1.25) Loss_all 280.1683 (279.7528) [Time Left: 00:00:00] ht_loss=280.1683 : L1=93.1425 : L2=93.4033 : L3=93.6225
Eval dataset length 689, labeled data length 689
Compute NME for 689 images with 68 points :: [(nms): mean=166.901, std=24.249]
NME Results :
->test_300w : NME = 166.901,

Dian-Yi commented Jan 7, 2021

I did some experiments to find the problem.
Here is a sample of the heatmap labels produced by generate_label_map(pts, height, width, sigma, downsample, nopoints, ctype), with sigma=4 and downsample=8.
The number of values > 0 in each map:
[ 7, 6, 8, 7, 7, 7, 7, 7, 7, 8, 7, 8,
7, 7, 7, 7, 8, 8, 7, 8, 7, 7, 8, 8,
8, 7, 8, 7, 7, 7, 7, 8, 7, 7, 7, 7,
7, 7, 7, 8, 7, 7, 7, 7, 7, 8, 6, 8,
7, 7, 8, 7, 8, 8, 7, 7, 6, 7, 7, 8,
8, 8, 7, 8, 7, 7, 8, 7, 1024]
The max value of each map:
[0.8979, 0.8032, 0.6506, 0.6165, 0.6372, 0.4791, 0.5207, 0.4606, 0.8633,
0.8925, 0.8269, 0.7173, 0.8287, 0.5979, 0.5846, 0.6472, 0.7989, 0.7900,
0.9217, 0.6162, 0.6745, 0.5610, 0.8472, 0.8779, 0.0000, 0.9160, 0.5829,
0.4558, 0.6468, 0.4131, 0.6791, 0.7397, 0.8015, 0.6352, 0.7229, 0.6346,
0.4943, 0.6630, 0.8280, 0.6312, 0.9719, 0.9219, 0.7392, 0.7575, 0.6396,
0.9408, 0.8714, 0.8833, 0.6204, 0.6979, 0.6662, 0.8334, 0.7105, 0.7566,
0.9396, 0.8781, 0.6446, 0.9268, 0.7921, 0.9549, 0.5869, 0.9428, 0.8311,
0.9922, 0.5339, 0.8815, 0.9315, 0.9609, 1.0000]
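For reference, a minimal sketch of how I compute these statistics (assuming the label map is a torch tensor of shape (num_pts + 1, H, W); adapt accordingly if generate_label_map returns a numpy array):

```python
import torch

def label_map_stats(heatmaps: torch.Tensor):
    # Count of pixels > 0 per map, and the peak value per map.
    nonzero_counts = (heatmaps > 0).flatten(1).sum(dim=1)
    max_values = heatmaps.flatten(1).max(dim=1).values
    return nonzero_counts, max_values
```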

You can see that the number of non-zero pixels in the background label is 32*32 = 1024, while each foreground map only has 6-8, so there is a severe imbalance between background and foreground. I changed the generate_label_map function and added the Adaptive Wing loss together with loss weight maps, both taken from the Adaptive Wing Loss paper. After that, my model trains well; a rough sketch of what I added is below.
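This is roughly the loss I switched to (my own re-implementation of the Adaptive Wing loss with a weight map, using the default hyper-parameters from that paper; not code from this repo):

```python
import torch
import torch.nn.functional as F

def adaptive_wing_loss(pred, target, alpha=2.1, omega=14.0, epsilon=1.0, theta=0.5):
    # Element-wise Adaptive Wing loss between predicted and target heatmaps.
    delta = (pred - target).abs()
    A = omega * (1.0 / (1.0 + (theta / epsilon) ** (alpha - target))) \
        * (alpha - target) * ((theta / epsilon) ** (alpha - target - 1.0)) / epsilon
    C = theta * A - omega * torch.log(1.0 + (theta / epsilon) ** (alpha - target))
    return torch.where(delta < theta,
                       omega * torch.log(1.0 + (delta / epsilon) ** (alpha - target)),
                       A * delta - C)

def weighted_awing_loss(pred, target, dilation=3, w=10.0):
    # Weight map: dilate the foreground region of the ground-truth heatmaps and
    # up-weight it so the 32x32 background no longer dominates the loss.
    mask = (target > 0.2).float()
    mask = F.max_pool2d(mask, kernel_size=dilation, stride=1, padding=dilation // 2)
    weight_map = 1.0 + w * mask
    return (adaptive_wing_loss(pred, target) * weight_map).mean()
```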
Have you ever encountered this problem when training the ProCPM model?
