
training failed in SRT heatmaps? #80

Open
Dian-Yi opened this issue Jan 5, 2021 · 1 comment

Dian-Yi commented Jan 5, 2021

Which project are you using?

SRT

There are some problems when training heatmaps with the ProCPM model. I changed your model backbone and trained a new model without pretrained weights. The training loss log looks normal, but the test_300w NME is always 166.901.
When I visualize the batch_heatmaps, only the background map is learned; every predicted foreground map is always all zeros.
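For reference, this is roughly how I inspect the predicted maps (a minimal sketch, assuming batch_heatmaps is a torch tensor of shape (N, num_pts + 1, H, W) with the background as the last channel):

```python
import torch

def summarize_heatmaps(batch_heatmaps: torch.Tensor) -> None:
    # Print the peak and mean response of every channel; in my case only the
    # background channel shows a non-zero response.
    for c in range(batch_heatmaps.size(1)):
        channel = batch_heatmaps[:, c]
        print(f"channel {c:02d}: max={channel.max().item():.4f} "
              f"mean={channel.mean().item():.6f}")
```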

logs:
batch_size : 128
optimizer : sgd
LR : 0.0005
momentum : 0.9
Decay : 0.0005
nesterov : 1
criterion_ht : MSE-batch
epochs : 150
schedule : [60, 90, 120]
gamma : 0.1
pre_crop : 0.2
scale_min : 0.9
scale_max : 1.1
shear_max : 0.2
offset_max : 0.2
rotate_max : 30
cut_out : 0.1
sigma : 4
shape : [256, 256]
heatmap_type : gaussian
pixel_jitter_max : 20
downsample : 8
num_pts : 68

Training-data : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=31528, cutout=0.1, dataset=train)
Testing-data : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=689, cutout=0, dataset=test_300w)
Optimizer : SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.0005
lr: 0.0005
momentum: 0.9
nesterov: 1
weight_decay: 0.0005
)
MSE Loss with reduction=['MSE', 'batch']
=> do not find the last-info file : ../snopshots/last-info.pth

==>>[2021-01-04 10:47:29] [epoch-000-150], [[Time Left: 00:00:00]], LR : [0.00050 ~ 0.00050], Config : {'epochs': 150, 'num_pts': 68, 'sigma': 4, 'print_freq': 10, 'downsample': 8, 'shape': [256, 256]}
-->[train]: [epoch-000-150][000/247] Time 16.47 (16.47) Data 7.20 (7.20) Forward 14.50 (14.50) Loss_all 104426.6016 (104426.6016) [Time Left: 01:07:30] ht_loss=104426.6016 : L1=35482.0781 : L2=35589.6680 : L3=33354.8516
-->[train]: [epoch-000-150][010/247] Time 1.11 (2.38) Data 0.16 (0.67) Forward 0.20 (1.38) Loss_all 279.1572 (11552.4368) [Time Left: 00:09:21] ht_loss=279.1572 : L1=93.0946 : L2=93.0299 : L3=93.0327
-->[train]: [epoch-000-150][020/247] Time 4.60 (2.21) Data 3.63 (0.87) Forward 3.68 (1.26) Loss_all 279.8896 (6183.9105) [Time Left: 00:08:20] ht_loss=279.8896 : L1=93.3314 : L2=93.2782 : L3=93.2800
-->[train]: [epoch-000-150][030/247] Time 0.95 (2.04) Data 0.00 (0.82) Forward 0.05 (1.10) Loss_all 279.0271 (4278.9738) [Time Left: 00:07:20] ht_loss=279.0271 : L1=93.0395 : L2=92.9885 : L3=92.9991
-->[train]: [epoch-000-150][040/247] Time 4.48 (2.12) Data 3.52 (0.96) Forward 3.56 (1.19) Loss_all 279.3033 (3303.2060) [Time Left: 00:07:16] ht_loss=279.3033 : L1=93.0816 : L2=93.1021 : L3=93.1196
-->[train]: [epoch-000-150][050/247] Time 0.96 (2.03) Data 0.00 (0.92) Forward 0.05 (1.11) Loss_all 278.0408 (2710.1125) [Time Left: 00:06:38] ht_loss=278.0408 : L1=92.7466 : L2=92.6507 : L3=92.6435
-->[train]: [epoch-000-150][060/247] Time 4.94 (2.13) Data 3.96 (1.04) Forward 4.00 (1.21) Loss_all 279.6303 (2311.4327) [Time Left: 00:06:36] ht_loss=279.6303 : L1=93.1926 : L2=93.2191 : L3=93.2186
-->[train]: [epoch-000-150][070/247] Time 0.98 (2.08) Data 0.00 (1.00) Forward 0.05 (1.16) Loss_all 277.1473 (2025.1064) [Time Left: 00:06:05] ht_loss=277.1473 : L1=92.4281 : L2=92.3588 : L3=92.3605
-->[train]: [epoch-000-150][080/247] Time 4.86 (2.10) Data 3.87 (1.04) Forward 3.91 (1.18) Loss_all 278.3179 (1809.4039) [Time Left: 00:05:48] ht_loss=278.3179 : L1=92.7645 : L2=92.7693 : L3=92.7842
-->[train]: [epoch-000-150][090/247] Time 0.98 (2.08) Data 0.00 (1.03) Forward 0.05 (1.16) Loss_all 277.4024 (1641.1797) [Time Left: 00:05:24] ht_loss=277.4024 : L1=92.4464 : L2=92.4706 : L3=92.4854
-->[train]: [epoch-000-150][100/247] Time 6.01 (2.09) Data 5.07 (1.05) Forward 5.11 (1.17) Loss_all 277.7260 (1506.2978) [Time Left: 00:05:05] ht_loss=277.7260 : L1=92.5351 : L2=92.5909 : L3=92.6000
-->[train]: [epoch-000-150][110/247] Time 0.98 (2.08) Data 0.00 (1.04) Forward 0.05 (1.15) Loss_all 278.5226 (1395.6688) [Time Left: 00:04:42] ht_loss=278.5226 : L1=92.8076 : L2=92.8527 : L3=92.8623
-->[train]: [epoch-000-150][120/247] Time 4.90 (2.08) Data 3.86 (1.05) Forward 3.98 (1.16) Loss_all 277.3715 (1303.3518) [Time Left: 00:04:22] ht_loss=277.3715 : L1=92.4822 : L2=92.4495 : L3=92.4398
-->[train]: [epoch-000-150][130/247] Time 1.01 (2.06) Data 0.00 (1.03) Forward 0.05 (1.14) Loss_all 277.4960 (1225.1186) [Time Left: 00:03:59] ht_loss=277.4960 : L1=92.5309 : L2=92.4777 : L3=92.4875
-->[train]: [epoch-000-150][140/247] Time 4.89 (2.07) Data 3.85 (1.04) Forward 3.89 (1.14) Loss_all 277.4550 (1157.9637) [Time Left: 00:03:39] ht_loss=277.4550 : L1=92.4373 : L2=92.4946 : L3=92.5231
-->[train]: [epoch-000-150][150/247] Time 0.97 (2.07) Data 0.00 (1.04) Forward 0.05 (1.14) Loss_all 277.0612 (1099.6810) [Time Left: 00:03:18] ht_loss=277.0612 : L1=92.3476 : L2=92.3554 : L3=92.3583
-->[train]: [epoch-000-150][160/247] Time 4.87 (2.09) Data 3.86 (1.06) Forward 3.90 (1.16) Loss_all 278.3372 (1048.6711) [Time Left: 00:02:59] ht_loss=278.3372 : L1=92.7852 : L2=92.7728 : L3=92.7792
-->[train]: [epoch-000-150][170/247] Time 1.05 (2.07) Data 0.00 (1.05) Forward 0.05 (1.14) Loss_all 278.7277 (1003.6236) [Time Left: 00:02:37] ht_loss=278.7277 : L1=92.8838 : L2=92.9143 : L3=92.9296
-->[train]: [epoch-000-150][180/247] Time 5.15 (2.08) Data 4.04 (1.06) Forward 4.21 (1.15) Loss_all 278.2527 (963.5538) [Time Left: 00:02:17] ht_loss=278.2527 : L1=92.8155 : L2=92.7081 : L3=92.7291
-->[train]: [epoch-000-150][190/247] Time 1.03 (2.05) Data 0.00 (1.04) Forward 0.05 (1.12) Loss_all 277.6393 (927.6484) [Time Left: 00:01:55] ht_loss=277.6393 : L1=92.5422 : L2=92.5445 : L3=92.5525
-->[train]: [epoch-000-150][200/247] Time 4.82 (2.08) Data 3.81 (1.06) Forward 3.86 (1.15) Loss_all 278.6940 (895.3700) [Time Left: 00:01:35] ht_loss=278.6940 : L1=92.8379 : L2=92.9121 : L3=92.9440
-->[train]: [epoch-000-150][210/247] Time 1.01 (2.07) Data 0.00 (1.05) Forward 0.05 (1.14) Loss_all 278.6920 (866.1021) [Time Left: 00:01:14] ht_loss=278.6920 : L1=92.8799 : L2=92.8969 : L3=92.9152
-->[train]: [epoch-000-150][220/247] Time 4.31 (2.09) Data 3.27 (1.07) Forward 3.32 (1.15) Loss_all 277.6641 (839.5094) [Time Left: 00:00:54] ht_loss=277.6641 : L1=92.5900 : L2=92.5280 : L3=92.5461
-->[train]: [epoch-000-150][230/247] Time 0.98 (2.08) Data 0.00 (1.06) Forward 0.05 (1.14) Loss_all 277.3657 (815.2032) [Time Left: 00:00:33] ht_loss=277.3657 : L1=92.4130 : L2=92.4656 : L3=92.4871
-->[train]: [epoch-000-150][240/247] Time 5.31 (2.09) Data 4.35 (1.07) Forward 4.40 (1.16) Loss_all 276.9100 (792.9332) [Time Left: 00:00:12] ht_loss=276.9100 : L1=92.2631 : L2=92.3141 : L3=92.3328
-->[train]: [epoch-000-150][246/247] Time 2.29 (2.09) Data 0.00 (1.07) Forward 1.03 (1.15) Loss_all 278.2914 (781.8199) [Time Left: 00:00:00] ht_loss=278.2914 : L1=92.7296 : L2=92.7785 : L3=92.7834
Eval dataset length 31528, labeled data length 31528
Compute NME for 31528 images with 68 points :: [(nms): mean=164.630, std=33.857]
==>>[2021-01-04 10:56:13] Train [epoch-000-150] Average Loss = 781.819878, NME = 164.63
save checkpoint into ../snopshots/checkpoint/HEATMAP-epoch-000-150.pth
save checkpoint into ../snopshots/last-info.pth

==>>[2021-01-04 10:56:13] [epoch-001-150], [[Time Left: 21:42:18]], LR : [0.00050 ~ 0.00050], Config : {'epochs': 150, 'num_pts': 68, 'sigma': 4, 'print_freq': 10, 'downsample': 8, 'shape': [256, 256]}
-->[train]: [epoch-001-150][000/247] Time 8.17 (8.17) Data 7.04 (7.04) Forward 7.16 (7.16) Loss_all 278.1276 (278.1276) [Time Left: 00:33:30] ht_loss=278.1276 : L1=92.7402 : L2=92.6863 : L3=92.7011
-->[train]: [epoch-001-150][010/247] Time 1.02 (2.70) Data 0.00 (1.68) Forward 0.05 (1.74) Loss_all 277.0975 (278.3596) [Time Left: 00:10:37] ht_loss=277.0975 : L1=92.3605 : L2=92.3538 : L3=92.3832
-->[train]: [epoch-001-150][020/247] Time 4.71 (2.55) Data 3.67 (1.54) Forward 3.72 (1.59) Loss_all 277.3455 (278.1147) [Time Left: 00:09:37] ht_loss=277.3455 : L1=92.4743 : L2=92.4308 : L3=92.4404
-->[train]: [epoch-001-150][030/247] Time 1.08 (2.31) Data 0.00 (1.29) Forward 0.11 (1.34) Loss_all 278.7683 (278.1484) [Time Left: 00:08:18] ht_loss=278.7683 : L1=92.9498 : L2=92.9005 : L3=92.9180
-->[train]: [epoch-001-150][040/247] Time 7.17 (2.35) Data 6.17 (1.33) Forward 6.22 (1.38) Loss_all 278.2269 (278.1230) [Time Left: 00:08:04] ht_loss=278.2269 : L1=92.6964 : L2=92.7372 : L3=92.7933
-->[train]: [epoch-001-150][050/247] Time 1.03 (2.30) Data 0.00 (1.27) Forward 0.05 (1.32) Loss_all 276.7727 (278.0273) [Time Left: 00:07:29] ht_loss=276.7727 : L1=92.2674 : L2=92.2390 : L3=92.2664
-->[train]: [epoch-001-150][060/247] Time 5.33 (2.29) Data 4.27 (1.26) Forward 4.32 (1.32) Loss_all 277.9707 (277.9931) [Time Left: 00:07:06] ht_loss=277.9707 : L1=92.5334 : L2=92.6936 : L3=92.7436
-->[train]: [epoch-001-150][070/247] Time 1.08 (2.23) Data 0.00 (1.20) Forward 0.07 (1.25) Loss_all 276.6379 (277.9465) [Time Left: 00:06:32] ht_loss=276.6379 : L1=92.0694 : L2=92.2572 : L3=92.3113
-->[train]: [epoch-001-150][080/247] Time 5.25 (2.25) Data 4.22 (1.21) Forward 4.27 (1.26) Loss_all 278.9980 (277.9131) [Time Left: 00:06:12] ht_loss=278.9980 : L1=92.9393 : L2=93.0119 : L3=93.0468
-->[train]: [epoch-001-150][090/247] Time 1.11 (2.20) Data 0.00 (1.16) Forward 0.06 (1.22) Loss_all 278.4267 (277.9751) [Time Left: 00:05:43] ht_loss=278.4267 : L1=92.7253 : L2=92.8161 : L3=92.8853
-->[train]: [epoch-001-150][100/247] Time 3.97 (2.22) Data 2.93 (1.18) Forward 2.98 (1.23) Loss_all 277.6642 (277.8997) [Time Left: 00:05:23] ht_loss=277.6642 : L1=92.5115 : L2=92.5662 : L3=92.5865
-->[train]: [epoch-001-150][110/247] Time 1.00 (2.21) Data 0.00 (1.17) Forward 0.04 (1.22) Loss_all 278.4783 (277.8227) [Time Left: 00:05:00] ht_loss=278.4783 : L1=92.5632 : L2=92.6771 : L3=93.2380
-->[train]: [epoch-001-150][120/247] Time 1.53 (2.18) Data 0.52 (1.14) Forward 0.56 (1.20) Loss_all 278.9060 (277.7973) [Time Left: 00:04:34] ht_loss=278.9060 : L1=92.9879 : L2=92.9261 : L3=92.9920
-->[train]: [epoch-001-150][130/247] Time 1.04 (2.20) Data 0.00 (1.16) Forward 0.07 (1.22) Loss_all 275.1124 (277.7316) [Time Left: 00:04:15] ht_loss=275.1124 : L1=91.6005 : L2=91.7149 : L3=91.7970
-->[train]: [epoch-001-150][140/247] Time 1.36 (2.19) Data 0.35 (1.15) Forward 0.39 (1.21) Loss_all 277.0453 (277.6980) [Time Left: 00:03:52] ht_loss=277.0453 : L1=92.2058 : L2=92.3897 : L3=92.4498
-->[train]: [epoch-001-150][150/247] Time 1.05 (2.20) Data 0.00 (1.16) Forward 0.05 (1.22) Loss_all 277.4876 (277.6891) [Time Left: 00:03:31] ht_loss=277.4876 : L1=92.4398 : L2=92.4923 : L3=92.5554
-->[train]: [epoch-001-150][160/247] Time 1.01 (2.19) Data 0.00 (1.15) Forward 0.06 (1.20) Loss_all 276.7957 (277.6487) [Time Left: 00:03:08] ht_loss=276.7957 : L1=92.1740 : L2=92.2719 : L3=92.3497
-->[train]: [epoch-001-150][170/247] Time 1.07 (2.21) Data 0.00 (1.17) Forward 0.05 (1.23) Loss_all 274.4813 (277.5769) [Time Left: 00:02:48] ht_loss=274.4813 : L1=91.2472 : L2=91.5625 : L3=91.6715
-->[train]: [epoch-001-150][180/247] Time 4.76 (2.20) Data 3.73 (1.16) Forward 3.78 (1.22) Loss_all 277.1411 (277.5542) [Time Left: 00:02:25] ht_loss=277.1411 : L1=92.2720 : L2=92.3849 : L3=92.4842
-->[train]: [epoch-001-150][190/247] Time 1.03 (2.19) Data 0.00 (1.15) Forward 0.06 (1.20) Loss_all 276.4174 (277.5101) [Time Left: 00:02:02] ht_loss=276.4174 : L1=91.9825 : L2=92.1831 : L3=92.2517
-->[train]: [epoch-001-150][200/247] Time 1.08 (2.18) Data 0.00 (1.14) Forward 0.08 (1.20) Loss_all 276.7092 (277.4698) [Time Left: 00:01:40] ht_loss=276.7092 : L1=92.0072 : L2=92.3023 : L3=92.3998
-->[train]: [epoch-001-150][210/247] Time 1.01 (2.19) Data 0.00 (1.15) Forward 0.05 (1.21) Loss_all 277.3184 (277.4343) [Time Left: 00:01:18] ht_loss=277.3184 : L1=92.2769 : L2=92.4722 : L3=92.5693
-->[train]: [epoch-001-150][220/247] Time 2.94 (2.19) Data 1.92 (1.15) Forward 1.96 (1.20) Loss_all 275.9615 (277.3970) [Time Left: 00:00:56] ht_loss=275.9615 : L1=91.9907 : L2=91.9326 : L3=92.0382
-->[train]: [epoch-001-150][230/247] Time 1.04 (2.19) Data 0.00 (1.15) Forward 0.07 (1.20) Loss_all 276.7830 (277.3277) [Time Left: 00:00:35] ht_loss=276.7830 : L1=91.9892 : L2=92.3281 : L3=92.4656
-->[train]: [epoch-001-150][240/247] Time 4.19 (2.20) Data 3.19 (1.16) Forward 3.24 (1.22) Loss_all 275.8693 (277.2501) [Time Left: 00:00:13] ht_loss=275.8693 : L1=91.6268 : L2=92.0341 : L3=92.2084
-->[train]: [epoch-001-150][246/247] Time 0.37 (2.19) Data 0.00 (1.15) Forward 0.04 (1.21) Loss_all 277.2675 (277.2426) [Time Left: 00:00:00] ht_loss=277.2675 : L1=92.4372 : L2=92.3524 : L3=92.4779
Eval dataset length 31528, labeled data length 31528
Compute NME for 31528 images with 68 points :: [(nms): mean=165.022, std=33.734]
==>>[2021-01-04 11:05:24] Train [epoch-001-150] Average Loss = 277.242612, NME = 165.02
save checkpoint into ../snopshots/checkpoint/HEATMAP-epoch-001-150.pth
save checkpoint into ../snopshots/last-info.pth
Basic-Eval-All evaluates 1 dataset
==>>[2021-01-04 11:05:24], [epoch-001-150], evaluate the 0/1-th dataset [image] : GeneralDataset(point-num=68, shape=[256, 256], sigma=4, heatmap_type=gaussian, length=689, cutout=0, dataset=test_300w)
-->[test]: [epoch-001-150][000/006] Time 6.60 (6.60) Data 6.19 (6.19) Forward 6.22 (6.22) Loss_all 280.0911 (280.0911) [Time Left: 00:00:33] ht_loss=280.0911 : L1=93.1269 : L2=93.3953 : L3=93.5689
-->[test]: [epoch-001-150][005/006] Time 1.23 (1.57) Data 0.00 (1.03) Forward 1.14 (1.25) Loss_all 280.1683 (279.7528) [Time Left: 00:00:00] ht_loss=280.1683 : L1=93.1425 : L2=93.4033 : L3=93.6225
Eval dataset length 689, labeled data length 689
Compute NME for 689 images with 68 points :: [(nms): mean=166.901, std=24.249]
NME Results :
->test_300w : NME = 166.901,

Dian-Yi commented Jan 7, 2021

I did some experiments to find the problem.
Here is a sample of the heatmap labels produced by generate_label_map(pts, height, width, sigma, downsample, nopoints, ctype), with sigma=4 and downsample=8.
The number of values > 0 in each map:
[ 7, 6, 8, 7, 7, 7, 7, 7, 7, 8, 7, 8,
7, 7, 7, 7, 8, 8, 7, 8, 7, 7, 8, 8,
8, 7, 8, 7, 7, 7, 7, 8, 7, 7, 7, 7,
7, 7, 7, 8, 7, 7, 7, 7, 7, 8, 6, 8,
7, 7, 8, 7, 8, 8, 7, 7, 6, 7, 7, 8,
8, 8, 7, 8, 7, 7, 8, 7, 1024]
The max value of each map:
[0.8979, 0.8032, 0.6506, 0.6165, 0.6372, 0.4791, 0.5207, 0.4606, 0.8633,
0.8925, 0.8269, 0.7173, 0.8287, 0.5979, 0.5846, 0.6472, 0.7989, 0.7900,
0.9217, 0.6162, 0.6745, 0.5610, 0.8472, 0.8779, 0.0000, 0.9160, 0.5829,
0.4558, 0.6468, 0.4131, 0.6791, 0.7397, 0.8015, 0.6352, 0.7229, 0.6346,
0.4943, 0.6630, 0.8280, 0.6312, 0.9719, 0.9219, 0.7392, 0.7575, 0.6396,
0.9408, 0.8714, 0.8833, 0.6204, 0.6979, 0.6662, 0.8334, 0.7105, 0.7566,
0.9396, 0.8781, 0.6446, 0.9268, 0.7921, 0.9549, 0.5869, 0.9428, 0.8311,
0.9922, 0.5339, 0.8815, 0.9315, 0.9609, 1.0000]
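For reference, a minimal sketch of how I compute these statistics (assuming the label map is a torch tensor of shape (num_pts + 1, H, W); adapt accordingly if generate_label_map returns a numpy array):

```python
import torch

def label_map_stats(heatmaps: torch.Tensor):
    # Count of pixels > 0 per map, and the peak value per map.
    nonzero_counts = (heatmaps > 0).flatten(1).sum(dim=1)
    max_values = heatmaps.flatten(1).max(dim=1).values
    return nonzero_counts, max_values
```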

You can see that the number of non-zero pixels in the background label is 32*32 = 1024, while each foreground map only has 6-8, so there is a severe imbalance between background and foreground. I changed the generate_label_map function and added the Adaptive Wing loss together with loss weight maps, both taken from the Adaptive Wing Loss paper. After that, my model trains well; a rough sketch of what I added is below.
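This is roughly the loss I switched to (my own re-implementation of the Adaptive Wing loss with a weight map, using the default hyper-parameters from that paper; not code from this repo):

```python
import torch
import torch.nn.functional as F

def adaptive_wing_loss(pred, target, alpha=2.1, omega=14.0, epsilon=1.0, theta=0.5):
    # Element-wise Adaptive Wing loss between predicted and target heatmaps.
    delta = (pred - target).abs()
    A = omega * (1.0 / (1.0 + (theta / epsilon) ** (alpha - target))) \
        * (alpha - target) * ((theta / epsilon) ** (alpha - target - 1.0)) / epsilon
    C = theta * A - omega * torch.log(1.0 + (theta / epsilon) ** (alpha - target))
    return torch.where(delta < theta,
                       omega * torch.log(1.0 + (delta / epsilon) ** (alpha - target)),
                       A * delta - C)

def weighted_awing_loss(pred, target, dilation=3, w=10.0):
    # Weight map: dilate the foreground region of the ground-truth heatmaps and
    # up-weight it so the 32x32 background no longer dominates the loss.
    mask = (target > 0.2).float()
    mask = F.max_pool2d(mask, kernel_size=dilation, stride=1, padding=dilation // 2)
    weight_map = 1.0 + w * mask
    return (adaptive_wing_loss(pred, target) * weight_map).mean()
```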
Have you ever encountered this problem when training the ProCPM model?
