Degraded performance after first epoch #1147

Open
YumingChang02 opened this issue Dec 7, 2022 · 1 comment
Comments

@YumingChang02

🐛 Describe the bug

Running the imagenet main.py from the PyTorch examples repository (github link) produces the following output:

=> creating model 'mobilenet_v3_small'
Epoch: [0][   1/1424]   Time 60.000 (60.000)    Data 15.133 (15.133)    Loss 6.9078e+00 (6.9078e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.33 (  0.33)
Epoch: [0][  21/1424]   Time  1.175 ( 4.022)    Data  0.001 ( 0.722)    Loss 6.8997e+00 (6.9054e+00)    Acc@1   0.33 (  0.11)   Acc@5   0.78 (  0.60)
Epoch: [0][  41/1424]   Time  1.186 ( 2.652)    Data  0.001 ( 0.384)    Loss 6.8966e+00 (6.9001e+00)    Acc@1   0.00 (  0.17)   Acc@5   0.56 (  0.70)
Epoch: [0][  61/1424]   Time  1.205 ( 2.193)    Data  0.001 ( 0.269)    Loss 6.8423e+00 (6.8909e+00)    Acc@1   0.22 (  0.17)   Acc@5   1.33 (  0.75)
Epoch: [0][  81/1424]   Time  1.188 ( 1.951)    Data  0.001 ( 0.209)    Loss 6.8351e+00 (6.8794e+00)    Acc@1   0.00 (  0.16)   Acc@5   1.00 (  0.80)
Epoch: [0][ 101/1424]   Time  1.199 ( 1.805)    Data  0.001 ( 0.174)    Loss 6.8079e+00 (6.8674e+00)    Acc@1   0.44 (  0.17)   Acc@5   1.11 (  0.86)
Epoch: [0][ 121/1424]   Time  1.168 ( 1.706)    Data  0.001 ( 0.149)    Loss 6.7743e+00 (6.8548e+00)    Acc@1   0.22 (  0.18)   Acc@5   1.78 (  0.89)
Epoch: [0][ 141/1424]   Time  1.234 ( 1.636)    Data  0.001 ( 0.132)    Loss 6.7362e+00 (6.8401e+00)    Acc@1   0.00 (  0.20)   Acc@5   1.33 (  0.96)
Epoch: [0][ 161/1424]   Time  1.221 ( 1.583)    Data  0.001 ( 0.119)    Loss 6.7279e+00 (6.8240e+00)    Acc@1   0.33 (  0.23)   Acc@5   1.11 (  1.06)
Epoch: [0][ 181/1424]   Time  1.216 ( 1.546)    Data  0.001 ( 0.109)    Loss 6.6291e+00 (6.8059e+00)    Acc@1   0.89 (  0.27)   Acc@5   2.11 (  1.16)
Epoch: [0][ 201/1424]   Time  1.201 ( 1.513)    Data  0.001 ( 0.101)    Loss 6.5352e+00 (6.7850e+00)    Acc@1   0.78 (  0.29)   Acc@5   2.89 (  1.27)
Epoch: [0][ 221/1424]   Time  1.207 ( 1.485)    Data  0.001 ( 0.095)    Loss 6.4816e+00 (6.7620e+00)    Acc@1   0.89 (  0.32)   Acc@5   3.00 (  1.38)
Epoch: [0][ 241/1424]   Time  1.219 ( 1.462)    Data  0.001 ( 0.089)    Loss 6.3946e+00 (6.7378e+00)    Acc@1   1.11 (  0.35)   Acc@5   3.56 (  1.48)
Epoch: [0][ 261/1424]   Time  1.209 ( 1.442)    Data  0.001 ( 0.084)    Loss 6.3972e+00 (6.7125e+00)    Acc@1   0.89 (  0.38)   Acc@5   2.78 (  1.61)
Epoch: [0][ 281/1424]   Time  1.234 ( 1.425)    Data  0.001 ( 0.080)    Loss 6.3541e+00 (6.6868e+00)    Acc@1   0.67 (  0.41)   Acc@5   3.56 (  1.78)
Epoch: [0][ 301/1424]   Time  1.174 ( 1.411)    Data  0.001 ( 0.077)    Loss 6.2194e+00 (6.6605e+00)    Acc@1   1.33 (  0.46)   Acc@5   4.89 (  1.94)
Epoch: [0][ 321/1424]   Time  1.230 ( 1.399)    Data  0.001 ( 0.074)    Loss 6.2427e+00 (6.6346e+00)    Acc@1   1.00 (  0.49)   Acc@5   4.00 (  2.11)
Epoch: [0][ 341/1424]   Time  1.202 ( 1.387)    Data  0.001 ( 0.071)    Loss 6.1785e+00 (6.6091e+00)    Acc@1   1.00 (  0.54)   Acc@5   4.56 (  2.28)
Epoch: [0][ 361/1424]   Time  1.215 ( 1.377)    Data  0.001 ( 0.069)    Loss 6.1027e+00 (6.5836e+00)    Acc@1   1.78 (  0.58)   Acc@5   6.44 (  2.45)
Epoch: [0][ 381/1424]   Time  1.175 ( 1.369)    Data  0.001 ( 0.067)    Loss 6.1400e+00 (6.5580e+00)    Acc@1   1.22 (  0.62)   Acc@5   5.22 (  2.63)
Epoch: [0][ 401/1424]   Time  1.216 ( 1.361)    Data  0.001 ( 0.065)    Loss 6.0648e+00 (6.5339e+00)    Acc@1   1.89 (  0.67)   Acc@5   7.22 (  2.82)
Epoch: [0][ 421/1424]   Time  1.192 ( 1.353)    Data  0.001 ( 0.063)    Loss 6.0373e+00 (6.5096e+00)    Acc@1   2.33 (  0.73)   Acc@5   7.56 (  2.99)
Epoch: [0][ 441/1424]   Time  1.213 ( 1.347)    Data  0.001 ( 0.061)    Loss 5.9490e+00 (6.4868e+00)    Acc@1   2.11 (  0.78)   Acc@5   7.00 (  3.17)
Epoch: [0][ 461/1424]   Time  1.209 ( 1.341)    Data  0.001 ( 0.060)    Loss 5.8554e+00 (6.4637e+00)    Acc@1   2.89 (  0.83)   Acc@5   8.67 (  3.36)
Epoch: [0][ 481/1424]   Time  1.181 ( 1.335)    Data  0.001 ( 0.059)    Loss 5.8994e+00 (6.4411e+00)    Acc@1   2.00 (  0.89)   Acc@5   7.89 (  3.55)
Epoch: [0][ 501/1424]   Time  1.195 ( 1.331)    Data  0.001 ( 0.058)    Loss 5.8446e+00 (6.4185e+00)    Acc@1   2.67 (  0.94)   Acc@5   8.89 (  3.75)
Epoch: [0][ 521/1424]   Time  1.124 ( 1.326)    Data  0.001 ( 0.057)    Loss 5.8407e+00 (6.3959e+00)    Acc@1   2.33 (  1.00)   Acc@5   7.33 (  3.93)
Epoch: [0][ 541/1424]   Time  1.214 ( 1.322)    Data  0.001 ( 0.055)    Loss 5.7237e+00 (6.3733e+00)    Acc@1   2.78 (  1.06)   Acc@5  11.22 (  4.15)
Epoch: [0][ 561/1424]   Time  1.205 ( 1.318)    Data  0.001 ( 0.054)    Loss 5.7182e+00 (6.3512e+00)    Acc@1   3.11 (  1.13)   Acc@5  10.11 (  4.37)
Epoch: [0][ 581/1424]   Time  1.190 ( 1.314)    Data  0.001 ( 0.054)    Loss 5.6628e+00 (6.3281e+00)    Acc@1   3.78 (  1.20)   Acc@5  11.89 (  4.60)
Epoch: [0][ 601/1424]   Time  1.224 ( 1.310)    Data  0.001 ( 0.053)    Loss 5.6361e+00 (6.3064e+00)    Acc@1   3.78 (  1.27)   Acc@5  11.33 (  4.82)
Epoch: [0][ 621/1424]   Time  1.188 ( 1.307)    Data  0.001 ( 0.052)    Loss 5.6024e+00 (6.2845e+00)    Acc@1   3.11 (  1.33)   Acc@5  10.78 (  5.04)
Epoch: [0][ 641/1424]   Time  1.190 ( 1.304)    Data  0.001 ( 0.051)    Loss 5.4781e+00 (6.2628e+00)    Acc@1   4.11 (  1.40)   Acc@5  14.00 (  5.27)
Epoch: [0][ 661/1424]   Time  1.168 ( 1.301)    Data  0.001 ( 0.050)    Loss 5.5298e+00 (6.2410e+00)    Acc@1   3.67 (  1.48)   Acc@5  11.78 (  5.51)
Epoch: [0][ 681/1424]   Time  1.179 ( 1.298)    Data  0.001 ( 0.050)    Loss 5.5839e+00 (6.2199e+00)    Acc@1   3.33 (  1.56)   Acc@5  12.89 (  5.75)
Epoch: [0][ 701/1424]   Time  1.175 ( 1.296)    Data  0.001 ( 0.049)    Loss 5.4905e+00 (6.2003e+00)    Acc@1   3.78 (  1.63)   Acc@5  13.44 (  5.97)
Epoch: [0][ 721/1424]   Time  1.209 ( 1.294)    Data  0.001 ( 0.049)    Loss 5.4298e+00 (6.1797e+00)    Acc@1   4.11 (  1.71)   Acc@5  13.78 (  6.20)
Epoch: [0][ 741/1424]   Time  1.187 ( 1.292)    Data  0.001 ( 0.048)    Loss 5.3948e+00 (6.1603e+00)    Acc@1   4.56 (  1.79)   Acc@5  14.22 (  6.43)
Epoch: [0][ 761/1424]   Time  1.204 ( 1.289)    Data  0.001 ( 0.048)    Loss 5.3875e+00 (6.1405e+00)    Acc@1   5.56 (  1.87)   Acc@5  14.67 (  6.67)
Epoch: [0][ 781/1424]   Time  1.208 ( 1.287)    Data  0.001 ( 0.047)    Loss 5.3482e+00 (6.1205e+00)    Acc@1   5.78 (  1.96)   Acc@5  17.44 (  6.91)
Epoch: [0][ 801/1424]   Time  1.221 ( 1.285)    Data  0.001 ( 0.047)    Loss 5.2558e+00 (6.1018e+00)    Acc@1   4.67 (  2.04)   Acc@5  16.33 (  7.14)
Epoch: [0][ 821/1424]   Time  1.197 ( 1.283)    Data  0.001 ( 0.046)    Loss 5.3466e+00 (6.0833e+00)    Acc@1   6.00 (  2.12)   Acc@5  16.89 (  7.37)
Epoch: [0][ 841/1424]   Time  1.194 ( 1.282)    Data  0.001 ( 0.046)    Loss 5.2687e+00 (6.0649e+00)    Acc@1   5.33 (  2.20)   Acc@5  17.44 (  7.59)
Epoch: [0][ 861/1424]   Time  1.217 ( 1.280)    Data  0.001 ( 0.045)    Loss 5.3550e+00 (6.0473e+00)    Acc@1   5.78 (  2.28)   Acc@5  14.56 (  7.81)
Epoch: [0][ 881/1424]   Time  1.235 ( 1.278)    Data  0.001 ( 0.045)    Loss 5.3109e+00 (6.0297e+00)    Acc@1   6.00 (  2.37)   Acc@5  18.00 (  8.04)
Epoch: [0][ 901/1424]   Time  1.212 ( 1.277)    Data  0.001 ( 0.045)    Loss 5.3358e+00 (6.0127e+00)    Acc@1   6.44 (  2.45)   Acc@5  18.33 (  8.26)
Epoch: [0][ 921/1424]   Time  1.180 ( 1.275)    Data  0.001 ( 0.044)    Loss 5.2877e+00 (5.9953e+00)    Acc@1   5.89 (  2.53)   Acc@5  18.11 (  8.50)
Epoch: [0][ 941/1424]   Time  1.211 ( 1.274)    Data  0.001 ( 0.044)    Loss 5.2182e+00 (5.9784e+00)    Acc@1   6.44 (  2.62)   Acc@5  18.33 (  8.72)
Epoch: [0][ 961/1424]   Time  1.231 ( 1.272)    Data  0.001 ( 0.044)    Loss 5.1222e+00 (5.9617e+00)    Acc@1   7.89 (  2.70)   Acc@5  20.89 (  8.95)
Epoch: [0][ 981/1424]   Time  1.198 ( 1.271)    Data  0.001 ( 0.043)    Loss 5.1630e+00 (5.9452e+00)    Acc@1   7.67 (  2.79)   Acc@5  20.33 (  9.17)
Epoch: [0][1001/1424]   Time  1.230 ( 1.270)    Data  0.001 ( 0.043)    Loss 5.0712e+00 (5.9293e+00)    Acc@1   7.33 (  2.87)   Acc@5  19.56 (  9.38)
Epoch: [0][1021/1424]   Time  1.192 ( 1.269)    Data  0.001 ( 0.043)    Loss 5.0792e+00 (5.9138e+00)    Acc@1   7.00 (  2.95)   Acc@5  20.00 (  9.61)
Epoch: [0][1041/1424]   Time  1.183 ( 1.268)    Data  0.001 ( 0.042)    Loss 5.1826e+00 (5.8981e+00)    Acc@1   6.22 (  3.03)   Acc@5  19.33 (  9.82)
Epoch: [0][1061/1424]   Time  1.208 ( 1.267)    Data  0.001 ( 0.042)    Loss 5.1046e+00 (5.8823e+00)    Acc@1   7.56 (  3.12)   Acc@5  22.67 ( 10.05)
Epoch: [0][1081/1424]   Time  1.224 ( 1.266)    Data  0.001 ( 0.042)    Loss 5.0414e+00 (5.8669e+00)    Acc@1   9.00 (  3.20)   Acc@5  23.22 ( 10.26)
Epoch: [0][1101/1424]   Time  1.188 ( 1.265)    Data  0.001 ( 0.042)    Loss 5.0559e+00 (5.8515e+00)    Acc@1   8.33 (  3.29)   Acc@5  20.67 ( 10.48)
Epoch: [0][1121/1424]   Time  1.216 ( 1.265)    Data  0.001 ( 0.041)    Loss 5.0840e+00 (5.8369e+00)    Acc@1   7.44 (  3.38)   Acc@5  22.22 ( 10.70)
Epoch: [0][1141/1424]   Time  1.223 ( 1.263)    Data  0.001 ( 0.041)    Loss 4.9830e+00 (5.8229e+00)    Acc@1   8.78 (  3.46)   Acc@5  21.56 ( 10.89)
Epoch: [0][1161/1424]   Time  1.173 ( 1.262)    Data  0.001 ( 0.041)    Loss 4.9043e+00 (5.8086e+00)    Acc@1   8.11 (  3.54)   Acc@5  22.78 ( 11.11)
Epoch: [0][1181/1424]   Time  1.217 ( 1.262)    Data  0.001 ( 0.041)    Loss 4.9920e+00 (5.7944e+00)    Acc@1   8.44 (  3.62)   Acc@5  23.44 ( 11.32)
Epoch: [0][1201/1424]   Time  1.208 ( 1.261)    Data  0.001 ( 0.041)    Loss 4.9523e+00 (5.7801e+00)    Acc@1   9.00 (  3.71)   Acc@5  21.78 ( 11.53)
Epoch: [0][1221/1424]   Time  1.286 ( 1.260)    Data  0.001 ( 0.040)    Loss 4.9660e+00 (5.7661e+00)    Acc@1   8.33 (  3.80)   Acc@5  23.56 ( 11.73)
Epoch: [0][1241/1424]   Time  1.171 ( 1.259)    Data  0.001 ( 0.040)    Loss 4.9641e+00 (5.7521e+00)    Acc@1   8.67 (  3.89)   Acc@5  23.00 ( 11.94)
Epoch: [0][1261/1424]   Time  1.212 ( 1.259)    Data  0.001 ( 0.040)    Loss 5.0178e+00 (5.7388e+00)    Acc@1   8.78 (  3.97)   Acc@5  24.33 ( 12.14)
Epoch: [0][1281/1424]   Time  1.203 ( 1.258)    Data  0.001 ( 0.040)    Loss 4.9352e+00 (5.7259e+00)    Acc@1   8.78 (  4.06)   Acc@5  24.67 ( 12.34)
Epoch: [0][1301/1424]   Time  1.186 ( 1.257)    Data  0.001 ( 0.040)    Loss 4.8461e+00 (5.7127e+00)    Acc@1  11.67 (  4.14)   Acc@5  27.22 ( 12.54)
Epoch: [0][1321/1424]   Time  1.209 ( 1.256)    Data  0.001 ( 0.039)    Loss 4.9172e+00 (5.6998e+00)    Acc@1  10.11 (  4.23)   Acc@5  23.44 ( 12.73)
Epoch: [0][1341/1424]   Time  1.186 ( 1.256)    Data  0.001 ( 0.039)    Loss 4.7432e+00 (5.6869e+00)    Acc@1  12.00 (  4.31)   Acc@5  28.56 ( 12.93)
Epoch: [0][1361/1424]   Time  1.217 ( 1.255)    Data  0.001 ( 0.039)    Loss 4.8297e+00 (5.6744e+00)    Acc@1  10.89 (  4.40)   Acc@5  24.89 ( 13.12)
Epoch: [0][1381/1424]   Time  1.203 ( 1.255)    Data  0.001 ( 0.039)    Loss 4.9124e+00 (5.6622e+00)    Acc@1   9.00 (  4.48)   Acc@5  25.22 ( 13.31)
Epoch: [0][1401/1424]   Time  1.215 ( 1.254)    Data  0.000 ( 0.039)    Loss 4.8593e+00 (5.6498e+00)    Acc@1  10.56 (  4.57)   Acc@5  25.22 ( 13.51)
Epoch: [0][1421/1424]   Time  1.271 ( 1.253)    Data  0.000 ( 0.039)    Loss 4.7977e+00 (5.6378e+00)    Acc@1  10.33 (  4.65)   Acc@5  27.56 ( 13.70)
Test: [ 1/56]   Time 19.546 (19.546)    Loss 4.2155e+00 (4.2155e+00)    Acc@1  16.11 ( 16.11)   Acc@5  44.11 ( 44.11)
Test: [21/56]   Time  0.352 ( 1.719)    Loss 5.6829e+00 (4.9246e+00)    Acc@1   2.56 (  8.41)   Acc@5  10.33 ( 24.26)
Test: [41/56]   Time  0.345 ( 1.552)    Loss 4.9785e+00 (4.9935e+00)    Acc@1   8.44 (  8.10)   Acc@5  22.11 ( 23.13)
 *   Acc@1 8.258 Acc@5 23.380
Epoch: [1][   1/1424]   Time 17.457 (17.457)    Data 15.719 (15.719)    Loss 4.8509e+00 (4.8509e+00)    Acc@1  10.00 ( 10.00)   Acc@5  25.00 ( 25.00)
Epoch: [1][  21/1424]   Time  1.705 ( 2.485)    Data  0.001 ( 0.770)    Loss 4.9256e+00 (4.7646e+00)    Acc@1   9.78 ( 10.74)   Acc@5  24.33 ( 27.21)
Epoch: [1][  41/1424]   Time  1.741 ( 2.127)    Data  0.001 ( 0.419)    Loss 4.7073e+00 (4.7602e+00)    Acc@1  11.22 ( 10.82)   Acc@5  27.89 ( 27.44)
Epoch: [1][  61/1424]   Time  1.709 ( 2.004)    Data  0.001 ( 0.297)    Loss 4.7274e+00 (4.7496e+00)    Acc@1  10.67 ( 11.00)   Acc@5  29.33 ( 27.64)
Epoch: [1][  81/1424]   Time  1.739 ( 1.940)    Data  0.001 ( 0.236)    Loss 4.6097e+00 (4.7386e+00)    Acc@1  11.56 ( 11.05)   Acc@5  30.00 ( 27.81)
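
To put a number on the slowdown, the instantaneous step time (the first value after `Time`) can be averaged per epoch. The following is a minimal sketch, assuming the log format shown above; the function name `per_epoch_step_time` and the regular expression are illustrative and not part of the examples repository. The first step of each epoch is skipped because it includes warm-up and data loader start-up.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches training lines such as:
# Epoch: [0][  21/1424]   Time  1.175 ( 4.022)    Data  0.001 ( 0.722) ...
LINE_RE = re.compile(
    r"Epoch: \[(?P<epoch>\d+)\]\[\s*(?P<step>\d+)/\d+\]\s+"
    r"Time\s+(?P<time>[\d.]+)\s+\(\s*[\d.]+\)\s+"
    r"Data\s+(?P<data>[\d.]+)"
)


def per_epoch_step_time(log_text, skip_first=True):
    """Return {epoch: mean instantaneous step time in seconds}."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a training progress line (e.g. "Test:" lines)
        if skip_first and int(m.group("step")) == 1:
            continue  # first step includes warm-up and loader start-up
        times[int(m.group("epoch"))].append(float(m.group("time")))
    return {epoch: mean(vals) for epoch, vals in times.items()}


# A few lines copied from the log above (truncated after the Loss column).
sample = """\
Epoch: [0][   1/1424]   Time 60.000 (60.000)    Data 15.133 (15.133)    Loss 6.9078e+00 (6.9078e+00)
Epoch: [0][  21/1424]   Time  1.175 ( 4.022)    Data  0.001 ( 0.722)    Loss 6.8997e+00 (6.9054e+00)
Epoch: [0][  41/1424]   Time  1.186 ( 2.652)    Data  0.001 ( 0.384)    Loss 6.8966e+00 (6.9001e+00)
Epoch: [1][   1/1424]   Time 17.457 (17.457)    Data 15.719 (15.719)    Loss 4.8509e+00 (4.8509e+00)
Epoch: [1][  21/1424]   Time  1.705 ( 2.485)    Data  0.001 ( 0.770)    Loss 4.9256e+00 (4.7646e+00)
Epoch: [1][  41/1424]   Time  1.741 ( 2.127)    Data  0.001 ( 0.419)    Loss 4.7073e+00 (4.7602e+00)
"""

print(per_epoch_step_time(sample))
```

In the log above, the steady-state step time goes from roughly 1.2 s in epoch 0 to roughly 1.7 s in epoch 1, while the `Data` column stays at 0.001 s in both epochs, which suggests the slowdown is probably not in data loading.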

During the run, the kernel logs the following message:

[Wed Dec  7 14:44:07 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[Wed Dec  7 14:44:09 2022] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.

System info from inxi (host machine, running rocm/pytorch:latest):

markchang@X99-TF-8 ~> inxi -Fnx
System:
  Host: X99-TF-8 Kernel: 6.0.11-arch1-1 arch: x86_64 bits: 64 compiler: gcc v: 12.2.0
    Console: pty pts/5 Distro: Arch Linux
Machine:
  Type: Desktop System: HUANANZHI product: N/A v: N/A serial: <superuser required>
  Mobo: HUANANZHI model: X99-TF-Q GAMING v: V1.2 serial: <superuser required>
    UEFI: American Megatrends v: 5.11 date: 07/06/2022
CPU:
  Info: 12-core model: Intel Xeon E5-2673 v3 bits: 64 type: MT MCP arch: Haswell rev: 2 cache:
    L1: 768 KiB L2: 3 MiB L3: 30 MiB
  Speed (MHz): avg: 2136 high: 3100 min/max: 1200/3100 cores: 1: 2351 2: 2694 3: 1200 4: 2295
    5: 2399 6: 2694 7: 2394 8: 2395 9: 1197 10: 3100 11: 2195 12: 2694 13: 2300 14: 2694 15: 1199
    16: 2294 17: 2394 18: 2000 19: 2394 20: 2394 21: 1200 22: 1200 23: 1200 24: 2409
    bogomips: 114965
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor driver: amdgpu
    v: kernel arch: RDNA-2 bus-ID: 06:00.0
  Device-2: AMD Navi 23 [Radeon RX 6600/6600 XT/6600M] vendor: Tul / PowerColor driver: amdgpu
    v: kernel arch: RDNA-2 bus-ID: 09:00.0

Versions

Note: this is running in Docker (rocm/pytorch:latest).

Collecting environment information...
PyTorch version: 1.13.0a0+git941769a
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.4.22801-aaa1e3d8

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.0 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5)
CMake version: version 3.22.1
Libc version: glibc-2.31

Python version: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-6.0.11-arch1-1-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon RX 6600
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.4.22801
MIOpen runtime version: 2.19.0
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] torch==1.13.0a0+git941769a
[pip3] torchvision==0.14.0a0+bd70a78
[conda] mkl                       2022.0.1           h06a4308_117
[conda] mkl-include               2022.0.1           h06a4308_117
[conda] numpy                     1.22.4                   pypi_0    pypi
[conda] torch                     1.13.0a0+git941769a          pypi_0    pypi
[conda] torchvision               0.14.0a0+bd70a78          pypi_0    pypi
@sunway513

Hi @YumingChang02, can you watch your GPU operating temperature after the first epoch? If it gets too high, you might experience slower performance.
watch -n 0.1 rocm-smi
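
`watch` only shows the latest reading, so it can be hard to line up temperatures with the epoch boundary afterwards. As a rough sketch, the same command could be polled from Python and its raw output appended to a file with timestamps. The helper name `log_command_output`, the default file name `rocm_smi.log`, and the sampling interval are assumptions made for illustration; the command itself is passed in unchanged and its output is not parsed.

```python
import subprocess
import time
from datetime import datetime


def log_command_output(cmd, interval_s=1.0, samples=5, out_path="rocm_smi.log"):
    """Run `cmd` repeatedly and append its timestamped stdout to `out_path`.

    Returns the list of captured stdout strings, one per sample.
    """
    collected = []
    with open(out_path, "a") as log:
        for i in range(samples):
            result = subprocess.run(cmd, capture_output=True, text=True)
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"=== {stamp} (exit {result.returncode}) ===\n")
            log.write(result.stdout)
            log.flush()  # keep the file usable if the run is interrupted
            collected.append(result.stdout)
            if i < samples - 1:
                time.sleep(interval_s)
    return collected


# Example (requires ROCm to be installed; uses plain `rocm-smi` as above):
# log_command_output(["rocm-smi"], interval_s=1.0, samples=600)
```

Starting this in a second shell before training begins gives a record that covers both epoch 0 and epoch 1, which can then be compared against the timestamps of the amdgpu kernel messages.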
