JIT: Optimize counted loops by reversing them #100915

jakobbotsch · 2024-04-11T14:06:23Z

The JIT should be able to optimize counted loops like

private static void Foo()
{
    for (int i = 0; i < 100; i++)
    {
        Bar();
    }
}

to the equivalent downwards counting loop, which is more efficient on both x64 and arm64.

Current codegen:
x64:

G_M35517_IG02:  ;; offset=0x0005
       xor      ebx, ebx
						;; size=2 bbWeight=1 PerfScore 0.25

G_M35517_IG03:  ;; offset=0x0007
       call     [RangeCheck_Overflow:Bar()]
       inc      ebx
       cmp      ebx, 100
       jl       SHORT G_M35517_IG03
						;; size=13 bbWeight=4 PerfScore 18.00

arm64:

G_M35517_IG02:  ;; offset=0x000C
            mov     w19, wzr
						;; size=4 bbWeight=1 PerfScore 0.50

G_M35517_IG03:  ;; offset=0x0010
            movz    x0, #0x6820      // code for Program:Bar()
            movk    x0, #0x6A29 LSL #16
            movk    x0, #0x7FF9 LSL #32
            ldr     x0, [x0]
            blr     x0
            add     w19, w19, #1
            cmp     w19, #100
            blt     G_M35517_IG03
						;; size=32 bbWeight=4 PerfScore 30.00

Expected codegen:
x64:

G_M35517_IG02:  ;; offset=0x0005
       mov      ebx, 100
						;; size=5 bbWeight=1 PerfScore 0.25

G_M35517_IG03:  ;; offset=0x000A
       call     [RangeCheck_Overflow:Bar()]
       dec      ebx
       jne      SHORT G_M35517_IG03
						;; size=10 bbWeight=4 PerfScore 17.00

arm64:

G_M35517_IG02:  ;; offset=0x000C
            mov     w19, #100
						;; size=4 bbWeight=1 PerfScore 0.50

G_M35517_IG03:  ;; offset=0x0010
            movz    x0, #0x6820      // code for Program:Bar()
            movk    x0, #0x6A2B LSL #16
            movk    x0, #0x7FF9 LSL #32
            ldr     x0, [x0]
            blr     x0
            sub     w19, w19, #1
            cbnz    w19, G_M35517_IG03
						;; size=28 bbWeight=4 PerfScore 28.00

dotnet-policy-service · 2024-04-11T14:06:40Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

EgorBo · 2024-04-11T14:57:53Z

 sub     w19, w19, #1
 cbnz    w19, G_M35517_IG03

Probably better to use subs that sets flags + b.ne

AndyAyersMS · 2024-04-12T17:56:55Z

Also why do we not hoist that big immediate...?

This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial. The trip count of a loop is defined to be the number of times the header block is entered from a back edge. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths. To be able to compute the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases. Converting a counted loop to a downwards counting loop is beneficial if the index is not being used for anything else but the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. This transformation does not have that many hits (as you may imagine, the indices of loops are usually used for something else). However, once strength reduction is implemented we expect that this transformation will be significantly more important since strength reduction in many cases is going to remove all uses of an index except the mutation and the loop test. The reasoning about trip counts is itself also needed by strength reduction which also needs it to prove no overflow in various cases. Example: ```csharp private static int Foo(int[] arr, int start, int count) { int sum = 0; for (int i = 0; i < count; i++) { sum += arr[start++]; } return sum; } ``` ```diff @@ -1,20 +1,18 @@ G_M42127_IG02: ;; offset=0x0004 xor eax, eax - xor r10d, r10d test r8d, r8d jle SHORT G_M42127_IG05 - ;; size=10 bbWeight=1 PerfScore 1.75 -G_M42127_IG03: ;; offset=0x000E - mov r9d, dword ptr [rcx+0x08] + ;; size=7 bbWeight=1 PerfScore 1.50 +G_M42127_IG03: ;; offset=0x000B + mov r10d, dword ptr [rcx+0x08] ;; size=4 bbWeight=0.25 PerfScore 0.50 -G_M42127_IG04: ;; offset=0x0012 - lea r9d, [rdx+0x01] +G_M42127_IG04: ;; offset=0x000F + lea r10d, [rdx+0x01] cmp edx, dword ptr [rcx+0x08] jae SHORT G_M42127_IG06 mov edx, edx add eax, dword ptr [rcx+4*rdx+0x10] - inc r10d - cmp r10d, r8d - mov edx, r9d - jl SHORT G_M42127_IG04 - ;; size=26 bbWeight=4 PerfScore 38.00 + dec r8d + mov edx, r10d + jne SHORT G_M42127_IG04 + ;; size=23 bbWeight=4 PerfScore 37.00 ``` Fix dotnet#100915

EgorBo · 2024-05-17T16:01:57Z

Also why do we not hoist that big immediate...?

Could be related: #89213 (comment)

…into downwards counted loops (#102261) This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial. The trip count of a loop is defined to be the number of times the header block is entered. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths. To be able to compute the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases. Converting a counted loop to a downwards counting loop is beneficial if the induction variable is not being used for anything else but the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. More importantly this usually frees up a register inside the loop. This transformation does not have that many hits (as one can imagine, the IVs of loops are usually used for something else). However, once strength reduction is implemented we expect that this transformation will be significantly more important since strength reduction in many cases is going to remove all uses of an IV except the mutation and the loop test. The reasoning about trip counts is itself also needed by strength reduction which also needs it to prove no overflow in various cases. TP regressions are going to be pretty large for this change: - This enables DFS tree/loop finding in IV opts phase outside win-x64, which has cost around 0.4% TP on its own - This optimization furthermore requires us to build dominators, which comes with its own TP cost Long term we could remove these costs if we could avoid changing control flow in assertion prop and move RBO to the end of the opts loop (letting all control flow changes happen there). But for now I think we just have to pay some of the costs to allow us to do these optimizations. Example: ```csharp private static int Foo(int[] arr, int start, int count) { int sum = 0; for (int i = 0; i < count; i++) { sum += arr[start]; start++; } return sum; } ``` ```diff @@ -1,19 +1,17 @@ G_M42127_IG02: ;; offset=0x0004 xor eax, eax - xor r10d, r10d test r8d, r8d jle SHORT G_M42127_IG05 - ;; size=10 bbWeight=1 PerfScore 1.75 -G_M42127_IG03: ;; offset=0x000E - mov r9d, dword ptr [rcx+0x08] + ;; size=7 bbWeight=1 PerfScore 1.50 +G_M42127_IG03: ;; offset=0x000B + mov r10d, dword ptr [rcx+0x08] mov edx, edx ;; size=6 bbWeight=0.25 PerfScore 0.56 -G_M42127_IG04: ;; offset=0x0014 +G_M42127_IG04: ;; offset=0x0011 cmp edx, dword ptr [rcx+0x08] jae SHORT G_M42127_IG06 add eax, dword ptr [rcx+4*rdx+0x10] inc edx - inc r10d - cmp r10d, r8d - jl SHORT G_M42127_IG04 - ;; size=19 bbWeight=4 PerfScore 35.00 + dec r8d + jne SHORT G_M42127_IG04 + ;; size=16 bbWeight=4 PerfScore 34.00 ``` Fix #100915

jakobbotsch · 2024-05-30T12:14:24Z

Could be related: #89213 (comment)

Yeah, looks like it -- the constant isn't exposed in IR during hoisting.

…into downwards counted loops (dotnet#102261) This builds out some initial reasoning about trip counts of loops and utilizes it to convert upwards counted loops into downwards counted loops when beneficial. The trip count of a loop is defined to be the number of times the header block is entered. When this value can be computed the loop is called counted. The computation here is symbolic and can reason in terms of variables, such as array or span lengths. To be able to compute the trip count requires the JIT to reason about overflow and to prove various conditions related to the start and end values of the loop. For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip count if we can prove that `n < int.MaxValue`. The implementation here utilizes the logic provided by RBO to prove these conditions. In many cases we aren't able to prove them and thus must give up, but this should be improvable in an incremental fashion to handle common cases. Converting a counted loop to a downwards counting loop is beneficial if the induction variable is not being used for anything else but the loop test. In those cases our target platforms are able to combine the decrement with the exit test into a single instruction. More importantly this usually frees up a register inside the loop. This transformation does not have that many hits (as one can imagine, the IVs of loops are usually used for something else). However, once strength reduction is implemented we expect that this transformation will be significantly more important since strength reduction in many cases is going to remove all uses of an IV except the mutation and the loop test. The reasoning about trip counts is itself also needed by strength reduction which also needs it to prove no overflow in various cases. TP regressions are going to be pretty large for this change: - This enables DFS tree/loop finding in IV opts phase outside win-x64, which has cost around 0.4% TP on its own - This optimization furthermore requires us to build dominators, which comes with its own TP cost Long term we could remove these costs if we could avoid changing control flow in assertion prop and move RBO to the end of the opts loop (letting all control flow changes happen there). But for now I think we just have to pay some of the costs to allow us to do these optimizations. Example: ```csharp private static int Foo(int[] arr, int start, int count) { int sum = 0; for (int i = 0; i < count; i++) { sum += arr[start]; start++; } return sum; } ``` ```diff @@ -1,19 +1,17 @@ G_M42127_IG02: ;; offset=0x0004 xor eax, eax - xor r10d, r10d test r8d, r8d jle SHORT G_M42127_IG05 - ;; size=10 bbWeight=1 PerfScore 1.75 -G_M42127_IG03: ;; offset=0x000E - mov r9d, dword ptr [rcx+0x08] + ;; size=7 bbWeight=1 PerfScore 1.50 +G_M42127_IG03: ;; offset=0x000B + mov r10d, dword ptr [rcx+0x08] mov edx, edx ;; size=6 bbWeight=0.25 PerfScore 0.56 -G_M42127_IG04: ;; offset=0x0014 +G_M42127_IG04: ;; offset=0x0011 cmp edx, dword ptr [rcx+0x08] jae SHORT G_M42127_IG06 add eax, dword ptr [rcx+4*rdx+0x10] inc edx - inc r10d - cmp r10d, r8d - jl SHORT G_M42127_IG04 - ;; size=19 bbWeight=4 PerfScore 35.00 + dec r8d + jne SHORT G_M42127_IG04 + ;; size=16 bbWeight=4 PerfScore 34.00 ``` Fix dotnet#100915

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2024

jakobbotsch self-assigned this Apr 11, 2024

jakobbotsch added this to the Future milestone Apr 11, 2024

jakobbotsch mentioned this issue May 15, 2024

JIT: Add reasoning about loop trip counts and optimize counted loops into downwards counted loops #102261

Merged

jakobbotsch modified the milestones: Future, 9.0.0 May 15, 2024

jakobbotsch added the Priority:2 Work that is important, but not critical for the release label May 15, 2024

jakobbotsch closed this as completed in #102261 May 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Optimize counted loops by reversing them #100915

JIT: Optimize counted loops by reversing them #100915

jakobbotsch commented Apr 11, 2024

dotnet-policy-service bot commented Apr 11, 2024

EgorBo commented Apr 11, 2024 •

edited

AndyAyersMS commented Apr 12, 2024

EgorBo commented May 17, 2024

jakobbotsch commented May 30, 2024

JIT: Optimize counted loops by reversing them #100915

JIT: Optimize counted loops by reversing them #100915

Comments

jakobbotsch commented Apr 11, 2024

dotnet-policy-service bot commented Apr 11, 2024

EgorBo commented Apr 11, 2024 • edited

AndyAyersMS commented Apr 12, 2024

EgorBo commented May 17, 2024

jakobbotsch commented May 30, 2024

EgorBo commented Apr 11, 2024 •

edited