Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Optimize counted loops by reversing them #100915

Closed
jakobbotsch opened this issue Apr 11, 2024 · 5 comments · Fixed by #102261
Closed

JIT: Optimize counted loops by reversing them #100915

jakobbotsch opened this issue Apr 11, 2024 · 5 comments · Fixed by #102261
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@jakobbotsch
Copy link
Member

The JIT should be able to optimize counted loops like

private static void Foo()
{
    for (int i = 0; i < 100; i++)
    {
        Bar();
    }
}

to the equivalent downwards counting loop, which is more efficient on both x64 and arm64.

Current codegen:
x64:

G_M35517_IG02:  ;; offset=0x0005
       xor      ebx, ebx
						;; size=2 bbWeight=1 PerfScore 0.25

G_M35517_IG03:  ;; offset=0x0007
       call     [RangeCheck_Overflow:Bar()]
       inc      ebx
       cmp      ebx, 100
       jl       SHORT G_M35517_IG03
						;; size=13 bbWeight=4 PerfScore 18.00

arm64:

G_M35517_IG02:  ;; offset=0x000C
            mov     w19, wzr
						;; size=4 bbWeight=1 PerfScore 0.50

G_M35517_IG03:  ;; offset=0x0010
            movz    x0, #0x6820      // code for Program:Bar()
            movk    x0, #0x6A29 LSL #16
            movk    x0, #0x7FF9 LSL #32
            ldr     x0, [x0]
            blr     x0
            add     w19, w19, #1
            cmp     w19, #100
            blt     G_M35517_IG03
						;; size=32 bbWeight=4 PerfScore 30.00

Expected codegen:
x64:

G_M35517_IG02:  ;; offset=0x0005
       mov      ebx, 100
						;; size=5 bbWeight=1 PerfScore 0.25

G_M35517_IG03:  ;; offset=0x000A
       call     [RangeCheck_Overflow:Bar()]
       dec      ebx
       jne      SHORT G_M35517_IG03
						;; size=10 bbWeight=4 PerfScore 17.00

arm64:

G_M35517_IG02:  ;; offset=0x000C
            mov     w19, #100
						;; size=4 bbWeight=1 PerfScore 0.50

G_M35517_IG03:  ;; offset=0x0010
            movz    x0, #0x6820      // code for Program:Bar()
            movk    x0, #0x6A2B LSL #16
            movk    x0, #0x7FF9 LSL #32
            ldr     x0, [x0]
            blr     x0
            sub     w19, w19, #1
            cbnz    w19, G_M35517_IG03
						;; size=28 bbWeight=4 PerfScore 28.00
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2024
@jakobbotsch jakobbotsch self-assigned this Apr 11, 2024
@jakobbotsch jakobbotsch added this to the Future milestone Apr 11, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Apr 11, 2024

 sub     w19, w19, #1
 cbnz    w19, G_M35517_IG03

Probably better to use subs that sets flags + b.ne

@AndyAyersMS
Copy link
Member

Also why do we not hoist that big immediate...?

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue May 15, 2024
This builds out some initial reasoning about trip counts of loops and
utilizes it to convert upwards counted loops into downwards counted
loops when beneficial.

The trip count of a loop is defined to be the number of times the header
block is entered from a back edge. When this value can be computed the
loop is called counted. The computation here is symbolic and can reason
in terms of variables, such as array or span lengths.

To be able to compute the trip count requires the JIT to reason about
overflow and to prove various conditions related to the start and end
values of the loop. For example, a loop `for (int i = 0; i <= n; i++)`
only has a determinable trip count if we can prove that `n <
int.MaxValue`. The implementation here utilizes the logic provided by
RBO to prove these conditions. In many cases we aren't able to prove
them and thus must give up, but this should be improvable in an
incremental fashion to handle common cases.

Converting a counted loop to a downwards counting loop is beneficial if
the index is not being used for anything else but the loop test. In
those cases our target platforms are able to combine the decrement with
the exit test into a single instruction.

This transformation does not have that many hits (as you may imagine,
the indices of loops are usually used for something else). However, once
strength reduction is implemented we expect that this transformation
will be significantly more important since strength reduction in many
cases is going to remove all uses of an index except the mutation and
the loop test.

The reasoning about trip counts is itself also needed by strength
reduction which also needs it to prove no overflow in various cases.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start++];
    }

    return sum;
}
```

```diff
@@ -1,20 +1,18 @@
 G_M42127_IG02:  ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-						;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:  ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:  ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
 						;; size=4 bbWeight=0.25 PerfScore 0.50
-G_M42127_IG04:  ;; offset=0x0012
-       lea      r9d, [rdx+0x01]
+G_M42127_IG04:  ;; offset=0x000F
+       lea      r10d, [rdx+0x01]
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        mov      edx, edx
        add      eax, dword ptr [rcx+4*rdx+0x10]
-       inc      r10d
-       cmp      r10d, r8d
-       mov      edx, r9d
-       jl       SHORT G_M42127_IG04
-						;; size=26 bbWeight=4 PerfScore 38.00
+       dec      r8d
+       mov      edx, r10d
+       jne      SHORT G_M42127_IG04
+						;; size=23 bbWeight=4 PerfScore 37.00
```

Fix dotnet#100915
@jakobbotsch jakobbotsch modified the milestones: Future, 9.0.0 May 15, 2024
@jakobbotsch jakobbotsch added the Priority:2 Work that is important, but not critical for the release label May 15, 2024
@EgorBo
Copy link
Member

EgorBo commented May 17, 2024

Also why do we not hoist that big immediate...?

Could be related: #89213 (comment)

jakobbotsch added a commit that referenced this issue May 30, 2024
…into downwards counted loops (#102261)

This builds out some initial reasoning about trip counts of loops and utilizes
it to convert upwards counted loops into downwards counted loops when
beneficial.

The trip count of a loop is defined to be the number of times the header block
is entered. When this value can be computed the loop is called counted. The
computation here is symbolic and can reason in terms of variables, such as array
or span lengths.

To be able to compute the trip count requires the JIT to reason about overflow
and to prove various conditions related to the start and end values of the loop.
For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip
count if we can prove that `n < int.MaxValue`. The implementation here utilizes
the logic provided by RBO to prove these conditions. In many cases we aren't
able to prove them and thus must give up, but this should be improvable in an
incremental fashion to handle common cases.

Converting a counted loop to a downwards counting loop is beneficial if the
induction variable is not being used for anything else but the loop test. In
those cases our target platforms are able to combine the decrement with the exit
test into a single instruction. More importantly this usually frees up a
register inside the loop.

This transformation does not have that many hits (as one can imagine, the IVs of
loops are usually used for something else). However, once strength reduction is
implemented we expect that this transformation will be significantly more
important since strength reduction in many cases is going to remove all uses of
an IV except the mutation and the loop test.

The reasoning about trip counts is itself also needed by strength reduction
which also needs it to prove no overflow in various cases.

TP regressions are going to be pretty large for this change:
- This enables DFS tree/loop finding in IV opts phase outside win-x64, which has
  cost around 0.4% TP on its own
- This optimization furthermore requires us to build dominators, which comes
  with its own TP cost

Long term we could remove these costs if we could avoid changing control flow in
assertion prop and move RBO to the end of the opts loop (letting all control
flow changes happen there). But for now I think we just have to pay some of the
costs to allow us to do these optimizations.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start];
        start++;
    }

    return sum;
}
```

```diff
@@ -1,19 +1,17 @@
 G_M42127_IG02:  ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-						;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:  ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:  ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
        mov      edx, edx
 						;; size=6 bbWeight=0.25 PerfScore 0.56
-G_M42127_IG04:  ;; offset=0x0014
+G_M42127_IG04:  ;; offset=0x0011
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
-       inc      r10d
-       cmp      r10d, r8d
-       jl       SHORT G_M42127_IG04
-						;; size=19 bbWeight=4 PerfScore 35.00
+       dec      r8d
+       jne      SHORT G_M42127_IG04
+						;; size=16 bbWeight=4 PerfScore 34.00
```

Fix #100915
@jakobbotsch
Copy link
Member Author

Could be related: #89213 (comment)

Yeah, looks like it -- the constant isn't exposed in IR during hoisting.

Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this issue May 30, 2024
…into downwards counted loops (dotnet#102261)

This builds out some initial reasoning about trip counts of loops and utilizes
it to convert upwards counted loops into downwards counted loops when
beneficial.

The trip count of a loop is defined to be the number of times the header block
is entered. When this value can be computed the loop is called counted. The
computation here is symbolic and can reason in terms of variables, such as array
or span lengths.

To be able to compute the trip count requires the JIT to reason about overflow
and to prove various conditions related to the start and end values of the loop.
For example, a loop `for (int i = 0; i <= n; i++)` only has a determinable trip
count if we can prove that `n < int.MaxValue`. The implementation here utilizes
the logic provided by RBO to prove these conditions. In many cases we aren't
able to prove them and thus must give up, but this should be improvable in an
incremental fashion to handle common cases.

Converting a counted loop to a downwards counting loop is beneficial if the
induction variable is not being used for anything else but the loop test. In
those cases our target platforms are able to combine the decrement with the exit
test into a single instruction. More importantly this usually frees up a
register inside the loop.

This transformation does not have that many hits (as one can imagine, the IVs of
loops are usually used for something else). However, once strength reduction is
implemented we expect that this transformation will be significantly more
important since strength reduction in many cases is going to remove all uses of
an IV except the mutation and the loop test.

The reasoning about trip counts is itself also needed by strength reduction
which also needs it to prove no overflow in various cases.

TP regressions are going to be pretty large for this change:
- This enables DFS tree/loop finding in IV opts phase outside win-x64, which has
  cost around 0.4% TP on its own
- This optimization furthermore requires us to build dominators, which comes
  with its own TP cost

Long term we could remove these costs if we could avoid changing control flow in
assertion prop and move RBO to the end of the opts loop (letting all control
flow changes happen there). But for now I think we just have to pay some of the
costs to allow us to do these optimizations.

Example:

```csharp
private static int Foo(int[] arr, int start, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)
    {
        sum += arr[start];
        start++;
    }

    return sum;
}
```

```diff
@@ -1,19 +1,17 @@
 G_M42127_IG02:  ;; offset=0x0004
        xor      eax, eax
-       xor      r10d, r10d
        test     r8d, r8d
        jle      SHORT G_M42127_IG05
-						;; size=10 bbWeight=1 PerfScore 1.75
-G_M42127_IG03:  ;; offset=0x000E
-       mov      r9d, dword ptr [rcx+0x08]
+						;; size=7 bbWeight=1 PerfScore 1.50
+G_M42127_IG03:  ;; offset=0x000B
+       mov      r10d, dword ptr [rcx+0x08]
        mov      edx, edx
 						;; size=6 bbWeight=0.25 PerfScore 0.56
-G_M42127_IG04:  ;; offset=0x0014
+G_M42127_IG04:  ;; offset=0x0011
        cmp      edx, dword ptr [rcx+0x08]
        jae      SHORT G_M42127_IG06
        add      eax, dword ptr [rcx+4*rdx+0x10]
        inc      edx
-       inc      r10d
-       cmp      r10d, r8d
-       jl       SHORT G_M42127_IG04
-						;; size=19 bbWeight=4 PerfScore 35.00
+       dec      r8d
+       jne      SHORT G_M42127_IG04
+						;; size=16 bbWeight=4 PerfScore 34.00
```

Fix dotnet#100915
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Projects
None yet
3 participants