Reassociate: add global reassociation algorithm #6598

lizhengxing · 2024-05-08T01:17:44Z

This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with minimal changes.

For the code below:
foo = (a * b) * c
bar = (a * d) * c

As the upstream change states, the pass can identify that a*c is a common factor that is computed redundantly.

This is part 1 of the fix for #6593.
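To make the idea concrete, here is a minimal sketch of pair counting across commutative products. It is a simplified illustration, not the LLVM implementation: the names `Operands`, `countPairs`, and `mostCommonPair` are hypothetical, and operands are modeled as plain strings rather than IR values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: one commutative product is just a list of operand names.
using Operands = std::vector<std::string>;
using Pair = std::pair<std::string, std::string>;

// Count, for every unordered operand pair, how many times it occurs
// across all products.
std::map<Pair, int> countPairs(const std::vector<Operands> &products) {
  std::map<Pair, int> counts;
  for (const Operands &ops : products)
    for (std::size_t i = 0; i < ops.size(); ++i)
      for (std::size_t j = i + 1; j < ops.size(); ++j) {
        // Order the two names so (a, c) and (c, a) share one key.
        Pair key = ops[i] < ops[j] ? Pair(ops[i], ops[j])
                                   : Pair(ops[j], ops[i]);
        ++counts[key];
      }
  return counts;
}

// Return the pair that occurs most often. On ties, the
// lexicographically smallest pair wins because std::map is ordered.
Pair mostCommonPair(const std::vector<Operands> &products) {
  assert(!products.empty() && "need at least one product");
  Pair best;
  int bestCount = 0;
  for (const auto &entry : countPairs(products)) {
    if (entry.second > bestCount) {
      best = entry.first;
      bestCount = entry.second;
    }
  }
  return best;
}
```

For `foo = (a * b) * c` and `bar = (a * d) * c`, the only pair present in both products is `(a, c)`, so grouping those two operands first lets a later pass compute `a * c` once.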

dmpots · 2024-05-08T15:22:29Z

What is the performance impact of this change?

lizhengxing · 2024-05-08T16:59:18Z

> What is the performance impact of this change?

I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

dmpots · 2024-05-08T17:06:20Z

> > What is the performance impact of this change?
>
> I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.

lizhengxing · 2024-05-08T17:17:57Z

> > > What is the performance impact of this change?
> >
> > I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.
>
> It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.

There's no flag for this upstream change. I considered adding one, but that would require changing the interface of the Reassociate pass so the flag could be passed in, and I'm not sure that's acceptable:
FunctionPass *llvm::createReassociatePass() { return new Reassociate(); }

Anyway, I started to collect the perf number.
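As a rough illustration of the interface change being discussed, the sketch below shows a factory function that takes a flag with a default value, so existing call sites keep compiling. Everything here is a hypothetical stand-in (`ToyPass`, `ToyReassociate`, `createToyReassociatePass`); it does not use or mirror the actual LLVM or DXC classes.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a pass base class; NOT llvm::FunctionPass.
struct ToyPass {
  virtual ~ToyPass() = default;
  virtual bool usesGlobalReassociation() const = 0;
};

// Hypothetical stand-in for the Reassociate pass, carrying one option.
class ToyReassociate : public ToyPass {
  bool EnableGlobal;

public:
  explicit ToyReassociate(bool EnableGlobal) : EnableGlobal(EnableGlobal) {}
  bool usesGlobalReassociation() const override { return EnableGlobal; }
};

// The default argument keeps callers that pass no flag working unchanged.
std::unique_ptr<ToyPass> createToyReassociatePass(bool EnableGlobal = true) {
  return std::unique_ptr<ToyPass>(new ToyReassociate(EnableGlobal));
}
```

A default argument is one way to limit churn at call sites; whether that is acceptable for the real pass interface is exactly the open question in the comment above.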

dmpots · 2024-05-08T18:08:21Z

> > > > What is the performance impact of this change?
> > >
> > > I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.
> >
> > It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.
>
> There's no flag for this upstream change. I thought to use a flag, but it needs to change the interface of Reassociate Pass to pass in the flag. I'm not sure if it's acceptable. FunctionPass *llvm::createReassociatePass() { return new Reassociate(); }
>
> Anyway, I started to collect the perf number.

I don't think we need a flag for this change, but I do think we need a sanity check on the perf numbers that it is an overall positive change. For the flag, I was referring to the flag you added in the other PR (was something like -disable-agressive-reassoc). That flag was not controlling this modification, so I was just asking that we understand the perf of this change in isolation.

dmpots

I'd like to see the perf data for this change.

dmpots · 2024-05-08T23:43:31Z

> I'd like to see the perf data for this change.

Synced offline. Looks like a nice win overall in our test suite. Reduced ALU in ~40% of shaders (increased it in ~3%). A small number of shaders had occupancy impacts (~1%). There were an equal number of regressions and improvements.

Overall, this change looks to be positive.

tools/clang/test/DXC/Passes/Transforms/Reassociate/basictest.ll

llvm-beanz

I approved the PR because I think this is good, but it would be nice to rebase this on main and put the tests with the other reassociate tests in LLVM.

lizhengxing · 2024-05-15T17:50:08Z

> I approved the PR because I think this is good, but it would be nice to rebase this on main and put the tests with the other reassociate tests in LLVM.

Done. I rebased this PR on main branch and updated the tests.

This PR (#6598) pulls the upstream global reassociation algorithm change into DXC and clearly reduces redundant calculations. However, testing on a large offline suite of shaders showed that some shaders got worse compilation results and did not benefit from the upstream change. This PR adds a flag for the upstream global reassociation change, which makes it easier to roll back if a shader gets a worse compilation result because of it. This is part 2 of the fix for #6593.

Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), in this PR (#6598), it might still overlook some obvious common factors. One observed case is:

```
%Float4_0 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
%Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
%Float2_0 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
%Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

/* %Float4_1 is redundant with %Float4_0 since they invoke cbufferLoadLegacy with the same parameters */
%Float4_1 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
/* %Float4_1.w is redundant with %Float4_0.w */
%Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3
/* %Float2_1 is redundant with %Float2_0 since they invoke cbufferLoadLegacy with the same parameters */
%Float2_1 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
/* %Float2_1.y is redundant with %Float2_0.y */
%Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1
....
%11 = fmul fast float %Float4_0.w, %10
%12 = fmul fast float %11, %Float2_0.y
....
%14 = fmul fast float %Float4_1.w, %13
%15 = fmul fast float %14, %Float2_1.y

/* (%Float4_0.w * %Float2_0.y) equals (%Float4_1.w * %Float2_1.y); they should be reassociated to a common factor */
```

The upstream change can't identify this common factor because DXC doesn't know that (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when the Reassociate pass runs. Those redundancies are only eliminated later, in the GVN pass.

So that DXC can identify more common factors, this PR aggressively runs the Reassociate pass again after the GVN pass, and then runs the GVN pass again to remove the redundancies generated by that run of the Reassociate pass.

Changing the order of floating-point operations can cause precision issues. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back. This is part 3 of the fix for #6593.
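The ordering argument above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: operands are strings, a hypothetical `leaders` map stands in for what GVN would discover about redundant values, and `replaceWithLeaders`, `pairsOf`, and `sharedPairs` are illustrative names, not DXC or LLVM functions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Operands = std::vector<std::string>;
using Pair = std::pair<std::string, std::string>;

// Toy stand-in for GVN: rewrite each operand to its leader, if it has one.
Operands replaceWithLeaders(const Operands &ops,
                            const std::map<std::string, std::string> &leaders) {
  Operands out;
  for (const std::string &op : ops) {
    auto it = leaders.find(op);
    out.push_back(it == leaders.end() ? op : it->second);
  }
  return out;
}

// All unordered operand pairs of one commutative product.
std::set<Pair> pairsOf(const Operands &ops) {
  std::set<Pair> pairs;
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (std::size_t j = i + 1; j < ops.size(); ++j)
      pairs.insert(ops[i] < ops[j] ? Pair(ops[i], ops[j])
                                   : Pair(ops[j], ops[i]));
  return pairs;
}

// Number of operand pairs that two products have in common.
int sharedPairs(const Operands &x, const Operands &y) {
  std::set<Pair> px = pairsOf(x);
  std::set<Pair> py = pairsOf(y);
  int n = 0;
  for (const Pair &p : px)
    n += static_cast<int>(py.count(p));
  return n;
}
```

With the two products from the example, `{Float4_0.w, %10, Float2_0.y}` and `{Float4_1.w, %13, Float2_1.y}`, no pair is shared while the duplicate loads still have different names. Once the second product's operands are rewritten to their leaders, the pair `(Float2_0.y, Float4_0.w)` appears in both, which is the kind of common factor a second Reassociate run could then group.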

This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with minimal changes.

For the code below:
foo = (a * b) * c
bar = (a * d) * c

As the upstream change states, the pass can identify that a*c is a common factor that is computed redundantly.

This is part 1 of the fix for #6593.

(cherry picked from commit 6f9c107)

lizhengxing requested a review from a team as a code owner May 8, 2024 01:17

lizhengxing requested review from tex3d, pow2clk, bogner, llvm-beanz and dmpots May 8, 2024 01:24

python3kgae approved these changes May 8, 2024

lizhengxing requested a review from python3kgae May 8, 2024 15:08

dmpots requested changes May 8, 2024

dmpots approved these changes May 8, 2024

llvm-beanz reviewed May 9, 2024

llvm-beanz approved these changes May 10, 2024

llvm-beanz reviewed May 10, 2024

lizhengxing force-pushed the zhengxingli/common-factor-optimization-upstream branch from e5a215b to 7776de8 Compare May 15, 2024 17:47

lizhengxing mentioned this pull request May 15, 2024

Add a flag for the upstream global reassociation algorithm change #6625

Merged

lizhengxing mentioned this pull request May 15, 2024

More aggressive reassociations #6626

Merged

tex3d approved these changes May 21, 2024

lizhengxing force-pushed the zhengxingli/common-factor-optimization-upstream branch from 7776de8 to e3a604a Compare May 21, 2024 17:31

lizhengxing merged commit 6f9c107 into main May 21, 2024
13 checks passed

lizhengxing mentioned this pull request May 30, 2024

Non-determinism in Reassociate caused by address coincidences #6659

Open
