Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reassociate: add global reassociation algorithm #6598

Merged
merged 1 commit into from
May 21, 2024

Conversation

lizhengxing
Copy link
Collaborator

This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
foo = (a * b) * c
bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.

@lizhengxing lizhengxing requested a review from a team as a code owner May 8, 2024 01:17
@dmpots
Copy link
Member

dmpots commented May 8, 2024

What is the performance impact of this change?

@lizhengxing
Copy link
Collaborator Author

What is the performance impact of this change?

I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

@dmpots
Copy link
Member

dmpots commented May 8, 2024

What is the performance impact of this change?

I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.

@lizhengxing
Copy link
Collaborator Author

lizhengxing commented May 8, 2024

What is the performance impact of this change?

I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.

There's no flag for this upstream change. I thought to use a flag, but it needs to change the interface of Reassociate Pass to pass in the flag. I'm not sure if it's acceptable.
FunctionPass *llvm::createReassociatePass() { return new Reassociate(); }

Anyway, I started to collect the perf number.

@dmpots
Copy link
Member

dmpots commented May 8, 2024

What is the performance impact of this change?

I haven't run the perf tests. But I believe the impact's smaller than the one I got before. That's why I want to merge the 2 PRs together or merge them consecutively.

It would be good to get the perf numbers for this change to understand the impact of each part. Also, since it will be possible to run this change by itself with a flag it would be good to understand what the perf looks like.

There's no flag for this upstream change. I thought to use a flag, but it needs to change the interface of Reassociate Pass to pass in the flag. I'm not sure if it's acceptable. FunctionPass *llvm::createReassociatePass() { return new Reassociate(); }

Anyway, I started to collect the perf number.

I don't think we need a flag for this change, but I do think we need a sanity check on the perf numbers that it is an overall positive change. For the flag, I was referring to the flag you added in the other PR (was something like -disable-agressive-reassoc). That flag was not controlling this modification, so I was just asking that we understand the perf of this change in isolation.

Copy link
Member

@dmpots dmpots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see the perf data for this change.

@dmpots
Copy link
Member

dmpots commented May 8, 2024

I'd like to see the perf data for this change.

Synced offline. Looks like a nice win overall in our test suite. Reduced ALU in ~40% of shaders (increased it in ~3%). A small number of shaders had occupancy impacts (~1%). There were an equal number of regressions and improvements.

Overall, this change looks to be positive.

Copy link
Collaborator

@llvm-beanz llvm-beanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approved the PR because I think this is good, but it would be nice to rebase this on main and put the tests with the other reassociate tests in LLVM.

@lizhengxing lizhengxing force-pushed the zhengxingli/common-factor-optimization-upstream branch from e5a215b to 7776de8 Compare May 15, 2024 17:47
@lizhengxing
Copy link
Collaborator Author

I approved the PR because I think this is good, but it would be nice to rebase this on main and put the tests with the other reassociate tests in LLVM.

Done. I rebased this PR on main branch and updated the tests.

lizhengxing added a commit that referenced this pull request May 15, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 15, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 15, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %2 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %3 = extractvalue %dx.types.CBufRet.f32 %2, 3
  %4 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %5 = extractvalue %dx.types.CBufRet.f32 %4, 1
  %6 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %7 = extractvalue %dx.types.CBufRet.f32 %6, 3
  %8 = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %9 = extractvalue %dx.types.CBufRet.f32 %8, 1
  ....
  %11 = fmul fast float %3, %10
  %12 = fmul fast float %11, %5
  ....
  %14 = fmul fast float %7, %13
  %15 = fmul fast float %14, %9 ---> %3*%5 == %7*%9 --> they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%3, %7) and (%7, %9) are redundant when running Reassociate pass. The redundancy of (%3, %7) and (%7, %9) will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable enable-aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 16, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 16, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %4, 1
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)  ---> %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %6, 3                                                     ---> %Float4_1.w is redundant with %Float4_0.w
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)  ---> %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %8, 1                                                     ---> %Float2_1.y is redundant with %Float2_0.y
  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y    ---> (%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y) --> they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 16, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 17, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 17, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 17, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
This PR pulls the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c), into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common factor and redundant.

This is part 1 of the fix for #6593.
@lizhengxing lizhengxing force-pushed the zhengxingli/common-factor-optimization-upstream branch from 7776de8 to e3a604a Compare May 21, 2024 17:31
@lizhengxing lizhengxing merged commit 6f9c107 into main May 21, 2024
13 checks passed
lizhengxing added a commit that referenced this pull request May 21, 2024
This PR (#6598) pulls the upstream global reassociation algorithm change in DXC and can reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders, some shaders got worse compilation results and couldn't benefit from this upstream change.

This PR adds a flag for the upstream global reassociation change. It would be easier to roll back if a shader get worse compilation result due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 21, 2024
)

This PR (#6598)
pulls the upstream global reassociation algorithm change in DXC and can
reduce redundant calculations obviously.

However, from the testing result of a large offline suite of shaders,
some shaders got worse compilation results and couldn't benefit from
this upstream change.

This PR adds a flag for the upstream global reassociation change. It
would be easier to roll back if a shader get worse compilation result
due to this upstream change.

This is part 2 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 21, 2024
Although DXC applied the upstream change, Reassociate: add global reassociation algorithm (llvm/llvm-project@b8a330c) in this PR (#6598), it still might overlook some obvious common factors.

One case has been observed is:
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

The upstream change can't identify this common factor because DXC doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y) are redundant when running Reassociate pass. Those redundancies will be eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run Reassociate pass again after GVN pass and then run GVN pass again to remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision issue. In case some shaders get unexpected results due to this PR, use "-opt-disable aggressive-reassociation" to disable this PR and roll back.

This is part 3 of the fix for #6593.
lizhengxing added a commit that referenced this pull request May 21, 2024
Although DXC applied the upstream change, Reassociate: add global
reassociation algorithm
(llvm/llvm-project@b8a330c) in this PR
(#6598), it still
might overlook some obvious common factors.

One case has been observed is:
```
  %Float4_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)
  %Float4_0.w = extractvalue %dx.types.CBufRet.f32 %Float4_0, 3
  %Float2_0   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)
  %Float2_0.y = extractvalue %dx.types.CBufRet.f32 %Float2_0, 1

  /* %Float4_1 is redundant with %Float4_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float4_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 1)

  /* %Float4_1.w is redundant with %Float4_0.w */
  %Float4_1.w = extractvalue %dx.types.CBufRet.f32 %Float4_1, 3 

  /* %Float2_1 is redundant with %Float2_0 since they invokes cbufferLoadLegacy with same parameters */
  %Float2_1   = call %dx.types.CBufRet.f32 @dx.op.cbufferLoadLegacy.f32(i32 59, %dx.types.Handle %1, i32 0)

  /* %Float2_1.y is redundant with %Float2_0.y */
  %Float2_1.y = extractvalue %dx.types.CBufRet.f32 %Float2_1, 1

  ....
  %11 = fmul fast float %Float4_0.w, %10
  %12 = fmul fast float %11, %Float2_0.y
  ....
  %14 = fmul fast float %Float4_1.w, %13
  %15 = fmul fast float %14, %Float2_1.y 

(%Float4_0.w * %Float2_0.y) equals to (%Float4_1.w * %Float2_1.y), they should be reassociated to a common factor

```
The upstream change can't identify this common factor because DXC
doesn't know (%Float4_0.w, %Float4_1.w) and (%Float2_0.y, %Float2_1.y)
are redundant when running Reassociate pass. Those redundancies will be
eliminated in GVN pass.

For DXC can identify more common factors, this PR will aggressively run
Reassociate pass again after GVN pass and then run GVN pass again to
remove the redundancies generared in this run of Reassociate pass.

Changing the order of floating point operations causes the precision
issue. In case some shaders get unexpected results due to this PR, use
"-opt-disable aggressive-reassociation" to disable this PR and roll
back.

This is part 3 of the fix for
#6593.
lizhengxing added a commit that referenced this pull request May 21, 2024
This PR pulls the upstream change, Reassociate: add global reassociation
algorithm
(llvm/llvm-project@b8a330c),
into DXC with miminal changes.

For the code below:
  foo = (a * b) * c
  bar = (a * d) * c

As the upstream change states, it can identify the a*c is a common
factor and redundant.

This is part 1 of the fix for #6593.

(cherry picked from commit 6f9c107)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

5 participants