Reduction scheduler does not handle allocation domain properly and triggers an assert when the reduction output has a specified allocation domain #2202
Adding another Python example, for thunder einsum, that fails when my simple allocation domain propagation is enabled. Note that this is only for my own reference for a repro.
cc'ing @liqiangxl in case this is something you might be interested in. 😜

We should take a couple of reduction/normalization schedulers and iterate through permutations of the allocation domain (from rfactor) for inputs/outputs. That should verify that the schedulers are properly handling allocation domain.
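A hedged sketch of what that permutation sweep could look like (pure Python; `enumerate_layout_cases` is a hypothetical helper, not an existing nvFuser API — a real test would build the fusion and set the allocation domains for each pair before running the scheduler):

```python
import itertools

def layout_permutations(ndims):
    """All possible allocation-order permutations of an ndims-D tensor."""
    return list(itertools.permutations(range(ndims)))

def enumerate_layout_cases(ndims):
    """Yield every (input_order, output_order) pair a scheduler test
    would need to cover for an ndims-D reduction."""
    for in_order in layout_permutations(ndims):
        for out_order in layout_permutations(ndims):
            yield in_order, out_order

cases = list(enumerate_layout_cases(3))
# 3! * 3! = 36 layout combinations for a 3-D input/output pair.
```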
Highlighting a comment from #2200 (comment) regarding comprehensive test cases for allocation domain support in the reduction scheduler. We'll want some representative end-to-end tests to validate our functional support.

Could you add the error message here?
The old assert happens here: `Fuser/csrc/scheduler/utils.cpp`, line 1051 (at 5af7104).
That one can be patched with #2201. But afterwards, our vectorization is wrong, since the input tensor cannot be vectorized.

So, with #2201, the issue is now with the vectorization analysis?

That's what is failing right now; admittedly it is a problem, though I don't think it's the root cause. #2201 just patched the assert, but the scheduler is still wrong... I think the reduction scheduler should be looking at the reduction's inputs instead of its output to decide which reduction variant to use (inner vs outer). Does that sound right to you as well?

Admittedly, I need to refresh my memory of the reduction scheduler, but why does it matter? Because the input and output may have different orderings of allocation domains?

Exactly, and that would change the read pattern, which I think dictates which variant of the reduction scheduler we should use.
I think it would be nice to have some simple concrete example.

Sounds good. I'll get that in place to help with the discussion.

Does this simpler example look better?

When would this happen in practice?
^^^ That's a valid argument. Do we want to have reduction domains in the allocation domain at all, and what meaning do they carry? To keep the question simpler, we can assume that the order is propagated from inputs. How about this example:

We can see here that the allocation order is propagated from input to output. So maybe it makes more sense?! On ToT, this example triggers the same issue.
Yeah, we don't allocate those domains, so it seems to make more sense to drop them from allocation domains. In this example, IIUC, both the input and the output just look like an inner reduction, so nothing is inconsistent?

You are right. The second example does not have the inconsistency. The problem with this one is that our reduction scheduler isn't handling allocation domain properly, and that's where the assert happens. So really there are two problems:

For the performance issue, we can look at the example below; these two reductions are identical in concept.
Meanwhile, this is the simple logic our reduction scheduler uses to choose between inner vs outer reduction: `Fuser/csrc/scheduler/utils.cpp`, lines 2224 to 2239 (at 66b2a0a).
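In spirit (a pure-Python sketch, not the linked C++), the decision boils down to whether the fastest-varying allocated dimension of the tensor being examined is a reduction dimension:

```python
def is_inner_reduction(alloc_order, reduction_dims):
    """alloc_order: logical dim indices, outermost to innermost.
    reduction_dims: set of logical dims being reduced.
    Inner reduction <=> the innermost allocated dim is reduced."""
    return alloc_order[-1] in reduction_dims

# Reducing dim 1 of a 2-D tensor: with a row-major layout the reduced
# dim is innermost (inner reduction); with a column-major layout the
# same logical reduction reads with an outer-reduction pattern.
print(is_inner_reduction((0, 1), {1}))  # True  (row-major)
print(is_inner_reduction((1, 0), {1}))  # False (column-major)
```

Which tensor's `alloc_order` gets passed in is exactly the point of contention above: the scheduler currently consults the output, while the read pattern is determined by the inputs.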
FYI, the reference

The examples make sense, but isn't the performance problem fixed by the allocation order propagation?
Touché. We can discuss whether we want to prioritize the performance issue; maybe it can wait until the issue actually shows up. I'm wondering how channels-last vision models would behave with this (convolution followed by normalization).

Allocation order propagation doesn't always fix the performance problem in practice. The user could specify an output format, which is not something the propagation can change; also, the propagation is currently done on the whole fusion segment, so it could struggle to figure out the right TVs to propagate from/to. I'll follow up on the performance issue in a separate thread.

So, back to the original issue this one is about: the scheduler should at least be functional, and we need to patch that.
This sounds like a user specifying different allocation orderings between inputs and outputs. There doesn't seem to be any obviously right decision here, i.e., whether the input order or the output order should be adopted. I hope this is uncommon.

Is it fixed by #2201?

It's not properly fixed. The cpp example in that PR still has wrong vectorization after the patch.
Refactored allocation order inference pass:

* Instead of per-operation propagation rules, we now use IdModel mapping to map the allocation domain of a reference tensor to the rfactor domain of a target tensor.
* Updated the inference API to allow specifying sources and destinations for the propagation:

```
void inferenceAllocationOrder(
    Fusion* fusion,
    const std::vector<TensorView*>& srcs,
    const std::vector<TensorView*>& dsts);
```

* The propagation tries to keep the memory format of `dsts` close to that of `srcs`, to simplify scheduling as well as to facilitate vectorization. It works roughly as:
  * For each entry `dst`, among all its producers in `srcs`, we find the one with the most loop iter domains in its allocation domain and use it as the reference `ref`.
  * We try to map each iter domain in `dst`'s rfactor domain to `ref`'s allocation domain, pushing the mapped ones as inner dimensions in `dst`'s new allocation domain and the unmapped iter domains as outer dimensions.
* I had to put in a WAR for the mapping logic for now, since the reduction scheduler struggles with permuted output; see issue #2202. The WAR simply preserves each reduction iter domain's existing position in the rfactor domain in the new allocation domain. This WAR is supposed to be removed once we fix the reduction scheduler; I kept both code paths in the PR for easier future cleanup.

---------

Co-authored-by: Naoya Maruyama <[email protected]>
Co-authored-by: Jingyue Wu <[email protected]>
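The mapped-inner / unmapped-outer ordering described above can be sketched in plain Python (hypothetical `propagate_alloc_order` helper, using string labels in place of real IterDomains and the IdModel mapping):

```python
def propagate_alloc_order(ref_alloc, dst_rfactor):
    """ref_alloc: reference tensor's allocation order (outer -> inner).
    dst_rfactor: labels of dst's rfactor domain.
    Dims of dst that map to ref keep ref's relative order and are
    pushed innermost; unmapped dims are pushed outermost."""
    mapped = [d for d in ref_alloc if d in dst_rfactor]
    unmapped = [d for d in dst_rfactor if d not in mapped]
    return unmapped + mapped

# dst carries an extra dim "b" that doesn't exist in the NHWC-ordered ref:
order = propagate_alloc_order(["n", "h", "w", "c"], ["b", "c", "n", "h", "w"])
# -> ["b", "n", "h", "w", "c"]: mapped dims innermost in ref's order,
#    the unmapped "b" pushed outermost.
```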
Found a repro here:

I tried to patch it by just fixing the reduction properties (#2201), but that hit another issue with vectorization.

I realized that the reduction scheduler assumes the inputs to a reduction operation have the same allocation domain as its output, which is not true.
IIUC, an inner reduction means that the reduction dimension contains the innermost iter domain on the inputs, rather than on its output.
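A minimal Python illustration of that point (hypothetical dim labels, not nvFuser code): when the input is row-major but the output's allocation domain is permuted, a decision based on the output disagrees with one based on the input.

```python
def innermost_is_reduced(alloc_order, reduction_dims):
    """True if the innermost allocated dim is a reduction dim."""
    return alloc_order[-1] in reduction_dims

reduction_dims = {"i1"}
input_alloc = ["i0", "i1"]   # row-major input: reduced dim i1 is innermost
output_alloc = ["i1", "i0"]  # output with a permuted allocation domain

# Judged from the input, this is an inner reduction; judged from the
# output, it looks like an outer one -- so a scheduler that only
# consults the output picks the wrong variant for this read pattern.
print(innermost_is_reduced(input_alloc, reduction_dims))   # True
print(innermost_is_reduced(output_alloc, reduction_dims))  # False
```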