allocation order propagation for matmul/linear #2198

Open
jjsjann123 opened this issue May 3, 2024 · 4 comments

@jjsjann123
Collaborator

This issue was raised by @jacobhinkle.

We would like allocation order inference to populate proper allocation domain for inputs to matmul/linear ops.

i.e.

tv0 = fusion.define_tensor(...)
tv1 = fusion.define_tensor(...)
# magic operations that produce `tv0_derived` and `tv1_derived`

tv_out = fusion.ops.matmul(tv0_derived, tv1_derived)
# ...

With a vanilla fusion, tv0_derived and tv1_derived will have an empty allocation domain. This is not ideal, especially when tv0 and tv1 come in with a non-trivial allocation_domain.
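For concreteness, here is a minimal sketch of the situation using the nvFuser Python frontend. It assumes `define_tensor` accepts a `stride_order` argument and that `fd.ops.matmul` is available in the build; the shapes and the pointwise ops standing in for the "magic operations" are made up for illustration.

from nvfuser import FusionDefinition, DataType

with FusionDefinition() as fd:
    # tv0 is declared with a non-default stride order (a transposed,
    # column-major-like layout); exact stride_order semantics depend on the
    # frontend version.
    tv0 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                           dtype=DataType.Half, stride_order=[0, 1])
    tv1 = fd.define_tensor(shape=[-1, -1], contiguity=[True, True],
                           dtype=DataType.Half)

    # stand-ins for the "magic operations" above
    tv0_derived = fd.ops.relu(tv0)
    tv1_derived = fd.ops.relu(tv1)

    # Without propagation, tv0_derived / tv1_derived carry an empty allocation
    # domain, so the matmul scheduler cannot tell that tv0 is transposed in memory.
    tv_out = fd.ops.matmul(tv0_derived, tv1_derived)
    fd.add_output(tv_out)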

The ask here is:

  1. We want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
  2. The targets of the propagation are recognized simply as the inputs to matmul/linear operations (or by other pattern matching that we want to apply).
  3. We do NOT need to populate the allocation_order for tv_out; that is better done by the scheduler. (A sketch of such a pass follows this list.)
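This is not nvFuser's actual pass or API, but a rough Python sketch of the propagation being asked for. The helpers (`exprs`, `op_type`, `allocation_order`, `definition`) are hypothetical stand-ins for the real IR interfaces, and a real pass would also walk back through longer producer chains:

def propagate_allocation_order(fusion):
    """Copy allocation order onto tensors that feed matmul/linear (items 1-2),
    leaving the matmul/linear output for the scheduler to decide (item 3)."""
    for expr in fusion.exprs():                        # hypothetical traversal helper
        if expr.op_type not in ("matmul", "linear"):   # item 2: pattern-match the targets
            continue
        for tv in expr.inputs():
            if tv.allocation_order is not None:        # already populated, leave it alone
                continue
            producer = tv.definition()                 # the op that produced this tensor
            if producer is None:
                continue                               # a fusion input with no order set
            # item 1: inherit the order from a producer operand that has one set
            for src in producer.inputs():
                if src.allocation_order is not None:
                    tv.allocation_order = src.allocation_order
                    break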
@jacobhinkle
Collaborator

jacobhinkle commented May 5, 2024

Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation? What if this was done during segmentation instead: when we get heuristics, we could also query the output allocation domains. If we do that in topological order, we would have the proper allocation domain available when computing heuristics and during scheduling.

@jjsjann123
Collaborator Author

Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation?

The challenge here is to: 1. identify the boundary of each segment before segmentation happens; 2. know how each segment's I/O tensors would be mutated into a different memory format by its scheduler.

What if this was done during segmentation instead: when we get heuristics, we could also query the output allocation domains. If we do that in topological order, we would have the proper allocation domain available when computing heuristics and during scheduling.

IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty alloc dom of its output TensorView and properly handing that to the next segment? Yeah, that would be good to have as well.
With that said, having a global pass to coordinate across all fusion segments seems reasonable to have.
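To make the segment-by-segment alternative concrete, here is an illustrative sketch (all names are hypothetical, not nvFuser's API) of filling in allocation orders in topological order, so each segment's heuristics see the layouts chosen upstream:

def assign_allocation_orders(segmented_fusion):
    decided = {}  # tensor -> allocation order chosen by an upstream segment
    for segment in segmented_fusion.topological_order():
        # Segment inputs inherit whatever upstream segments already decided.
        for tv in segment.inputs():
            if tv in decided and tv.allocation_order is None:
                tv.allocation_order = decided[tv]
        # The segment's heuristics/scheduler pick its output layouts ...
        heuristics = segment.compute_heuristics()
        # ... which then become visible to downstream segments.
        for tv in segment.outputs():
            decided[tv] = heuristics.output_allocation_order(tv)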

@jjsjann123
Collaborator Author

Question for @jacobhinkle: is the ask above what you were expecting from allocation order inference for now?

The ask here is:

  1. We want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
  2. The targets of the propagation are recognized simply as the inputs to matmul/linear operations (or by other pattern matching that we want to apply).
  3. We do NOT need to populate the allocation_order for tv_out; that is better done by the scheduler.

@jacobhinkle
Collaborator

IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty alloc dom of its output TensorView and properly handing that to the next segment?

Something like that, yes. For example, in #2169 we might want to temporarily disallow matmul segments with a bias whose stride order does not match the output's. At a minimum, though, we'd want to have this available during proposeHeuristics and SchedulerEntry::makeEntry, which happen after segmentation is done and the runtime order is determined. That way we'd be able to reliably infer the layout of matmuls based on input strides.
