[BACKEND] Add support to convert INT8 MMAV2 accumulator layout to dot_operand layout #3595

Open · wants to merge 11 commits into base: main

Conversation

@tongyuantongyu (Contributor) commented on Apr 7, 2024

Partial fix for #3580. Resolves the INT8 layoutC -> INT8 layoutA case.

  • Port MMAV3's register shuffling to support the MMAV2 layout.
  • Simplify the shuffling logic.

@ThomasRaoux (Collaborator):

I haven't reviewed in detail since it is failing the tests. I'm not sure I understand why the code sequence for mma to dot_operand(fp8) has changed.

> The FP16 case is a bit tough. convert_layout only knows that the tensor is in #mma layout, but has no idea which MMA exactly. #mma layouts of different MMAs are different. Guessing from the element type is not reliable, since the user may (and for INT8/FP8 MMA, has to) cast between types.

I don't understand. Could you give an example of how the MMA format differs based on the type?

Comment on lines -189 to -190:

    // CHECK: prmt.b32
    // CHECK: prmt.b32

Collaborator:

Why are those removed?

Contributor Author:

These two correspond to the prmt instructions using selectorEx0 and selectorEx1. Those two patterns simply select one of the two values, so a select is sufficient and no prmt is required here.
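For reference, a minimal Python sketch (hypothetical, not code from this PR; the concrete selectorEx0/selectorEx1 values are not shown in this thread) of prmt.b32 in its default mode, illustrating why a selector that draws all four bytes from a single source register degenerates into a plain select:

```python
# Hypothetical emulation of PTX prmt.b32 (default mode, ignoring the msb-replication bit).
def prmt_b32(a: int, b: int, selector: int) -> int:
    # The byte pool is a's 4 bytes (indices 0-3) followed by b's 4 bytes (indices 4-7);
    # each selector nibble picks one pool byte for the corresponding output byte.
    pool = [(a >> (8 * i)) & 0xFF for i in range(4)] + [(b >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        idx = (selector >> (4 * i)) & 0x7
        out |= pool[idx] << (8 * i)
    return out

a, b = 0x11223344, 0xAABBCCDD
assert prmt_b32(a, b, 0x3210) == a  # every byte taken from a
assert prmt_b32(a, b, 0x7654) == b  # every byte taken from b
# When a selector has this degenerate form, the prmt just picks one whole register,
# so a select between the two values is enough and no byte shuffle is needed.
```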

    rewriter.replaceOp(op, result);
    }

    void convert8BitsMMAV2To16BitsDotOperand(
Collaborator:

Let's not add it if it is not used and tested.

Contributor Author:

Done

@tongyuantongyu (Contributor Author):

> I don't understand. Could you give an example of how the MMA format differs based on the type?

You're indeed right. I was confused because there turns out to be another issue here (details reported in #3580 (comment)). Loading INT8 inputs and doing both MMAs in FP16 also gives a wrong result. convert8BitsMMAV2To16BitsDotOperand fixed the wrong order, which made me think the issue was the MMAs having different layouts.
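For context, a hypothetical Triton sketch (not taken from this PR or from #3580; the kernel name, shapes, and pointers are placeholders) of the chained-dot pattern under discussion, where the int32 accumulator of an INT8 tl.dot is cast and reused as a dot operand, which is what triggers the #mma -> #dot_operand layout conversion:

```python
import triton
import triton.language as tl

@triton.jit
def chained_int8_dot(a_ptr, b_ptr, c_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    a = tl.load(a_ptr + idx)      # int8 tile
    b = tl.load(b_ptr + idx)      # int8 tile
    c = tl.load(c_ptr + idx)      # int8 tile
    acc = tl.dot(a, b)            # INT8 MMA: int32 accumulator in #mma layout
    acc8 = acc.to(tl.int8)        # cast the accumulator back down ...
    out = tl.dot(acc8, c)         # ... and feed it as operand A (#dot_operand layout)
    tl.store(out_ptr + idx, out)  # int32 result
```

Launched with (BLOCK, BLOCK) int8 CUDA tensors (BLOCK at least 16) and an int32 output buffer, this exercises the layoutC -> layoutA path this PR targets; the FP16 variant mentioned above would instead cast the int8 tiles to tl.float16 before both dots.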

@tongyuantongyu (Contributor Author):

    FAILED hopper/test_gemm.py::test_gemm[128-128-64-4-1-4096-1-1024-False-False-True-none-float32-False-3] - AssertionError: Tensor-likes are not close!

    Mismatched elements: 1 / 4096 (0.0%)
    Greatest absolute difference: 2.0 at index (289, 0) (up to 0.001 allowed)
    Greatest relative difference: 2.0 at index (289, 0) (up to 0.01 allowed)

This failure seems flaky. I don't see a torch.manual_seed call in test_gemm.py, so maybe this is numerical instability with specific input values.
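If that is the cause, a minimal sketch of the usual remedy (hypothetical helper, not code from test_gemm.py; shapes and dtypes are placeholders) is to pin the RNG before generating inputs so any failing case is at least reproducible:

```python
import torch

def make_gemm_inputs(M, N, K, dtype=torch.float16, seed=0):
    # Seed once before creating the random operands so every run sees identical data.
    torch.manual_seed(seed)
    a = torch.randn((M, K), device="cuda", dtype=dtype)
    b = torch.randn((K, N), device="cuda", dtype=dtype)
    return a, b
```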

@ThomasRaoux (Collaborator) left a comment:

I see a big performance regression in the fp8 flash attention tutorial with this patch (run python tutorials/06-fused-attention.py on an H100 to see it). I'm not sure where it is coming from. Please fix it, and I can then take a deeper look at the changes.

@@ -3073,7 +3074,7 @@ def kernel(X, stride_xm, stride_xk, Y, stride_yk, stride_yn, W, stride_wn, strid
     z_tri = torch.as_strided(z_tri, (M, N), [1, M])

     if out_dtype == 'int8':
-        out_dtype = tl.int8
+        out_dtype = tl.int32
Collaborator:

That doesn't make sense; it will change the behavior of existing tests. If we want to test the i32 dtype, it can be set in the config?

Contributor Author:

The output type for INT8 MMA is always INT32 (https://github.com/openai/triton/blob/main/python/triton/language/semantic.py#L1358), and out_dtype is simply ignored, unlike for FP MMA (https://github.com/openai/triton/blob/main/python/triton/language/semantic.py#L1367).

I changed it to tl.int32 here just for correctness. If you feel it is necessary, I can add a check that out_dtype must be tl.int32 for tl.int8 inputs, in this or a separate PR.
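For illustration, a minimal sketch of what such a check could look like (hypothetical helper, not part of this PR or of semantic.py; it only assumes triton.language dtype equality comparisons):

```python
import triton.language as tl

def check_int8_dot_out_dtype(lhs_dtype, rhs_dtype, out_dtype):
    # int8 x int8 MMA always accumulates in int32; any other out_dtype cannot be honored,
    # so reject it instead of silently ignoring it.
    if lhs_dtype == tl.int8 and rhs_dtype == tl.int8 and out_dtype != tl.int32:
        raise ValueError(f"tl.dot with int8 inputs accumulates in int32; got out_dtype={out_dtype}")

check_int8_dot_out_dtype(tl.int8, tl.int8, tl.int32)      # passes silently
# check_int8_dot_out_dtype(tl.int8, tl.int8, tl.float16)  # would raise ValueError
```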

@tongyuantongyu (Contributor Author):

> I see a big performance regression in the fp8 flash attention tutorial with this patch

Sorry, I don't have access to an H100. I made an attempt to fix it; could you test whether it fixes the regression? If it's still there, I'll try to revert all changes to the MMAV3 part.

@ThomasRaoux (Collaborator):

> > I see a big performance regression in the fp8 flash attention tutorial with this patch
>
> Sorry, I don't have access to an H100. I made an attempt to fix it; could you test whether it fixes the regression? If it's still there, I'll try to revert all changes to the MMAV3 part.

Thanks, that seems to fix it. I'll look more at the code sequence in a little bit to understand it.
