FP8 rowwise scaling #125204

drisspg · 2024-04-30T00:31:51Z

Summary

This pull request introduces an fp8 row-scaling kernel as an optional implementation for scaled_mm. The kernel selection is based on the scaling tensors of the inputs. For inputs x and y of shape [M, K] and [K, N] respectively, the following conditions must be met:

x's scale should be a 1-dimensional tensor of length M.
y's scale should be a 1-dimensional tensor of length N.
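To make the selection rule concrete, here is a minimal sketch of the shape check, assuming the dispatch only looks at the dimensionality and length of each scale tensor. The helper name `is_rowwise_scaling` and its signature are hypothetical and are not taken from the PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's code: returns true when the scale shapes
// match the rowwise layout described above for x:[M, K] and y:[K, N].
bool is_rowwise_scaling(
    int64_t M,
    int64_t N,
    const std::vector<int64_t>& x_scale_shape,
    const std::vector<int64_t>& y_scale_shape) {
  // x needs one scale per row: a 1-D tensor of length M.
  const bool x_ok = x_scale_shape.size() == 1 && x_scale_shape[0] == M;
  // y needs one scale per column: a 1-D tensor of length N.
  const bool y_ok = y_scale_shape.size() == 1 && y_scale_shape[0] == N;
  return x_ok && y_ok;
}
```

With M = 4 and N = 8, scale shapes `{4}` and `{8}` would select the rowwise path, while empty (scalar) shapes would not.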

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for y are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
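The following is a plain float reference for what rowwise scaling computes. It is a sketch under stated assumptions, not the CUTLASS kernel: it assumes that "TN" means x is row-major and y is column-major (so y is held in memory as its transpose, shape [N, K]), and that each output element is the unscaled dot product multiplied by both scales. The function name `rowwise_scaled_mm_ref` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Assumed reference semantics (float instead of fp8):
//   out[i][j] = x_scale[i] * y_scale[j] * sum_k x[i][k] * y[k][j]
// y is passed as y_t, its transpose stored row-major with shape [N, K],
// so y_scale[j] applies to one contiguous stored row of y_t.
std::vector<float> rowwise_scaled_mm_ref(
    const std::vector<float>& x,        // M * K values, row-major
    const std::vector<float>& y_t,      // N * K values, row-major
    const std::vector<float>& x_scale,  // M values
    const std::vector<float>& y_scale,  // N values
    std::size_t K) {
  const std::size_t M = x_scale.size();
  const std::size_t N = y_scale.size();
  std::vector<float> out(M * N, 0.0f);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += x[i * K + k] * y_t[j * K + k];
      }
      // Both scales are applied once per output element.
      out[i * N + j] = acc * x_scale[i] * y_scale[j];
    }
  }
  return out;
}
```

Seen this way, both scale vectors index rows of the stored operands, which is consistent with calling the scheme "rowwise" for both inputs.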

The following two PRs were required to enable local builds:

Todo

We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace sm_90 with sm_90a?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

ifdef

I tried to use: #if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900 to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.
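Two properties of the gate quoted above may explain the trouble. `__CUDA_ARCH__` is only defined during nvcc's device compilation passes, so host code guarded by it is never compiled, and for sm_90 its value is 900, which a strict `> 900` comparison excludes. One possible alternative, sketched here under those assumptions, is to gate the build on host-visible macros only and defer the architecture check to runtime. The helper `device_supports_rowwise_fp8` is hypothetical, and the choice of exactly compute capability 9.0 is an assumption rather than something the PR states.

```cpp
// Sketch only: gate on macros that are visible in the host pass as well.
// CUDA_VERSION comes from <cuda.h>; USE_ROCM is the macro quoted above.
// Without the CUDA headers this simply leaves the macro undefined.
#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000
#define BUILD_ROWWISE_FP8_KERNEL
#endif

// Hypothetical runtime guard: the caller passes the major/minor compute
// capability reported by the device. Assumes the kernel targets 9.0 only.
bool device_supports_rowwise_fp8(int major, int minor) {
  return major == 9 && minor == 0;
}
```

The design choice is to let the preprocessor answer "can this toolkit build the kernel?" and let the runtime check answer "can this device run it?".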

pytorch-bot · 2024-04-30T00:31:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125204

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 1 Unrelated Failure

As of commit 73b3a39 with merge base 11c2d12:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
test_public_bindings.py::TestPublicBindings::test_correct_module_names

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg · 2024-05-02T02:06:20Z

aten/src/ATen/native/cuda/RowwiseScaledMM.cu

+
+#include <c10/core/ScalarType.h>
+#include <cutlass/trace.h>
+// TODO we arent actually linking against cudaruntime, probably need to get this


removing this header include appears to work for me locally

drisspg · 2024-05-02T02:08:40Z

aten/src/ATen/native/cuda/RowwiseScaledMM.cu

+#define BUILD_ROWWISE_FP8_KERNEL
+#endif
+
+CUresult CUDAAPI cuTensorMapEncodeTiled(CUtensorMap *tensorMap, CUtensorMapDataType tensorDataType, cuuint32_t tensorRank, void *globalAddress, const cuuint64_t *globalDim, const cuuint64_t *globalStrides, const cuuint32_t *boxDim, const cuuint32_t *elementStrides, CUtensorMapInterleave interleave, CUtensorMapSwizzle swizzle, CUtensorMapL2promotion l2Promotion, CUtensorMapFloatOOBfill oobFill) {


When trying to mark this static, I get the following error:

me/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu:34:17: error: ‘CUresult cuTensorMapEncodeTiled(CUtensorMap*, CUtensorMapDataType, cuuint32_t, void*, const cuuint64_t*, const cuuint64_t*, const cuuint32_t*, const cuuint32_t*, CUtensorMapInterleave, CUtensorMapSwizzle, CUtensorMapL2promotion, CUtensorMapFloatOOBfill)’ was declared ‘extern’ and later ‘static’ [-fpermissive]
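The error follows from C++ linkage rules: a function already declared with external linkage in a header cannot be redeclared as `static`. A shadowing definition therefore has to keep external linkage. The sketch below illustrates that pattern with stand-in names only (`driver_encode`, `DriverTable`, and `driver_table` are hypothetical, and the `int` signature is a placeholder for the real driver types); it forwards the call to a function pointer that is assumed to be resolved at runtime.

```cpp
// Stand-in for a driver header: the symbol has external linkage.
extern "C" int driver_encode(int rank);

// Hypothetical table of entry points resolved at runtime.
struct DriverTable {
  int (*encode)(int rank) = nullptr;
};

inline DriverTable& driver_table() {
  static DriverTable table;
  return table;
}

// The shadowing definition matches the earlier declaration (no `static`)
// and forwards to the resolved pointer when one has been installed.
extern "C" int driver_encode(int rank) {
  auto fn = driver_table().encode;
  return fn ? fn(rank) : -1;  // -1 stands in for an error code
}
```

Whether this is the right way to spoof the symbol for this compilation unit is exactly the open question raised in the PR description; the sketch only shows a form that the compiler accepts.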

jianyuh · 2024-05-07T00:27:46Z

aten/src/ATen/native/cuda/RowwiseScaledMM.cu

+}
+
+namespace at::cuda::detail {
+void f8f8bf16_rowwise(


We recently open sourced this op in FBGEMM (https://github.com/pytorch/FBGEMM/blob/39b655a5ad3933042fbec439d00894068f453932/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions.cu#L1110), originally from @jwfromm. Do you plan to also add the related quantize routine in PyTorch core (e.g., https://github.com/pytorch/FBGEMM/blob/39b655a5ad3933042fbec439d00894068f453932/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu#L919)?

drisspg · 2024-05-09

I am not currently planning to add the accompanying quantization ops, as we would likely rely on Inductor to generate this casting code.

drisspg force-pushed the add-row-wise-scaling branch 7 times, most recently from 54a84cc to dac6a96 Compare May 2, 2024 02:00

drisspg mentioned this pull request May 2, 2024

Allow sm90a in TORCH_CUDA_ARCH_LIST #125413

Closed

drisspg force-pushed the add-row-wise-scaling branch from dac6a96 to 110261b Compare May 2, 2024 18:31

drisspg requested a review from malfet May 2, 2024 19:27

drisspg force-pushed the add-row-wise-scaling branch 6 times, most recently from 77638d7 to 7d9bc17 Compare May 16, 2024 03:41

Enable fp8 rowwise scaling kernel on cuda

73b3a39

drisspg force-pushed the add-row-wise-scaling branch from 7d9bc17 to 73b3a39 Compare May 20, 2024 21:51
