-
Notifications
You must be signed in to change notification settings - Fork 438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AVX-512 FP16 implementation of halfvec distance functions #531
base: master
Are you sure you want to change the base?
Conversation
@nathan-bossart Would love your feedback on this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance results will be shared soon.
Looking forward to these!
src/halfutils.c
Outdated
#ifdef HAVE_AVX512FP16 | ||
TARGET_XSAVE static bool | ||
SupportsAvx512Fp16() | ||
{ | ||
unsigned int exx[4] = {0, 0, 0, 0}; | ||
unsigned int feature = (1 << 23); | ||
|
||
#if defined(HAVE__GET_CPUID) | ||
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); | ||
#elif defined(HAVE__CPUID) | ||
__cpuid(exx, 7, 0); | ||
#endif | ||
|
||
return (exx[3] & feature) == feature; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is missing a couple steps, such as checking for osxsave and verifying the ZMM registers are enabled. See SupportsAvx512Popcount()
for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reference. I'll add those checks (OSXSAVE and XCR0 control register).
src/halfutils.c
Outdated
for (; i < dim; i++) | ||
distance += HalfToFloat4(ax[i]) * HalfToFloat4(bx[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this auto-vectorized? (Same question for HalfvecL2SquaredDistanceAvx512Fp16()
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked L2SquaredDistance and InnerProduct, and it is using AVX scalar instructions, at least with gcc-12. We'll try masked vector instructions to handle the loop remainder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest update includes masked vector instructions for the loop remainder.
src/halfutils.c
Outdated
#ifdef HAVE_AVX512FP16 | ||
if (SupportsAvx512Fp16()) | ||
{ | ||
HalfvecL2SquaredDistance = HalfvecL2SquaredDistanceAvx512Fp16; | ||
HalfvecInnerProduct = HalfvecInnerProductAvx512Fp16; | ||
HalfvecCosineSimilarity = HalfvecCosineSimilarityAvx512Fp16; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: This might not need to be nested in the HALFVEC_DISPATCH
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Currently, it is taking advantage of the OSXSAVE check included with the other features, but I'll separate that.
I'll kick off some local benchmark runs to see the diffs. I have a r7i at the ready. |
@lucagiac81 I'm having issues compiling on an EC2 r7i. This is using gcc12 and clang-15. Here is some truncated output:
|
@jkatz I think clang is not applying I tested on an m7i instance (where
With gcc-12.3, and I got no errors in all cases. Can you try adding |
e390649
to
85ba2dc
Compare
Rebased on latest master |
src/halfutils.c
Outdated
SupportsAvx512Fp16() | ||
{ | ||
unsigned int exx[4] = {0, 0, 0, 0}; | ||
unsigned int feature = (1 << 23); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. feature can be defined using #DEFINE CPU_FEATURE_AVX512FP16
src/halfutils.c
Outdated
__cpuid(exx, 7, 0); | ||
#endif | ||
|
||
/* Check OS supports XSAVE */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. update comment to reflect OSXSAVE
src/halfutils.c
Outdated
return false; | ||
|
||
/* Check XMM, YMM, and ZMM registers are enabled */ | ||
if ((_xgetbv(0) & 0xe6) != 0xe6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nathan-bossart shouldn't this be _xgetbv(0) & 0xe6) == 0xe6 ? Similar comment on L187 in bitutils.c per the discussion [0]
[0] : https://www.postgresql.org/message-id/20240418210158.GA3776258%40nathanxps13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks alright as-is to me. If this check fails, we return false, so !=
looks correct.
85ba2dc
to
915d6eb
Compare
While collecting data with ANN benchmarks, we noticed a degradation in recall for some datasets (such as sift-128) when computing distances in half precision. Other datasets (such as gist-960) are not affected, and recall is matched to the existing distance functions. The existing functions (*F16c) first convert halfvec elements to single precision and execute the distance computation in single precision. So, enabling the FP16 distance functions may not be desirable in all cases. The latest update to the PR provides two implementations of the distance functions with AVX-512: one using single precision and one using half precision.
|
@lucagiac81 Thanks for the continued work. Per @nathan-bossart comment earlier, it'd be helpful to see the actual performance results. I'll try to get this to build again - last I checked I didn't have avx512fp16 available on my instance class. |
Here are some initial results
With the gist-960-euclidean dataset, so far we observe
It'd be great if you could reproduce these numbers with your setup. Please let me know if you still run into compilation issues. We'd also like to collect data with dbpedia-openai-1000k-angular as well (higher dimensions, different distance metric) , but we're running into a 403 error when downloading the dataset (similar to this report). Do you have any advice on how to run with that dataset? |
This PR adds implementations of halfvec distance functions based on the AVX-512 FP16 instruction set. The instruction set was introduced with Intel 4th Gen Intel® Xeon® Scalable processors. It supports 32x FP16 operations per instruction with 512-bit registers.
Compiler support for the new instructions was added in gcc-12 and clang-14. Those versions are minimum requirements for the AVX-512 FP16 functions to be compiled (controlled by conditional compilation). Support for the instruction set is also detected at runtime using CPUID. If not supported, the existing default or F16c functions are used.
Building was tested with
Execution of a binary compiled with gcc-12 (which includes the AVX-512 FP16 functions) was tested on
There is one open question regarding CI with GitHub Actions. The AVX-512 FP16 functions will not be tested unless a runner supports the instruction set.
Performance results will be shared soon.