Add a SIMD (AVX2) optimised vector distance function for int7 on x64 #108088
Conversation
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @ldematte, I've created a changelog YAML for you.
I'm going to get some perf numbers from the JMH benchmarks in the Elasticsearch repo.
@ldematte what are the vector dimensions for this benchmark?
It's like you are reading my mind :D
An update: following the hint on "uint7", I was able to write different flavours of the code.
It seems that for uint7 we could be good with AVX2 alone. The story is a bit different for int8, where VNNI makes a difference. I will try to figure out which processor supports which instruction set -- it's harder than it should be! These results were run on an Amazon C7i instance (Sapphire Rapids). I will repeat them on a C7a instance (AMD Zen4) too.
Another update, for sqr:
So it seems there is a 2x speedup for sqr too, sticking to AVX2.
For completeness, the same micro-benchmarks on AMD Zen4:
Last update on processor architecture. We use the following instance types:
- GCP: N2D (n2d-highmem-8). N2D runs on AMD Rome (Zen2) or AMD Milan (Zen3).
- Azure: Lsv3 series (Standard_L8s_v3). Lsv3 runs on Intel Ice Lake, which supports both AVX-512 and AVX512_VNNI/AVX512VL.

Finally, a small table:
AVX-VNNI is a "backport" of AVX512_VNNI to CPUs without AVX-512, so it only appears on more recent parts. On server processors, AVX512_VNNI seems more widespread: where AVX-512 is present, AVX512_VNNI/AVX512VL is supported too. For now, with
JMH benchmarks show a less brilliant picture:
We do have an improvement over Lucene (around 1.7x for dot, 2x for sqr), but not the 3x that the C micro-benchmarks suggested we might have. EDIT: this is Lucene AVX-512 against native AVX2.
LGTM
@@ -121,7 +121,7 @@ static inline int32_t sqr7u_inner(int8_t *a, int8_t *b, size_t dims) {
 EXPORT int32_t sqr7u(int8_t* a, int8_t* b, size_t dims) {
     int32_t res = 0;
     int i = 0;
-    if (i > SQR7U_STRIDE_BYTES_LEN) {
+    if (dims > SQR7U_STRIDE_BYTES_LEN) {
Oops, thank you.
No prob! It drove me mad yesterday because I was seeing it running so much slower :)
I understood what was going on only after disassembling the function and seeing no inline vector code nor function call...
Thanks for updating the title!
Co-authored-by: Chris Hegarty <[email protected]>
@elasticmachine update branch
The only failure of CI is for
This PR is the complement of #106133 for x64 architectures.
While the vec native library can be compiled as-is on Windows, macOS and Linux, we deliberately chose to concentrate on Linux x64 for this first PR, in order to reduce the burden of testing, validating and debugging across multiple platforms.
The implementation uses AVX2-only instructions; faster implementations are possible with AVX-512 or VNNI, but we aimed for maximum compatibility here. Even so, preliminary micro-benchmarks (against a simple C scalar implementation) show good speedup: