Add a SIMD (AVX2) optimised vector distance function for int7 on x64 #108088

Merged

merged 20 commits into elastic:main from native-vec-linux-x64 on May 10, 2024

Conversation

ldematte
Contributor

@ldematte ldematte commented Apr 30, 2024

This PR is the complement of #106133 for x64 architectures.
While the vec native library can be compiled as-is on Windows, macOS and Linux, we deliberately chose to concentrate on Linux x64 for this first PR, in order to reduce the burden of testing, validating and debugging across multiple platforms.

The implementation uses AVX2-only instructions; faster implementations are possible with AVX-512 or VNNI, but we aimed for maximum compatibility here. Even so, preliminary micro-benchmarks (against a simple scalar C implementation) show a good speedup:

---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_dot8_scalar        485 ns          485 ns      1413763
BM_dot8_vec          31.1 ns         31.1 ns     22385018
BM_sqr8_scalar        511 ns          511 ns      1362462
BM_sqr8_vec          35.0 ns         35.0 ns     19864876

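For context, here is a minimal, self-contained sketch of an AVX2 signed int8 dot-product kernel in the spirit of this PR (the actual code in the vec library differs; function and variable names here are purely illustrative): sign-extend 16 bytes at a time to int16, then use `_mm256_madd_epi16` to multiply adjacent pairs and accumulate into int32 lanes.

```c
// Hypothetical sketch, not the PR's actual code. Compile with -mavx2.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot8_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= dims; i += 16) {
        // Load 16 int8 values from each vector and sign-extend to int16.
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        // Multiply int16 pairs and add adjacent products into int32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // Horizontal sum of the 8 int32 lanes.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    // Scalar tail for dimensions that are not a multiple of 16.
    for (; i < dims; i++) {
        res += a[i] * b[i];
    }
    return res;
}
```
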
@ldematte ldematte requested review from a team as code owners April 30, 2024 12:43
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.15.0 labels Apr 30, 2024
@ldematte ldematte added :Core/Infra/Core Core issues without another label :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ldematte ldematte added >enhancement test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents and removed Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ChrisHegarty
Contributor

I'm going to get some perf numbers from the JMH benchmarks in the Elasticsearch repo.

@benwtrent
Member

> Even in this case, preliminary micro-benchmarks (against a simple C scalar implementation) show good speedup:

@ldematte what are the vector dimensions for this benchmark?

@ldematte
Contributor Author

ldematte commented May 2, 2024

> for Lucene's byte quantization, we can guarantee we won't have -128 as one of the values, does this change your implementation any? It will only output values [0, 127] when doing "int8" (really int7).

It's like you are reading my mind :D
My original implementation, which is faster (roughly 2x, IIRC), used unsigned values, but had problems exactly with -128 (you know, because there is no +128 in int8 :/ ).
I'll dig that out and test it with the [0, 127] range!
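
To make the effect of the [0, 127] guarantee concrete, here is a minimal, hypothetical sketch (not the code in this PR; names are illustrative) of a uint7 dot-product kernel built on `_mm256_maddubs_epi16`, which multiplies unsigned bytes by signed bytes and sums adjacent pairs with signed int16 saturation. With both operands in [0, 127] the pairwise sums are at most 2 × 127 × 127 = 32258 < 32767, so no saturation can occur; a -128 value would break the usual abs()/sign-flip trick, because +128 does not exist in int8.

```c
// Hypothetical sketch, not the PR's actual code. Assumes inputs in [0, 127].
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot7u_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= dims; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        // unsigned * signed bytes -> pairwise int16 sums; exact for 7-bit inputs.
        __m256i prod16 = _mm256_maddubs_epi16(va, vb);
        // Widen int16 pairs to int32 by multiplying with 1 and adding adjacent pairs.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(prod16, ones));
    }
    // Horizontal sum of the 8 int32 lanes.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        res += a[i] * b[i];
    }
    return res;
}
```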

@ldematte
Contributor Author

ldematte commented May 3, 2024

An update: following the hint about "uint7", I was able to write several different flavours of the code.
I have many variants :) and benchmarked them all. The results are promising!

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot8_scalar                381 ns          381 ns      1846848
BM_dot8s_avx2                53.1 ns         53.1 ns     13098322 <-- this is the one in the PR
BM_dot7u_avx2                14.3 ns         14.3 ns     49110723
BM_dot7u_avx2_w              12.8 ns         12.8 ns     55852738
BM_dot8s_avx512              29.2 ns         29.2 ns     24490561
BM_dot7u_avx512              11.6 ns         11.6 ns     60261825
BM_dot7u_avx512_w            11.0 ns         11.0 ns     64703414
BM_dot8s_vnni                27.1 ns         27.1 ns     25831045
BM_dot7u_vnni                17.4 ns         17.4 ns     40206390
BM_dot7u_vnni_w              10.4 ns         10.4 ns     67620280
BM_dot8s_avx512_vnni         19.4 ns         19.4 ns     36114430
BM_dot8s_avx512_vnni_w       14.8 ns         14.8 ns     48369553
BM_dot7u_avx512_vnni_w       10.5 ns         10.5 ns     64487145

Legend:
- dot7u: operands in [0, 127]; dot8s: operands in [-128, 127]
- avx2, avx512, vnni, avx512_vnni: instruction set used
- _w: "wide" version (double the stride; basically a manual unroll)

It seems that for uint7 we could be good with dot7u_avx2_w (AVX2 with double the stride). That should already give a boost of ~3.7x over the version currently in this PR, pulling us well ahead of Lucene.
"Plain" AVX-512 gives us ~4.5x; still good, but maybe not enough of an improvement to make it strictly necessary.

The story is a bit different for int8, where VNNI makes a difference. I will try to figure out which processors support which instruction sets; it's really harder than it should be!

These results were run on an Amazon C7i instance (Sapphire Rapids). I will repeat them on a C7a instance (AMD Zen 4) too.
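
For readers wondering what the "_w" (wide) variants look like, here is a hedged sketch of the idea (illustrative code, not what will land in this PR): process twice the stride per iteration with two independent accumulators, so the two dependency chains can overlap and hide instruction latency.

```c
// Hypothetical sketch of a "wide" (2x unrolled) uint7 dot product; not the PR's code.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot7u_avx2_wide_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 64 <= dims; i += 64) {  // 64 bytes per iteration instead of 32
        __m256i va0 = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb0 = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i va1 = _mm256_loadu_si256((const __m256i *)(a + i + 32));
        __m256i vb1 = _mm256_loadu_si256((const __m256i *)(b + i + 32));
        // Two independent accumulators keep both multiply chains busy.
        acc0 = _mm256_add_epi32(acc0, _mm256_madd_epi16(_mm256_maddubs_epi16(va0, vb0), ones));
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(_mm256_maddubs_epi16(va1, vb1), ones));
    }
    __m256i acc = _mm256_add_epi32(acc0, acc1);
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        res += a[i] * b[i];
    }
    return res;
}
```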

@ldematte
Contributor Author

ldematte commented May 3, 2024

Another update, for sqr:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_sqr8s_scalar               378 ns          378 ns      1852639
BM_sqr8s_avx2                36.2 ns         36.2 ns     19343113 <-- currently in the PR
BM_sqr8s_avx2_w              36.2 ns         36.2 ns     19337344 
BM_sqr7u_avx2                18.3 ns         18.3 ns     38367910
BM_sqr7u_avx2_w              14.0 ns         14.0 ns     50056590

So it seems there is a ~2x gain for sqr too, while sticking to AVX2.
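
For reference, a minimal sketch of what an AVX2 square-distance kernel for 7-bit values can look like (hypothetical code, not necessarily this PR's implementation): widen to int16, subtract, then square-and-accumulate with `_mm256_madd_epi16`, since differences of values in [0, 127] always fit in int16.

```c
// Hypothetical sketch, not the PR's actual code.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t sqr7u_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= dims; i += 16) {
        // Widen 16 bytes of each vector to int16, then take the difference.
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256i d = _mm256_sub_epi16(va, vb);
        // d*d summed pairwise into int32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(d, d));
    }
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        int32_t d2 = a[i] - b[i];
        res += d2 * d2;
    }
    return res;
}
```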

@ldematte
Contributor Author

ldematte commented May 3, 2024

For completeness, here are the same micro-benchmarks on AMD Zen 4:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot8_scalar                565 ns          565 ns      1238884
BM_dot8s_avx2                31.5 ns         31.5 ns     22163618
BM_dot7u_avx2                11.8 ns         11.8 ns     59135276
BM_dot7u_avx2_w              10.1 ns         10.1 ns     69159008
BM_dot8s_avx512              22.3 ns         22.3 ns     31300352
BM_dot7u_avx512              9.99 ns         9.99 ns     70275104
BM_dot7u_avx512_w            9.95 ns         9.95 ns     70140137
BM_dot8s_vnni                28.0 ns         28.0 ns     24998570
BM_dot7u_vnni                15.3 ns         15.3 ns     45628528
BM_dot7u_vnni_w              9.80 ns         9.80 ns     71438099
BM_dot8s_avx512_vnni         17.7 ns         17.7 ns     39746725
BM_dot8s_avx512_vnni_w       14.1 ns         14.1 ns     49663030
BM_dot7u_avx512_vnni_w       9.68 ns         9.67 ns     72473372
BM_sqr8s_scalar               565 ns          565 ns      1238257
BM_sqr8s_avx2                36.6 ns         36.6 ns     19149821
BM_sqr7u_avx2                19.1 ns         19.1 ns     36647360
BM_sqr7u_avx2_w              12.5 ns         12.5 ns     56013088
BM_sqr7u_avx512              11.6 ns         11.6 ns     60300342

@ldematte
Contributor Author

ldematte commented May 3, 2024

Last update: processor architecture

We use the following instance types:

GCP - N2D (n2d-highmem-8)

N2D instances run on AMD Rome (Zen 2) or AMD Milan (Zen 3), which unfortunately support only AVX2.
N2D was an "unfortunate" choice in this case; N2 instances are similar, but run on Intel (Ice Lake/Cascade Lake), which would support both AVX-512 and AVX512_VNNI/AVX512VL.

Azure - Lsv3 series (Standard_L8s_v3)

Lsv3 instances run on Intel Ice Lake, supporting both AVX-512 and AVX512_VNNI/AVX512VL.

Finally, a small table:

| Instructions | AMD                        | Intel                                   |
|--------------|----------------------------|-----------------------------------------|
| AVX-512      | Zen 4 (EPYC Genoa/Bergamo) | Cascade Lake (2nd gen Xeon Scalable)    |
| AVX512_VNNI  | Zen 4 (EPYC Genoa/Bergamo) | Cascade Lake (2nd gen Xeon Scalable)    |
| AVX-VNNI     | Zen 5                      | Sapphire Rapids (4th gen Xeon Scalable) |

AVX-VNNI is a "backport" of AVX512_VNNI to 256-bit (VEX-encoded) form, so it is the more recent of the two. On server processors, AVX512_VNNI seems more widespread: where AVX-512 is present, AVX512_VNNI/AVX512VL is supported too.

For int8, we might want to go with our fastest implementation, using VNNI instructions.

For now, for uint7, it seems that AVX2 will be OK. After #108243 is merged, I will update this PR with the latest code.

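As a sketch of how runtime dispatch between such kernels could work (purely illustrative, not the dispatch logic in this PR or in the vec library; the `dot7u_*` kernel names are assumed), one can query CPU features once and pick a function pointer:

```c
// Hypothetical dispatch sketch. Uses GCC/Clang's __builtin_cpu_supports;
// feature-name strings beyond "avx2" and "avx512f" vary by compiler version,
// so VNNI detection may need raw CPUID instead.
#include <stdint.h>
#include <stddef.h>

typedef int32_t (*dot_fn)(const int8_t *, const int8_t *, size_t);

// Assumed kernels, defined elsewhere (e.g. along the lines of the sketches above).
int32_t dot7u_scalar(const int8_t *a, const int8_t *b, size_t dims);
int32_t dot7u_avx2(const int8_t *a, const int8_t *b, size_t dims);
int32_t dot7u_avx512(const int8_t *a, const int8_t *b, size_t dims);

static dot_fn pick_dot7u(void) {
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx512f")) {
        return dot7u_avx512;   // prefer the widest supported kernel
    }
    if (__builtin_cpu_supports("avx2")) {
        return dot7u_avx2;
    }
#endif
    return dot7u_scalar;       // portable fallback
}
```
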
@ldematte
Contributor Author

ldematte commented May 8, 2024

The JMH benchmarks show a less brilliant picture:

Benchmark                                   (dims)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  17.361 ± 0.098  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  29.249 ± 0.346  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   2.774 ± 0.009  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  14.446 ± 0.513  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  30.536 ± 0.225  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.552 ± 0.014  ops/us

We do have an improvement over Lucene (around 1.7x for dot, 2x for sqr), but not the 3x that the C micro-benchmarks suggested we might have.
@ChrisHegarty and I think we should go ahead with this anyway, and follow up with AVX-512 in a separate PR (if it gives us some additional speedup; AVX-512 is tricky, as I get different speedups on different processors and different loads due to throttling).

EDIT: this compares Lucene's AVX-512 path against the native AVX2 implementation

@ChrisHegarty ChrisHegarty changed the title Add a SIMD (AVX2) optimised vector distance function for int8 on x64 Add a SIMD (AVX2) optimised vector distance function for int7 on x64 May 8, 2024
Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

```diff
@@ -121,7 +121,7 @@ static inline int32_t sqr7u_inner(int8_t *a, int8_t *b, size_t dims) {
 EXPORT int32_t sqr7u(int8_t* a, int8_t* b, size_t dims) {
     int32_t res = 0;
     int i = 0;
-    if (i > SQR7U_STRIDE_BYTES_LEN) {
+    if (dims > SQR7U_STRIDE_BYTES_LEN) {
```
Contributor


Oops, thank you.

Contributor Author


No prob! It drove me mad yesterday because I was seeing it run so much slower :)
I only understood what was going on after disassembling the function and seeing neither inlined vector code nor a function call...
Thanks for updating the title!

@ldematte
Contributor Author

ldematte commented May 8, 2024

@elasticmachine update branch

@ldematte
Contributor Author

ldematte commented May 10, 2024

The only CI failure is for 7.17.22 / bwc-snapshots-windows; this is a known issue (#108474) and reproduces on main without this PR, so I'm going to ignore it and merge.

@ldematte ldematte merged commit 2e0f8d0 into elastic:main May 10, 2024
24 of 27 checks passed
@ldematte ldematte deleted the native-vec-linux-x64 branch May 10, 2024 09:58