Add a SIMD (AVX2) optimised vector distance function for int7 on x64 #108088

Merged

merged 20 commits into elastic:main from native-vec-linux-x64 on May 10, 2024

Conversation

ldematte
Contributor

@ldematte ldematte commented Apr 30, 2024

This PR is the complement of #106133 for x64 architectures.
While the vec native library can be compiled as-is on Windows, macOS and Linux, we deliberately chose to concentrate on Linux x64 for this first PR, in order to reduce the burden of testing, validating and debugging across multiple platforms.

The implementation uses AVX2-only instructions; faster implementations are possible with AVX-512 or VNNI, but we aimed for maximum compatibility here. Even so, preliminary micro-benchmarks (against a simple scalar C implementation) show a good speedup:

---------------------------------------------------------
Benchmark               Time             CPU   Iterations
---------------------------------------------------------
BM_dot8_scalar        485 ns          485 ns      1413763
BM_dot8_vec          31.1 ns         31.1 ns     22385018
BM_sqr8_scalar        511 ns          511 ns      1362462
BM_sqr8_vec          35.0 ns         35.0 ns     19864876

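For context, here is a minimal, self-contained sketch of an AVX2 signed int8 dot-product kernel in the spirit of this PR (the actual code in the vec library differs; function and variable names here are purely illustrative): sign-extend 16 bytes at a time to int16, then use `_mm256_madd_epi16` to multiply adjacent pairs and accumulate into int32 lanes.

```c
// Hypothetical sketch, not the PR's actual code. Compile with -mavx2.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot8_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= dims; i += 16) {
        // Load 16 int8 values from each vector and sign-extend to int16.
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        // Multiply int16 pairs and add adjacent products into int32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    // Horizontal sum of the 8 int32 lanes.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    // Scalar tail for dimensions that are not a multiple of 16.
    for (; i < dims; i++) {
        res += a[i] * b[i];
    }
    return res;
}
```
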
@ldematte ldematte requested review from a team as code owners April 30, 2024 12:43
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.15.0 labels Apr 30, 2024
@ldematte ldematte added :Core/Infra/Core Core issues without another label :Search/Search Search-related issues that do not fall into other categories and removed needs:triage Requires assignment of a team area label labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@ldematte ldematte added >enhancement test-windows Trigger CI checks on Windows test-arm Pull Requests that should be tested against arm agents and removed Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team labels Apr 30, 2024
@elasticsearchmachine
Collaborator

Hi @ldematte, I've created a changelog YAML for you.

@ChrisHegarty
Contributor

I'm going to get some perf numbers from the JMH benchmarks in the Elasticsearch repo.

@benwtrent
Member

> Even in this case, preliminary micro-benchmarks (against a simple C scalar implementation) show good speedup:

@ldematte what are the vector dimensions for this benchmark?

@ldematte
Contributor Author

ldematte commented May 2, 2024

> for Lucene's byte quantization, we can guarantee we won't have -128 as one of the values, does this change your implementation any? It will only output values [0, 127] when doing "int8" (really int7).

It's like you are reading my mind :D
My original implementation, which is faster (roughly 2x, IIRC), used unsigned values, but had problems exactly with -128 (you know, because there is no +128 in int8 :/ ).
I'll dig that out and test it with the [0, 127] range!
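
To make the effect of the [0, 127] guarantee concrete, here is a minimal, hypothetical sketch (not the code in this PR; names are illustrative) of a uint7 dot-product kernel built on `_mm256_maddubs_epi16`, which multiplies unsigned bytes by signed bytes and sums adjacent pairs with signed int16 saturation. With both operands in [0, 127] the pairwise sums are at most 2 × 127 × 127 = 32258 < 32767, so no saturation can occur; a -128 value would break the usual abs()/sign-flip trick, because +128 does not exist in int8.

```c
// Hypothetical sketch, not the PR's actual code. Assumes inputs in [0, 127].
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot7u_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 32 <= dims; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        // unsigned * signed bytes -> pairwise int16 sums; exact for 7-bit inputs.
        __m256i prod16 = _mm256_maddubs_epi16(va, vb);
        // Widen int16 pairs to int32 by multiplying with 1 and adding adjacent pairs.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(prod16, ones));
    }
    // Horizontal sum of the 8 int32 lanes.
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        res += a[i] * b[i];
    }
    return res;
}
```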

@ldematte
Contributor Author

ldematte commented May 3, 2024

An update: following the hint about "uint7", I was able to write several different flavours of the code.
I have many variants :) and benchmarked them all. The results are promising!

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot8_scalar                381 ns          381 ns      1846848
BM_dot8s_avx2                53.1 ns         53.1 ns     13098322 <-- this is the one in the PR
BM_dot7u_avx2                14.3 ns         14.3 ns     49110723
BM_dot7u_avx2_w              12.8 ns         12.8 ns     55852738
BM_dot8s_avx512              29.2 ns         29.2 ns     24490561
BM_dot7u_avx512              11.6 ns         11.6 ns     60261825
BM_dot7u_avx512_w            11.0 ns         11.0 ns     64703414
BM_dot8s_vnni                27.1 ns         27.1 ns     25831045
BM_dot7u_vnni                17.4 ns         17.4 ns     40206390
BM_dot7u_vnni_w              10.4 ns         10.4 ns     67620280
BM_dot8s_avx512_vnni         19.4 ns         19.4 ns     36114430
BM_dot8s_avx512_vnni_w       14.8 ns         14.8 ns     48369553
BM_dot7u_avx512_vnni_w       10.5 ns         10.5 ns     64487145

Legend:
- dot7u: operands in [0, 127]; dot8s: operands in [-128, 127]
- avx2, avx512, vnni, avx512_vnni: instruction set used
- _w: "wide" version (double the stride; basically a manual unroll)

It seems that for uint7 we could be good with dot7u_avx2_w (AVX2 with double the stride). That should already give a boost of ~3.7x over the version currently in this PR, pulling us well ahead of Lucene.
"Plain" AVX-512 gives us ~4.5x; still good, but maybe not enough of an improvement to make it strictly necessary.

The story is a bit different for int8, where VNNI makes a difference. I will try to figure out which processors support which instruction sets; it's really harder than it should be!

These results were run on an Amazon C7i instance (Sapphire Rapids). I will repeat them on a C7a instance (AMD Zen 4) too.
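
For readers wondering what the "_w" (wide) variants look like, here is a hedged sketch of the idea (illustrative code, not what will land in this PR): process twice the stride per iteration with two independent accumulators, so the two dependency chains can overlap and hide instruction latency.

```c
// Hypothetical sketch of a "wide" (2x unrolled) uint7 dot product; not the PR's code.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t dot7u_avx2_wide_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    const __m256i ones = _mm256_set1_epi16(1);
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 64 <= dims; i += 64) {  // 64 bytes per iteration instead of 32
        __m256i va0 = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb0 = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i va1 = _mm256_loadu_si256((const __m256i *)(a + i + 32));
        __m256i vb1 = _mm256_loadu_si256((const __m256i *)(b + i + 32));
        // Two independent accumulators keep both multiply chains busy.
        acc0 = _mm256_add_epi32(acc0, _mm256_madd_epi16(_mm256_maddubs_epi16(va0, vb0), ones));
        acc1 = _mm256_add_epi32(acc1, _mm256_madd_epi16(_mm256_maddubs_epi16(va1, vb1), ones));
    }
    __m256i acc = _mm256_add_epi32(acc0, acc1);
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        res += a[i] * b[i];
    }
    return res;
}
```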

@ldematte
Contributor Author

ldematte commented May 3, 2024

Another update, for sqr:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_sqr8s_scalar               378 ns          378 ns      1852639
BM_sqr8s_avx2                36.2 ns         36.2 ns     19343113 <-- currently in the PR
BM_sqr8s_avx2_w              36.2 ns         36.2 ns     19337344 
BM_sqr7u_avx2                18.3 ns         18.3 ns     38367910
BM_sqr7u_avx2_w              14.0 ns         14.0 ns     50056590

So it seems there is a ~2x gain for sqr too, while sticking to AVX2.
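
For reference, a minimal sketch of what an AVX2 square-distance kernel for 7-bit values can look like (hypothetical code, not necessarily this PR's implementation): widen to int16, subtract, then square-and-accumulate with `_mm256_madd_epi16`, since differences of values in [0, 127] always fit in int16.

```c
// Hypothetical sketch, not the PR's actual code.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int32_t sqr7u_avx2_sketch(const int8_t *a, const int8_t *b, size_t dims) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= dims; i += 16) {
        // Widen 16 bytes of each vector to int16, then take the difference.
        __m256i va = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i vb = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256i d = _mm256_sub_epi16(va, vb);
        // d*d summed pairwise into int32 lanes.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(d, d));
    }
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_hadd_epi32(sum, sum);
    sum = _mm_hadd_epi32(sum, sum);
    int32_t res = _mm_cvtsi128_si32(sum);
    for (; i < dims; i++) {  // scalar tail
        int32_t d2 = a[i] - b[i];
        res += d2 * d2;
    }
    return res;
}
```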

@ldematte
Contributor Author

ldematte commented May 3, 2024

For completeness, here are the same micro-benchmarks on AMD Zen 4:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_dot8_scalar                565 ns          565 ns      1238884
BM_dot8s_avx2                31.5 ns         31.5 ns     22163618
BM_dot7u_avx2                11.8 ns         11.8 ns     59135276
BM_dot7u_avx2_w              10.1 ns         10.1 ns     69159008
BM_dot8s_avx512              22.3 ns         22.3 ns     31300352
BM_dot7u_avx512              9.99 ns         9.99 ns     70275104
BM_dot7u_avx512_w            9.95 ns         9.95 ns     70140137
BM_dot8s_vnni                28.0 ns         28.0 ns     24998570
BM_dot7u_vnni                15.3 ns         15.3 ns     45628528
BM_dot7u_vnni_w              9.80 ns         9.80 ns     71438099
BM_dot8s_avx512_vnni         17.7 ns         17.7 ns     39746725
BM_dot8s_avx512_vnni_w       14.1 ns         14.1 ns     49663030
BM_dot7u_avx512_vnni_w       9.68 ns         9.67 ns     72473372
BM_sqr8s_scalar               565 ns          565 ns      1238257
BM_sqr8s_avx2                36.6 ns         36.6 ns     19149821
BM_sqr7u_avx2                19.1 ns         19.1 ns     36647360
BM_sqr7u_avx2_w              12.5 ns         12.5 ns     56013088
BM_sqr7u_avx512              11.6 ns         11.6 ns     60300342

@ldematte
Contributor Author

ldematte commented May 3, 2024

Last update: processor architecture

We use the following instance types:

GCP - N2D (n2d-highmem-8)

N2D instances run on AMD Rome (Zen 2) or AMD Milan (Zen 3), which unfortunately support only AVX2.
N2D was an "unfortunate" choice in this case; N2 instances are similar, but run on Intel (Ice Lake/Cascade Lake), which would support both AVX-512 and AVX512_VNNI/AVX512VL.

Azure - Lsv3 series (Standard_L8s_v3)

Lsv3 instances run on Intel Ice Lake, supporting both AVX-512 and AVX512_VNNI/AVX512VL.

Finally, a small table:

| Instructions | AMD                        | Intel                                   |
|--------------|----------------------------|-----------------------------------------|
| AVX-512      | Zen 4 (EPYC Genoa/Bergamo) | Cascade Lake (2nd gen Xeon Scalable)    |
| AVX512_VNNI  | Zen 4 (EPYC Genoa/Bergamo) | Cascade Lake (2nd gen Xeon Scalable)    |
| AVX-VNNI     | Zen 5                      | Sapphire Rapids (4th gen Xeon Scalable) |

AVX-VNNI is a "backport" of AVX512_VNNI to 256-bit (VEX-encoded) form, so it is the more recent of the two. On server processors, AVX512_VNNI seems more widespread: where AVX-512 is present, AVX512_VNNI/AVX512VL is supported too.

For int8, we might want to go with our fastest implementation, using VNNI instructions.

For now, for uint7, it seems that AVX2 will be OK. After #108243 is merged, I will update this PR with the latest code.

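As a sketch of how runtime dispatch between such kernels could work (purely illustrative, not the dispatch logic in this PR or in the vec library; the `dot7u_*` kernel names are assumed), one can query CPU features once and pick a function pointer:

```c
// Hypothetical dispatch sketch. Uses GCC/Clang's __builtin_cpu_supports;
// feature-name strings beyond "avx2" and "avx512f" vary by compiler version,
// so VNNI detection may need raw CPUID instead.
#include <stdint.h>
#include <stddef.h>

typedef int32_t (*dot_fn)(const int8_t *, const int8_t *, size_t);

// Assumed kernels, defined elsewhere (e.g. along the lines of the sketches above).
int32_t dot7u_scalar(const int8_t *a, const int8_t *b, size_t dims);
int32_t dot7u_avx2(const int8_t *a, const int8_t *b, size_t dims);
int32_t dot7u_avx512(const int8_t *a, const int8_t *b, size_t dims);

static dot_fn pick_dot7u(void) {
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx512f")) {
        return dot7u_avx512;   // prefer the widest supported kernel
    }
    if (__builtin_cpu_supports("avx2")) {
        return dot7u_avx2;
    }
#endif
    return dot7u_scalar;       // portable fallback
}
```
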
@ldematte
Contributor Author

ldematte commented May 8, 2024

The JMH benchmarks show a less brilliant picture:

Benchmark                                   (dims)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.dotProductLucene        1024  thrpt    5  17.361 ± 0.098  ops/us
VectorScorerBenchmark.dotProductNative        1024  thrpt    5  29.249 ± 0.346  ops/us
VectorScorerBenchmark.dotProductScalar        1024  thrpt    5   2.774 ± 0.009  ops/us
VectorScorerBenchmark.squareDistanceLucene    1024  thrpt    5  14.446 ± 0.513  ops/us
VectorScorerBenchmark.squareDistanceNative    1024  thrpt    5  30.536 ± 0.225  ops/us
VectorScorerBenchmark.squareDistanceScalar    1024  thrpt    5   2.552 ± 0.014  ops/us

We do have an improvement over Lucene (around 1.7x for dot, 2x for sqr), but not the 3x that the C micro-benchmarks suggested we might have.
@ChrisHegarty and I think we should go ahead with this anyway, and follow up with AVX-512 in a separate PR (if it gives us some additional speedup; AVX-512 is tricky, as I get different speedups on different processors and different loads due to throttling).

EDIT: this compares Lucene's AVX-512 path against the native AVX2 implementation

@ChrisHegarty ChrisHegarty changed the title Add a SIMD (AVX2) optimised vector distance function for int8 on x64 Add a SIMD (AVX2) optimised vector distance function for int7 on x64 May 8, 2024
Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM

```diff
@@ -121,7 +121,7 @@ static inline int32_t sqr7u_inner(int8_t *a, int8_t *b, size_t dims) {
 EXPORT int32_t sqr7u(int8_t* a, int8_t* b, size_t dims) {
     int32_t res = 0;
     int i = 0;
-    if (i > SQR7U_STRIDE_BYTES_LEN) {
+    if (dims > SQR7U_STRIDE_BYTES_LEN) {
```
Contributor


Oops, thank you.

Contributor Author


No prob! It drove me mad yesterday because I was seeing it run so much slower :)
I only understood what was going on after disassembling the function and seeing neither inlined vector code nor a function call...
Thanks for updating the title!

@ldematte
Contributor Author

ldematte commented May 8, 2024

@elasticmachine update branch

@ldematte
Contributor Author

ldematte commented May 10, 2024

The only CI failure is for 7.17.22 / bwc-snapshots-windows; this is a known issue (#108474) and reproduces on main without this PR, so I'm going to ignore it and merge.

@ldematte ldematte merged commit 2e0f8d0 into elastic:main May 10, 2024
24 of 27 checks passed
@ldematte ldematte deleted the native-vec-linux-x64 branch May 10, 2024 09:58