
Add AMX support to speed up Faiss Inner-Product #535

Open · wants to merge 17 commits into main
Conversation

mellonyou

Use Intel AMX to speed up the inner-product algorithm of knowhere::BruteForce::Search(); it can bring a more than 10x performance boost.

Build parameter: use "-o with_dnnl=True/False" to enable or disable the AMX feature.
This feature depends on libdnnl.so.3; you can install it by running scripts/install_deps.sh.

Runtime parameter: to use the AMX feature, you must first set the environment variable "DNNL_ENABLE=1"; otherwise the AMX feature will not work.
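For example, a typical build-and-run flow (a sketch only; the exact conan invocation depends on your local setup, but the option and environment variable are the ones described above):

    # build with the oneDNN/AMX path compiled in
    conan build .. -o with_dnnl=True
    # opt in to the AMX path at runtime
    DNNL_ENABLE=1 ./Release/tests/ut/knowhere_tests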

@sre-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mellonyou
To complete the pull request process, please assign zhengbuqian after the PR has been reviewed.
You can assign the PR to them by writing /assign @zhengbuqian in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


mergify bot commented Apr 28, 2024

@mellonyou 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

@mellonyou
Author

issue: #541

@mellonyou mellonyou marked this pull request as ready for review May 6, 2024 02:58
@mellonyou
Author

I can't edit the labels; do I need any access permissions?

@liliu-z
Collaborator

liliu-z commented May 6, 2024

/kind enhancement


codecov bot commented May 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.59%. Comparing base (3c46f4c) to head (7b6f49a).
Report is 95 commits behind head on main.

Current head 7b6f49a differs from pull request most recent head b420761

Please upload reports for the commit b420761 to get more accurate results.

Additional details and impacted files


@@            Coverage Diff            @@
##           main     #535       +/-   ##
=========================================
+ Coverage      0   71.59%   +71.59%     
=========================================
  Files         0       67       +67     
  Lines         0     4446     +4446     
=========================================
+ Hits          0     3183     +3183     
- Misses        0     1263     +1263     

see 67 files with indirect coverage changes

BaseData::getState().store(BASE_DATA_STATE::MODIFIED);
}

void execut(float** out_f32) {
Collaborator

nit: execute?

Author

yes, it's a typo

Comment on lines 164 to 166
// inner memory bf16
bf16_md1 = dnnl::memory::desc({xrow, xcol}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);
bf16_md2 = dnnl::memory::desc({yrow, ycol}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);
Collaborator

Noob Q: why do we use bf16 here?

Author

Because AMX has native support for bf16/int8 compute, which can significantly improve performance. We have tested this, and it has little impact on accuracy.
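For reference, the f32 -> bf16 -> f32 flow looks roughly like this (a minimal sketch using the public oneDNN C++ API; the name ip_f32_via_bf16 is illustrative, not the PR's actual helper):

    #include <oneapi/dnnl/dnnl.hpp>

    // x: (nx, d) row-major queries, y: (ny, d) row-major base vectors,
    // out: (nx, ny) row-major inner products.
    void ip_f32_via_bf16(const float* x, const float* y, float* out,
                         int64_t nx, int64_t ny, int64_t d) {
        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        dnnl::stream strm(eng);

        auto x_f32_md = dnnl::memory::desc({nx, d}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::ab);
        // y is stored (ny, d); describing it as a transposed (d, ny) matrix gives x * y^T.
        auto y_f32_md = dnnl::memory::desc({d, ny}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::ba);
        auto dst_md   = dnnl::memory::desc({nx, ny}, dnnl::memory::data_type::f32, dnnl::memory::format_tag::ab);

        // bf16 descriptors with format_tag::any let oneDNN pick an AMX-friendly layout.
        auto x_bf16_md = dnnl::memory::desc({nx, d}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);
        auto y_bf16_md = dnnl::memory::desc({d, ny}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);

        auto pd = dnnl::matmul::primitive_desc(eng, x_bf16_md, y_bf16_md, dst_md);

        dnnl::memory x_f32(x_f32_md, eng, const_cast<float*>(x));
        dnnl::memory y_f32(y_f32_md, eng, const_cast<float*>(y));
        dnnl::memory x_bf16(pd.src_desc(), eng);
        dnnl::memory y_bf16(pd.weights_desc(), eng);
        dnnl::memory dst(pd.dst_desc(), eng, out);

        // Down-convert f32 -> bf16; this is the step that trades a little accuracy for speed.
        dnnl::reorder(x_f32, x_bf16).execute(strm, x_f32, x_bf16);
        dnnl::reorder(y_f32, y_bf16).execute(strm, y_f32, y_bf16);

        // bf16 x bf16 -> f32 matmul; on Sapphire Rapids this dispatches to AMX kernels.
        dnnl::matmul(pd).execute(strm, {{DNNL_ARG_SRC, x_bf16}, {DNNL_ARG_WEIGHTS, y_bf16}, {DNNL_ARG_DST, dst}});
        strm.wait();
    }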

BASE_DATA_STATE expected = BASE_DATA_STATE::MODIFIED;

if (BaseData::getState().compare_exchange_strong(expected, BASE_DATA_STATE::PREPARE)) {
    pthread_rwlock_wrlock(&rwlock);
Collaborator

Noob Q: why do we need to lock this? Is it because only one AMX instruction can run at a time?

Author

The lock is designed for the multi-thread scenario: if two threads operate on the same base dataset with different query datasets, the lock prevents the base dataset from being modified by one thread while another is working on it.

dnnl::reorder(f32_mem1, bf16_mem1).execute(engine_stream, f32_mem1, bf16_mem1);
BASE_DATA_STATE expected = BASE_DATA_STATE::MODIFIED;

if (BaseData::getState().compare_exchange_strong(expected, BASE_DATA_STATE::PREPARE)) {
Collaborator

Plz CMIIW. In the first call, expected will be BASE_DATA_STATE::MODIFIED and will be changed into BASE_DATA_STATE::PREPARE on this line, returning false. Then it will loop at line 196 forever.

Author

The state is also designed for the multi-thread scenario; the state transitions are INIT -> MODIFIED -> PREPARE -> READY. When the first thread has finished the initialization, the other threads will see the READY state and skip line 196.
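To make the intended flow concrete, here is a minimal sketch of the pattern as I read it from the excerpts above (the member layout and the spin/yield policy are assumptions, not the PR's verbatim code):

    #include <atomic>
    #include <pthread.h>

    enum class BASE_DATA_STATE { INIT, MODIFIED, PREPARE, READY };

    std::atomic<BASE_DATA_STATE> state{BASE_DATA_STATE::MODIFIED};
    pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

    void prepare_and_search() {
        BASE_DATA_STATE expected = BASE_DATA_STATE::MODIFIED;
        // Exactly one thread wins the CAS and converts the base data to bf16...
        if (state.compare_exchange_strong(expected, BASE_DATA_STATE::PREPARE)) {
            pthread_rwlock_wrlock(&rwlock);   // exclusive: base data is being rewritten
            // ... reorder the f32 base data into bf16 here ...
            state.store(BASE_DATA_STATE::READY);
            pthread_rwlock_unlock(&rwlock);
        } else {
            // ...while the other threads wait until the winner publishes READY.
            while (state.load() != BASE_DATA_STATE::READY) { /* spin or yield */ }
        }
        pthread_rwlock_rdlock(&rwlock);       // shared: concurrent searches may read together
        // ... run the AMX inner product against the prepared bf16 base data ...
        pthread_rwlock_unlock(&rwlock);
    }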

if (is_dnnl_enabled()) {
    float* res_arr = NULL;

    comput_f32bf16f32_inner_product(nx, d, ny, d, const_cast<float*>(x), const_cast<float*>(y), &res_arr);
Collaborator

Can we implement a dynamic hook like all the other SIMD code in Knowhere?

Author

We also considered following the other SIMD interfaces, but due to the implementation of AMX, it is a bit incompatible with the current interface:

  1. AMX prefers batch computation, and its library schedules multiple threads on its own.
  2. The return value is an array for the batch operation.

So if we use a dynamic hook, we may need to add a new interface for batch operations and call that interface when AMX is available (a possible shape is sketched below).
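A hypothetical batch-style hook could look like this (illustrative only; batch_ip_hook_t and knowhere_batch_inner_product are made-up names, not part of the PR):

    #include <cstddef>

    // Computes all nx * ny inner products in one call and writes them to out
    // (row-major: nx rows of ny scores), so the AMX/oneDNN backend can batch
    // the work and schedule its own threads internally.
    using batch_ip_hook_t = void (*)(size_t nx, size_t d, size_t ny,
                                     const float* x, const float* y, float* out);

    // Chosen once at startup: the AMX path when available, otherwise a SIMD
    // loop that fills the same output array.
    extern batch_ip_hook_t knowhere_batch_inner_product;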

Author

@liliu-z We are planning to port the code to adopt the dynamic hook; do you have any other suggestions?

@@ -211,30 +214,59 @@ void exhaustive_inner_product_seq_impl(
using SingleResultHandler = typename BlockResultHandler::SingleResultHandler;
int nt = std::min(int(nx), omp_get_max_threads());

#ifdef FAISS_WITH_DNNL
Collaborator

The problem here is that this code is inserted into the function that computes inner products according to a filter. So, if the filter filters out 90% of the samples, then 9 out of 10 computed distances will not be used, costing quite a lot of extra memory bandwidth.
Benchmarks are needed for this PR.


@alexanderguzhva Is the filter inside Knowhere or in Milvus?

Collaborator

@xtangxtang an external filter (in the form of a bitset), provided by Milvus

@mergify mergify bot removed the ci-passed label May 15, 2024
@mergify mergify bot removed the dco-passed label May 15, 2024
@mellonyou
Copy link
Author

  1. Ported the code to Knowhere to follow the dynamic hook interface.
  2. About the filter: I wrote a simple benchmark comparing the no-filter AMX inner product against the SIMD inner product with filters (0.1f, 0.5f, 0.9f). It shows that AMX still has a performance advantage even when the filter percentage reaches 0.9:

     filter percentage    0.1      0.5      0.9      amx (no filter)
     result (s)           0.432    0.208    0.043    0.033
     dnnl perf. boost     13.1x    6.3x     1.3x

For the code: the AMX inner-product interface is better suited to batches of vectors, and it doesn't support a filter interface. I have two ideas:

  1. The AMX inner product handles only the no-filter scenario.
  2. Add a percentage parameter to the interface; when it is less than 0.9, choose the AMX inner product (see the sketch below).
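A minimal sketch of idea 2 (hypothetical; the filter_ratio parameter is illustrative, the 0.9f threshold comes from the benchmark above, and the helper names mirror earlier excerpts in this PR):

    if (is_dnnl_enabled() && filter_ratio < 0.9f) {
        // Batch AMX/oneDNN path: compute all nx * ny scores at once,
        // then apply the bitset filter to the results.
        float* res_arr = NULL;
        comput_f32bf16f32_inner_product(nx, d, ny, d,
                                        const_cast<float*>(x), const_cast<float*>(y), &res_arr);
        // ... drop filtered-out entries from res_arr ...
    } else {
        // Per-vector SIMD path: skip filtered-out vectors before computing.
    }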

@alexanderguzhva Looking forward to your suggestions.

@alexanderguzhva
Collaborator

alexanderguzhva commented May 15, 2024

@mellonyou Could you please include a benchmark or, at least, its details?
The numbers that you've provided cannot be interpreted properly without knowing

  • the exact number of samples
  • the dimensionality
  • whether it is a single-query or a batched-query request
  • whether it is a test for this particular function or for the whole index
  • etc.

The results are potentially interesting and are definitely worth checking on my end.

@mellonyou
Author

#include "simd/distances_onednn.h"

#define MAX_LOOP 20
TEST_CASE("Test Brute Force", "[float vector]") {
using Catch::Approx;

const int64_t nb = 2000000;
const int64_t nq = 10;
const int64_t dim = 512;
const int64_t k = 100;

auto metric = GENERATE(as<std::string>{}, knowhere::metric::IP );

const auto train_ds = GenDataSet(nb, dim);
const auto query_ds = CopyDataSet(train_ds, nq);

const knowhere::Json conf = {
    {knowhere::meta::DIM, dim},
    {knowhere::meta::METRIC_TYPE, metric},
    {knowhere::meta::TOPK, k},
    {knowhere::meta::RADIUS, knowhere::IsMetricType(metric, knowhere::metric::IP) ? 10.0 : 0.99},
};

SECTION("Test Search Batch") {
 faiss::BaseData::getState().store(faiss::BASE_DATA_STATE::MODIFIED);
 struct timeval t1,t2;
 double timeuse;
 gettimeofday(&t1,NULL);

     std::vector<std::function<std::vector<uint8_t>(size_t, size_t)>> gen_bitset_funcs = {
             GenerateBitsetWithFirstTbitsSet, GenerateBitsetWithRandomTbitsSet};
     const auto bitset_percentages = {0.1f, 0.5f, 0.9f};
     for (const float percentage : bitset_percentages) {
             for (const auto& gen_func : gen_bitset_funcs) {
                     auto bitset_data = gen_func(nb, percentage * nb);
                     knowhere::BitsetView bitset(bitset_data.data(), nb);

                     for (int i = 0; i < MAX_LOOP; i++)
                     {
                             gettimeofday(&t1,NULL);

                             //    threads.emplace_back(WrapSearch, queryvar1);
                             auto res = knowhere::BruteForce::Search<knowhere::fp32>(train_ds, query_ds, conf, bitset);
                             gettimeofday(&t2,NULL);
                             timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
                             std::cout << "elpased: " << timeuse << std::endl;
                     }

             }
     }

     gettimeofday(&t2,NULL);
     timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;

     std::cout << "All thread finished." << std::endl;

    }

}

@mellonyou
Copy link
Author

@alexanderguzhva I just added this code to the unit tests as a temporary benchmark, built it with "-o with_dnnl=True", and then ran the test:
DNNL_ENABLE=0/1 ./Release/tests/ut/knowhere_tests
The test runs 20 rounds, and the results above are the averages after discarding the best 20% and the worst 20%. I ran the test on an Intel SPR platform with Ubuntu 22.04.

@alexanderguzhva
Collaborator

@mellonyou I'll take a look. Thanks!

BaseData::getState().store(BASE_DATA_STATE::MODIFIED);
}

void execut(float** out_f32) {


There is a typo.

@mellonyou
Author

Added the searchwithbuf and rangesearch interface implementations with AMX oneDNN. The related build config will be submitted to Milvus later.

@mellonyou
Author

I am trying to do a manual filter with multithreading before the AMX inner product.
@liliu-z @alexanderguzhva @godchen0212 Do you have any other opinions on the current interface implementation?
