
Optimize visiting neighbors #386

Open · wants to merge 3 commits into master

Conversation

hlinnaka (Contributor)

Here are two more micro-optimizations of HnswSearchLayer for HNSW index builds.

1st commit: Add a bulk variant of AddToVisited

The idea is to move code around so that we collect all the 'hash' values into an array in a tight loop, before performing the hash table lookups. This code path causes a lot of CPU cache misses, as the elements are scattered around memory, and performing all the fetches upfront allows the CPU to schedule fetching those cache lines sooner. That's my theory of why this works, anyway :-).

This gives a 5%-10% speedup on my laptop, on HNSW index build of a subset of the 'angular-100' dataset (same test I used on my previous PRs). I'd love to hear how this performs on other systems, as this could be very dependent on CPU details.
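The gather-hashes-first pattern can be sketched as follows. This is a minimal standalone illustration, not the actual PR code: `Element`, `hash_element`, and `hash_elements_bulk` are hypothetical stand-ins for pgvector's element and visited-table types.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element type; in pgvector this would be an HnswElement
 * reached via a pointer scattered somewhere in memory. */
typedef struct Element
{
	uintptr_t ptr;
} Element;

/* Toy stand-in for the per-element hash function. */
static uint32_t
hash_element(const Element *e)
{
	uint64_t	h = (uint64_t) e->ptr * 0x9E3779B97F4A7C15ULL;

	return (uint32_t) (h >> 32);
}

/*
 * Bulk variant: compute all the hashes in a tight loop first, so the CPU
 * can issue the scattered element loads early and overlap the cache
 * misses, then probe the visited hash table with the collected hashes.
 */
static void
hash_elements_bulk(const Element *elems, size_t n, uint32_t *hashes)
{
	for (size_t i = 0; i < n; i++)
		hashes[i] = hash_element(&elems[i]);
	/* ...then probe the visited table with hashes[0..n-1]... */
}
```

The bulk function produces the same hashes as calling `hash_element` one at a time; the only difference is the order in which the memory accesses are issued.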

2nd & 3rd commits: Calculate 4 neighbor distances at a time in HNSW search

This is just a proof of concept at this stage, but it shows promising results. The idea is to have a variant of the distance function that calculates the distance from one point 'q' to 4 other points in one call. That gives the vectorized loop in the distance function more work to do in each iteration. If you think this is worthwhile, I can spend more time polishing this, adding these array variants as proper AM support functions, etc.

This gives another 10% speedup on the same test on my laptop. It could possibly be optimized further, by providing variants with different array sizes, or a variable-length version that lets the function itself vectorize in the optimal way. With some refactoring, I think we could also use this in CheckElementCloser(). This might also work well together with PR #311, but I haven't tested that.
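As a rough illustration of the one-to-four idea: a hypothetical sketch, not the PR's actual function (pgvector's real distance functions are AM support functions with different signatures). Accumulating four squared L2 distances in one shared loop gives the compiler four independent accumulator chains per iteration to vectorize.

```c
#include <stddef.h>

/*
 * Compute squared L2 distances from q to four points at once.
 * The single shared loop over the dimensions gives the auto-vectorizer
 * more independent work per iteration than four separate calls.
 */
static void
l2_distance_x4(const float *q,
			   const float *a, const float *b,
			   const float *c, const float *d,
			   size_t dim, float out[4])
{
	float		s0 = 0, s1 = 0, s2 = 0, s3 = 0;

	for (size_t i = 0; i < dim; i++)
	{
		float		qi = q[i];
		float		d0 = qi - a[i];
		float		d1 = qi - b[i];
		float		d2 = qi - c[i];
		float		d3 = qi - d[i];

		s0 += d0 * d0;
		s1 += d1 * d1;
		s2 += d2 * d2;
		s3 += d3 * d3;
	}
	out[0] = s0;
	out[1] = s1;
	out[2] = s2;
	out[3] = s3;
}
```

A search loop would call this on groups of four candidate neighbors, falling back to the ordinary one-point distance function for the remaining tail when the candidate count isn't a multiple of four.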

jkatz (Contributor) commented Dec 22, 2023

🚀 Running some serious benchmarks on these patches is still on my TODO list; I'll work to get those up and running to see how they perform under different contexts on some larger machines. Given they're focused on index building, my plan is to test:

  1. "Regression" tests using the ANN Benchmark suite to capture both performance/recall -- need to do two sets: with and without parallel build
  2. Increasing concurrent building benchmarks (discussed here)
  3. Testing with the parallel build benchmark on different large data sets (10MM + 100MM) with a focus on timing

ankane (Member) commented Jan 8, 2024

Hey @hlinnaka, thanks for more PRs! I'm currently working on in-memory parallel index builds, but will dig into this and #388 once that's finished (as it may affect the impact on build time).

Commit messages:

- Thanks to CPU cache effects (I think), it's faster to fetch all the hashes first. This gives a 5%-10% speedup in an HNSW build of a 100-dimension dataset on my laptop.
- Introduce a helper function to calculate the distances of all candidates in an array. It doesn't change much on its own, but paves the way for further optimizations in the next commit.
- In my testing, this gives a further 10% speedup in HNSW index build.
hlinnaka (Contributor, Author)

Rebased this.

3 participants