
HNSW + dead tuples: recall loss/usability issues #244

Open
alanwli opened this issue Aug 29, 2023 · 10 comments
@alanwli

alanwli commented Aug 29, 2023

This issue is a follow-up on this discussion in #239.

Scenario: A user sets ef_search to 10, expecting the top-10 results back. But if the top-10 entries in the HNSW index happen to be dead tuples (due to updates and deletes), the query returns 0 results. While this doesn't impact things like ann-benchmark, it will impact more realistic use cases where applications update and delete data.

Desired Behavior: The HNSW index returns the top-10 results whose tuples aren't dead, avoiding significant recall loss - e.g. in the scenario above, recall would be 0%. Note that ivfflat already has the desired behavior.
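The scenario can be sketched as a minimal repro (table and column names are hypothetical, and the array-to-vector cast assumes a reasonably recent pgvector; whether the query actually comes up short depends on which graph entries are dead):

```sql
-- Hypothetical setup: a table of random 3-d vectors with an HNSW index.
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding)
SELECT ARRAY[random(), random(), random()]::vector
FROM generate_series(1, 10000);
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- Delete some rows; until VACUUM runs, their index entries remain as dead tuples.
DELETE FROM items WHERE id <= 100;

-- With ef_search = 10, if the 10 nearest graph entries are all dead,
-- this can return fewer than 10 rows, or none at all.
SET hnsw.ef_search = 10;
SELECT id FROM items ORDER BY embedding <-> '[0.5,0.5,0.5]' LIMIT 10;
```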

I can see some potentially ugly workarounds:

  1. Users always perform a vacuum after every update/delete.
  • con: resource intensive
  • con: adds complexity to application logic
  2. Users perform more frequent periodic vacuums.
  • con: should theoretically minimize the problem but offers no bounded guarantee; users can still hit the above problem between the periodic vacuums
  3. Users specify a higher ef_search.
  • con: how will a user know what ef_search to specify?
  • con: ef_search can be specified up to 1k, which should theoretically minimize the problem but also offers no bounded guarantee - e.g. users can still hit the above problem when the top-1k results of a large dataset in HNSW are dead

And I think these ugly workarounds will fall apart, especially for larger datasets, so a principled fix in the HNSW index would help real-world users.
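For reference, workaround 3 is a one-line session setting (the value 100 is arbitrary; the point of the "con" above is that there's no principled way to pick it):

```sql
-- Over-provision ef_search relative to the LIMIT (default is 40, max 1000).
SET hnsw.ef_search = 100;
-- Table/column names are illustrative.
SELECT id FROM items ORDER BY embedding <-> '[0.5,0.5,0.5]' LIMIT 10;
```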

Thoughts?

@pashkinelfe
Contributor

I consider massive updates/deletes a rare use case for pgvector.

For these workload types (or for all workloads, just to be safe), it could be recommended to set ef_search with some headroom over the row count in the LIMIT clause rather than the exact value. It's easy to estimate how many dead tuples to expect for a known update/delete rate between vacuums.

So setting ef_search to a fixed (big) value is not necessary.
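One way to size that headroom is to check the dead-tuple estimate PostgreSQL already tracks for the table (table name is illustrative):

```sql
-- Dead vs. live tuple estimates maintained by the statistics collector.
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'items';

-- e.g. with LIMIT 10 and a few percent dead tuples, something like:
SET hnsw.ef_search = 20;  -- LIMIT plus a safety margin
```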

@ankane
Member

ankane commented Sep 1, 2023

Hi @alanwli, I'm not sure there's a way to address this that doesn't affect search speed, but it may be worth the tradeoff. I've pushed a solution to kill-prior-tuple branch, but need to think about it more. The two things that will affect search speed are:

  1. Marking tuples as deleted (since scan->kill_prior_tuple is the only way to get this information). We could limit this to a certain number per scan to minimize the impact.
  2. Needing to search more tuples. This wouldn't be fixed until the graph is repaired with vacuum.

@nlgtuankiet

nlgtuankiet commented Sep 2, 2023

Hi @ankane

but it may be worth the tradeoff.

As an end user, I think people would prefer accuracy and relevance over speed, as long as it doesn't get too slow.
Suppose a Google search takes 0.3 seconds to display results; then 0.6 seconds (2x the search time) is still OK.
But if it takes 2 seconds, it becomes noticeable and annoying.

@ankane
Member

ankane commented Sep 2, 2023

Here's a rough summary of the situation:

As elements get updated/deleted, either speed or recall has to give until the graph can be repaired (which currently happens on vacuum, as it's expensive). To guarantee results in every case, it's possible the entire graph would need to be visited (for instance, after deleting 99% of rows), which could be extremely slow (and likely cause SELECTs to tie up the database CPU).

With the current approach, speed will be consistent across queries, but recall will give.

The branch now gives up speed for recall by increasing ef_search up to 2x. This means that queries can slow down to a certain point, and then recall will give.

The impact either approach has on recall and speed will be different for each dataset.

Also, by default, tables are autovacuumed when 20% of rows (+ 50) change, which should keep the graph in a healthy state in many cases with either approach.
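The 20% (+ 50) default corresponds to PostgreSQL's autovacuum storage parameters, which can be tightened per table for update-heavy vector workloads (the 0.02 value is illustrative, not a recommendation from this thread):

```sql
-- Trigger autovacuum after 2% of rows (+ 50) change instead of 20%,
-- so the HNSW graph is repaired more frequently.
ALTER TABLE items SET (
  autovacuum_vacuum_scale_factor = 0.02,  -- default 0.2
  autovacuum_vacuum_threshold = 50        -- default 50
);
```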

@nlgtuankiet

So,

On master branch:
ef_search = 10
-> 8 rows returned (2 are deleted or updated and get filtered out)

on kill-prior-tuple branch:
ef_search = 10
-> 10 rows returned (8 from the initial search, plus 2 more from extending the search up to 2x ef_search?)

Do I understand it correctly?

@alanwli
Author

alanwli commented Sep 5, 2023

Hi @ankane - thanks for looking into a solution for the problem. I agree with the discussion thus far that this is yet another speed vs. recall tradeoff, and likely one that many other ANN implementations don't have to worry about, since they don't care about transactional semantics. Do you have a sense of how much overhead your kill_prior_tuple solution would incur?

Hi @alanwli, I'm not sure there's a way to address this that doesn't affect search speed, but it may be worth the tradeoff. I've pushed a solution to kill-prior-tuple branch, but need to think about it more. The two things that will affect search speed are:

  1. Marking tuples as deleted (since scan->kill_prior_tuple is the only way to get this information). We could limit this to a certain number per scan to minimize the impact.
  2. Needing to search more tuples. This wouldn't be fixed until the graph is repaired with vacuum.

@MMeent

MMeent commented Sep 7, 2023

1. Marking tuples as deleted (since `scan->kill_prior_tuple` is the only way to get this information).

The built-in btree AM does some optimistic cleanup for non-hot updates where the indexed value hasn't changed, and does this during index tuple insertion (since PostgreSQL 14).

Might you be able to do something similar? The underlying assumption of the optimization is "we've just had a non-hot update, there might be more non-hot updated tuples, and those other tuples may already be dead, so let's try to find some dead non-HOT-updated tuples to clean up".

Ref: _bt_bottomupdel_pass

@JimmyLv

JimmyLv commented Oct 11, 2023

When I run create index on documents using hnsw (embedding vector_ip_ops); on Supabase, based on this doc https://supabase.com/blog/increase-performance-pgvector-hnsw
I also hit this issue: hnsw graph no longer fits into maintenance_work_mem after 9866 tuples

@juanandreas

Has there been a solution for this? I am getting the following notice as well. The index creation also takes hours and fails. How can we scale the parameters as our data grows 10-20x?

CREATE INDEX ON table USING hnsw (vector_column_name vector_ip_ops) WITH (m=16, ef_construction = 32);

NOTICE:  hnsw graph no longer fits into maintenance_work_mem after 313246 tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.
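Per the HINT, a session-level sketch of the suggested fix (the memory value is illustrative; keep it below available RAM, and parallel build support assumes a recent pgvector):

```sql
-- Give the build enough memory for the graph to fit.
SET maintenance_work_mem = '8GB';
-- Parallel workers can also speed up the build on recent pgvector versions.
SET max_parallel_maintenance_workers = 7;
CREATE INDEX ON "table" USING hnsw (vector_column_name vector_ip_ops)
  WITH (m = 16, ef_construction = 32);
```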

@ankane
Member

ankane commented May 1, 2024

That notice is not related to this issue. Please see the docs on index build time for how to resolve.
