Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Setting search defaults as an index option #235

Open
jkatz opened this issue Aug 22, 2023 · 8 comments
Open

RFC: Setting search defaults as an index option #235

jkatz opened this issue Aug 22, 2023 · 8 comments

Comments

@jkatz
Copy link
Contributor

jkatz commented Aug 22, 2023

Use cases

There are cases where a user wants to set a default search value (e.g. ivfflat.probes, hnsw.ef_search) value on a per-index basis, for example:

  1. Data is partitioned across multiple tables (integer-based partition key, not related to vector) with indexes on each table. This avoids setting the value on a per-query basis
  2. User has a single index, but doesn't want to set the search values on a per-query basis

Proposal

The proposal is to add new index options on the index AMs:

  1. ivfflat: default_probes (default: NULL)
  2. hnsw: default_ef_search (default: NULL)

PostgreSQL tracks when a GUC is changed. For this case, we can use the PGC_S_SESSION indicator to determine that GUC is changed in the database sessions and therefore we do not need to consult the index option.

The proposal would then use this heuristic:

if default_probes/default_ef_search is NULL
    use value from `ivfflat.probes`/ `hnsw.ef_search`
else
  if `ivfflat.probes`/`hnsw.ef_search` source is PGC_S_SESSION, use value of `ivfflat.probes`/`hnsw.ef_search`
  else use value from `default_probes`/`default_ef_search`

Alternatives

With additional work, we could set "smarter" automatic defaults for ivfflat.probes/hnsw.ef_search, but folks may still want to set their own index-specific defaults based on the above use cases.

Seeking feedback on this proposal.

@GPF199541
Copy link

GPF199541 commented Aug 23, 2023

support select * from emb_table where xx with (ivfflat.probes=10)
Is this more convenient and flexible?

@jkatz
Copy link
Contributor Author

jkatz commented Aug 23, 2023

@GPF199541 I'm not following, today you can do:

SET ivfflat.probes TO 10;
SELECT * FROM emb_table WHERE ...

What this proposes is setting a default for using the index, so if you plan to use ivfflat.probes to 10 for all queries that use the index, you just write:

SELECT * FROM emb_table ... ORDER BY ...

and pgvector would use 10 probes for the search if you created the index with default_probes to 10.

@pavel-alexeichik
Copy link

pavel-alexeichik commented Aug 23, 2023

I believe something like
select * from emb_table where xx with (ivfflat.probes=10)
also makes sense, as ivfflat.probes in that case is query-specific rather than connection-specific.

But having a default value for each index as @jkatz proposed would be the most convenient, as otherwise different values of ivfflat.probes for each separate index will need to be stored somewhere else (probably, in a separate table or some memory cache).

I also want to emphasize that having connection-specific ivfflat.probes can be a problem when connections are shared by different clients and requests with different ivfflat.probes need to be run in parallel.
For example,

SET ivfflat.probes TO 10;
SELECT * FROM emb_table_1 WHERE ..
SET ivfflat.probes TO 20;
SELECT * FROM emb_table_2 WHERE ..

can become

SET ivfflat.probes TO 10;
SET ivfflat.probes TO 20;
SELECT * FROM emb_table_1 WHERE ..
SELECT * FROM emb_table_2 WHERE ..

@pashkinelfe
Copy link
Contributor

pashkinelfe commented Sep 1, 2023

Can the syntax like
select * from emb_table where xx with (ivfflat.probes=10) as mentioned in a comment above
be implemented inside the extension at all? Doesn't it need modification of PG parser/planner, which is external relative to index.

As for probes/ef_search as an index option I think it's a good idea.

@xfalcox
Copy link

xfalcox commented Jan 11, 2024

@jkatz is this something you still want to pursue? This would be great to simplify my use case at @discourse where we automated the index creation, but setting this parameter is still brittle.

@jkatz
Copy link
Contributor Author

jkatz commented Jan 11, 2024

@xfalcox Yes, I'd be interested in pursuing it -- the RFC is out there to collect use cases before adding to the code 😄 I don't think a PR would be too difficult.

xfalcox added a commit to discourse/discourse-ai that referenced this issue Jan 12, 2024
Fixes a regression from 140359c which caused we to set this globally based on post count, rendering the cost of an index scan on the topics table too high and making the planner, correctly, not use the index anymore.

Hopefully pgvector/pgvector#235 lands soon.
xfalcox added a commit to discourse/discourse-ai that referenced this issue Jan 12, 2024
Fixes a regression from 140359c which caused we to set this globally based on post count, rendering the cost of an index scan on the topics table too high and making the planner, correctly, not use the index anymore.

Hopefully pgvector/pgvector#235 lands soon.
@khalidMindee
Copy link

Just to upvote on the use case n°1 mentioned (#235 (comment)) where multiple tables are there in the same database that uses different default probes parameters instead of specifiying the probes in query time to avoid the associated overhead

BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ...
COMMIT;

@xfalcox
Copy link

xfalcox commented Feb 28, 2024

@xfalcox Yes, I'd be interested in pursuing it -- the RFC is out there to collect use cases before adding to the code 😄 I don't think a PR would be too difficult.

We keep getting reports of problems caused by the lack of this feature.

On @discourse we automatically create vectors for new topics and posts, and since this is based on dynamic user generated content, we have a background task that updates the index and the ivfflat.probes parameter accordingly in the background.

For the parameter, we set it following the instructions from @ankane here, however we received bug reports from people running with a user without SUPERUSER permissions, which breaks setting the parameter with ALTER DATABASE.

We also have multiple vector tables, and each one needs a different parameter so, once again, I'd love to see this parameter default feature become a reality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants