
Update cost estimation to not use index when expected tuples is too low #424

Open · wants to merge 4 commits into master
Conversation

ankane
Member

@ankane ankane commented Jan 21, 2024

Hi all, wanted to get feedback on this change.

This PR updates cost estimation to not use an index if the expected number of tuples to be returned is less than the number requested by the user. A few situations where this happens are:

  1. A large % of rows being filtered - No results when using index #263
  2. Limit + offset > ef_search (HNSW) - SELECT query not using (HNSW) index #396
  3. Limit + offset > probes * vectors per list (IVFFlat) - Does ivvflat index has max query result limit? #405

See TAP tests 018 through 021 for specific queries. The costs are set so users can still override this with SET enable_seqscan = off; (except in the case of no LIMIT).
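
For concreteness, a minimal sketch of the query shapes involved (table and column names here are hypothetical, not from the TAP tests):

```sql
-- Hypothetical table with an HNSW index on the vector column
CREATE TABLE items (id bigserial PRIMARY KEY, category int, embedding vector(3));
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- 1. Highly selective filter: the index yields ef_search candidates,
--    most of which are then filtered out (#263)
SELECT * FROM items WHERE category = 123
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- 2. LIMIT + OFFSET exceeding hnsw.ef_search (default 40) (#396)
SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 10 OFFSET 40;

-- Overriding the new cost estimate to force the index scan anyway
SET enable_seqscan = off;
```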

Let me know if you have any thoughts.

{
    RestrictInfo *rinfo = lfirst(lc);

    if (rinfo->norm_selec >= 0 && rinfo->norm_selec <= 1 && rinfo->norm_selec != (Selectivity) DEFAULT_INEQ_SEL)
Member Author


The DEFAULT_INEQ_SEL check ensures the index is still used for WHERE v <-> '[1,2,3]' < 1 (when there's an ORDER BY and LIMIT). This also means it's used for WHERE v <-> '[1,2,3]' > 1, which shouldn't really use an index (however, it currently does in existing releases).
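
A sketch of the two query shapes described (table and column names hypothetical):

```sql
-- Still uses the index: the distance filter is compatible with the
-- ORDER BY direction, so an index scan makes sense
SELECT * FROM items WHERE embedding <-> '[1,2,3]' < 1
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- Also passes the DEFAULT_INEQ_SEL check, although an index scan is not
-- really appropriate here (matches behavior in existing releases)
SELECT * FROM items WHERE embedding <-> '[1,2,3]' > 1
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;
```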

Contributor


Is there a case where rinfo->norm_selec falls outside of [0.0,1.0]?

Member Author


From pathnodes.h:

selectivity for "normal" (JOIN_INNER) semantics; -1 if not yet set; >1 means a redundant clause

I'm not sure if there are cases where it'll be out of that range when they get to this function, but seems better to be safe.

Comment on lines +114 to +115
*indexStartupCost = 1.0e10 - 1;
*indexTotalCost = 1.0e10 - 1;
Contributor

@jkatz jkatz Jan 21, 2024


Suggestion: a comment that this value is derived from disable_cost in src/backend/optimizer/path/costsize.c, should that value ever change. Perhaps also move it up into a const at the top of this file or one of the shared headers.

Member Author


Sounds good - will add more comments / make const once we decide on an approach.

@jkatz
Contributor

jkatz commented Jan 21, 2024

Left some minor stylistic comments.

I'm still thinking through this one and need to test. I primarily stared at HNSW for a bit. I agree with the initial clause that if there's no limit, then don't use the index. However, I'm still grappling with a few concepts:

  • Selectivity is low but the column in question is not indexed, which could yield a faster pgvector index scan. OTOH, that still could yield results that are not relevant. There's an argument that the other column should be indexed, but we can't necessarily assume that doing so makes sense for that domain. This leads to...
  • Impact on HQANN, where in theory selectivity is low but we want a targeted scan through the index

I think for a bunch of cases, HQANN would help produce a fast result in the absence of an alternative filter. I saw an example with a LIKE '%...%' where that selectivity score would matter more. But as mentioned, I'm still thinking / looking for ideas on the above.

@hlinnaka
Contributor

This approach seems pretty inflexible to me. Off the top of my head, some scenarios where this can go wrong:

  1. There might be a LIMIT at the top of the query, but the index scan is part of a larger plan. Consider a partitioned table, for example. The hnsw.ef_search parameter would apply to each scan separately, so the combined end result can have many more results, which might be fine. Or UNION. Or a join that produces many output rows for each row from the index scan.

  2. There might be joins which filter rows, not just plain quals on the table itself.

  3. LIMIT can be a parameter, e.g. LIMIT $1, not known at planning time.

  4. There is no LIMIT, but the client only fetches the first X rows from the cursor.

  5. Selectivity estimates can be very poor. If the planner estimates that a clause is very selective, but in reality it's not, this would heavily favor a seqscan even though an index scan would be appropriate. This is of course always a problem with planning any query, but this PR introduces a very strong bias on what is considered.
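
Sketches of a couple of these scenarios (table names hypothetical):

```sql
-- Scenario 1: index scan below the LIMIT node; hnsw.ef_search applies to
-- each partition's scan separately, so the combined result can be larger
SELECT * FROM items_partitioned
ORDER BY embedding <-> '[1,2,3]' LIMIT 100;

-- Scenario 3: LIMIT as a bind parameter, unknown at planning time
PREPARE q (int) AS
  SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT $1;
EXECUTE q(50);
```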

@ankane
Member Author

ankane commented Jan 22, 2024

Thanks for the feedback.

@jkatz The issue with low selectivity using a vector index is the query will likely return few or no results, which happens now (#263). Also, HQANN should be able to have its own logic if needed.

@hlinnaka Will try to do some testing around 1 - 3. I think it's probably fine to not use an index for 4 (users who want to use an index can add a limit) and 5 (users can run ANALYZE or use extended statistics to try and fix).

@ankane
Member Author

ankane commented Jan 22, 2024

Apart from the specific implementation, I'd be curious to hear thoughts on whether the test cases make sense. For instance, does it make sense to use an index for the following query with the default ef_search of 40?

SELECT * FROM tst ORDER BY v <-> '[1,2,3]' LIMIT 41; -- or 10000

@jkatz
Contributor

jkatz commented Jan 22, 2024

The issue with low selectivity using a vector index is the query will likely return few or no results, which happens now (#263).

@ankane I understand that -- but I'm not sure if measuring on selectivity alone will account for this. For example, a selectivity of 0.05 on a dataset of 100K has a far higher chance of not returning enough tuples vs. a dataset of 100MM.

Apart from the specific implementation, I'd be curious to hear thoughts on whether the test cases make sense. For instance, does it make sense to use an index for the following query with the default ef_search of 40?

SELECT * FROM tst ORDER BY v <-> '[1,2,3]' LIMIT 41; -- or 10000

🤔 I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.

@ankane
Member Author

ankane commented Jan 22, 2024

I'm not sure if measuring on selectivity alone will account for this. For example, a selectivity of 0.05 on a dataset of 100K has a far higher chance of not returning enough tuples vs. a dataset of 100MM.

The odds should be the same with the same ef_search. The index will find ef_search results, and only 5% of them will match.
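
The arithmetic behind this claim can be sketched as follows (assuming candidates come only from the ef_search results returned by the index, and the filter matches them independently of table size):

```python
# The HNSW index scan yields at most ef_search candidates regardless of
# how many rows the table has, so with the same selectivity the expected
# number of surviving rows is the same for both table sizes.
ef_search = 40
selectivity = 0.05

for table_rows in (100_000, 100_000_000):
    expected_matches = ef_search * selectivity
    print(f"{table_rows} rows -> ~{expected_matches:.0f} expected matches")
    # both sizes print ~2 expected matches
```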

I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.

That'd be another option (but not sure I like it as much as the others).

@jkatz
Contributor

jkatz commented Jan 22, 2024

The odds should be the same with the same ef_search. The index will find ef_search results, and only 5% of them will match.

I'm not following. The selectivity in the patch is being measured from the reduction in tuples in the table based on the non-IVFFLAT/HNSW filters. I do agree that the odds are the same if we're only considering selectivity. But if we're considering the actual cardinality, the odds do change. In the example I gave, 5% of 100K is 5K, whereas 5% of 100MM is 5MM, which is a far greater set of tuples to search from.

But -- and as this patch would address -- if the filtering is occurring as a result of the index, then flipping to a different index scan would increase the number of tuples returned, though possibly at a performance cost.

I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.
That'd be another option (but not sure I like it as much as the others).

This at least gives a deterministic error that an app developer can adapt to. This would be encountered at programming time, and the developer can then adapt the query to use a higher hnsw.ef_search value, which could be adapted into the code. This avoids a "surprise" behavior in a query plan flip that the developer doesn't expect, and as a result has to do additional debugging.
