Update cost estimation to not use index when expected tuples is too low #424
base: master
Conversation
    {
        RestrictInfo *rinfo = lfirst(lc);

        if (rinfo->norm_selec >= 0 && rinfo->norm_selec <= 1 && rinfo->norm_selec != (Selectivity) DEFAULT_INEQ_SEL)
The `DEFAULT_INEQ_SEL` check ensures the index is still used for `WHERE v <-> '[1,2,3]' < 1` (when there's an `ORDER BY` and `LIMIT`). This also means it's used for `WHERE v <-> '[1,2,3]' > 1`, which shouldn't really use an index (however, it currently does in existing releases).
Is there a case where `rinfo->norm_selec` falls outside of `[0.0, 1.0]`?
From `pathnodes.h`:

> selectivity for "normal" (JOIN_INNER) semantics; -1 if not yet set; >1 means a redundant clause
I'm not sure if there are cases where it'll be out of that range when they get to this function, but seems better to be safe.
    *indexStartupCost = 1.0e10 - 1;
    *indexTotalCost = 1.0e10 - 1;
Suggestion: add a comment noting that this value is derived from `disable_cost` in `src/backend/optimizer/path/costsize.c`, in case that value ever changes. Perhaps also move it up into a const at the top of this file or one of the shared headers.
Sounds good - will add more comments / make it a const once we decide on an approach.
Left some minor stylistic comments. I'm still thinking through this one and need to test. I primarily stared at HNSW for a bit. I agree with the initial clause that if there's no limit, then don't use the index. However, I'm still grappling with a few concepts:
I think for a bunch of cases, HQANN would help produce a fast … in the absence of an alternative filter. I saw an example with a …
This approach seems pretty inflexible to me. Off the top of my head, some scenarios where this can go wrong:
Thanks for the feedback. @jkatz The issue with low selectivity using a vector index is the query will likely return few or no results, which happens now (#263). Also, HQANN should be able to have its own logic if needed. @hlinnaka Will try to do some testing around 1 - 3. I think it's probably fine to not use an index for 4 (users who want to use an index can add a limit) and 5 (users can run …).
Apart from the specific implementation, I'd be curious to hear thoughts on whether the test cases make sense. For instance, does it make sense to use an index for the following query with the default …?

    SELECT * FROM tst ORDER BY v <-> '[1,2,3]' LIMIT 41; -- or 10000
@ankane I understand that -- but I'm not sure if measuring on selectivity alone will account for this. For example, a selectivity of 0.05 on a dataset of 100K has a far higher chance of not returning enough tuples vs. a dataset of 100MM.
🤔 I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.
The odds should be the same with the same …

That'd be another option (but not sure I like it as much as the others).
I'm not following. The selectivity in the patch is being measured from the reduction in tuples in the table based on the non-IVFFLAT/HNSW filters. I do agree that the odds are the same if we're only considering selectivity. But if we're considering the actual cardinality, the odds do change. In the example I gave, 5% of 100K is 5K, whereas 5% of 100MM is 5MM, which is a far greater set of tuples to search from. But -- and as this patch would address -- if the filtering is occurring as a result of the index, then flipping to a different index scan would increase the number of tuples returned, though possibly at a performance cost.
This at least gives a deterministic error that an app developer can adapt to. This would be encountered at programming time, and the developer can then adapt the query to use a higher …
Hi all, wanted to get feedback on this change.
This PR updates cost estimation to not use an index if the expected number of tuples to be returned is less than the number requested by the user. A few situations where this happens are:
See TAP tests 018 through 021 for specific queries. The costs are set so users can still override this with `SET enable_seqscan = off;` (except in the case of no `LIMIT`). Let me know if you have any thoughts.