
Update cost estimation to not use index when expected tuples is too low #424

Open · wants to merge 4 commits into master
Conversation

ankane
Member

@ankane ankane commented Jan 21, 2024

Hi all, wanted to get feedback on this change.

This PR updates cost estimation to not use an index if the expected number of tuples to be returned is less than the number requested by the user. A few situations where this happens are:

  1. A large % of rows being filtered - No results when using index #263
  2. Limit + offset > ef_search (HNSW) - SELECT query not using (HNSW) index #396
  3. Limit + offset > probes * vectors per list (IVFFlat) - Does ivvflat index has max query result limit? #405

See TAP tests 018 through 021 for specific queries. The costs are set so users can still override this with SET enable_seqscan = off; (except in the case of no LIMIT).
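
For concreteness, a minimal sketch of the query shapes involved (table and column names here are hypothetical, not from the TAP tests):

```sql
-- Hypothetical table with an HNSW index on the vector column
CREATE TABLE items (id bigserial PRIMARY KEY, category int, embedding vector(3));
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- 1. Highly selective filter: the index yields ef_search candidates,
--    most of which are then filtered out (#263)
SELECT * FROM items WHERE category = 123
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- 2. LIMIT + OFFSET exceeding hnsw.ef_search (default 40) (#396)
SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT 10 OFFSET 40;

-- Overriding the new cost estimate to force the index scan anyway
SET enable_seqscan = off;
```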

Let me know if you have any thoughts.

{
    RestrictInfo *rinfo = lfirst(lc);

    if (rinfo->norm_selec >= 0 && rinfo->norm_selec <= 1 && rinfo->norm_selec != (Selectivity) DEFAULT_INEQ_SEL)
Member Author


The DEFAULT_INEQ_SEL check ensures the index is still used for WHERE v <-> '[1,2,3]' < 1 (when there's an ORDER BY and LIMIT). This also means it's used for WHERE v <-> '[1,2,3]' > 1, which shouldn't really use an index (however, it currently does in existing releases).
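
A sketch of the two query shapes described (table and column names hypothetical):

```sql
-- Still uses the index: the distance filter is compatible with the
-- ORDER BY direction, so an index scan makes sense
SELECT * FROM items WHERE embedding <-> '[1,2,3]' < 1
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;

-- Also passes the DEFAULT_INEQ_SEL check, although an index scan is not
-- really appropriate here (matches behavior in existing releases)
SELECT * FROM items WHERE embedding <-> '[1,2,3]' > 1
ORDER BY embedding <-> '[1,2,3]' LIMIT 5;
```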

Contributor


Is there a case where rinfo->norm_selec falls outside of [0.0,1.0]?

Member Author


From pathnodes.h:

selectivity for "normal" (JOIN_INNER) semantics; -1 if not yet set; >1 means a redundant clause

I'm not sure if there are cases where it'll be out of that range when they get to this function, but seems better to be safe.

Comment on lines +114 to +115
*indexStartupCost = 1.0e10 - 1;
*indexTotalCost = 1.0e10 - 1;
Contributor

@jkatz jkatz Jan 21, 2024


Suggestion: a comment that this value is derived from disable_cost in src/backend/optimizer/path/costsize.c, should that value ever change. Perhaps also move it up into a const at the top of this file or one of the shared headers.

Member Author


Sounds good - will add more comments / make const once we decide on an approach.

@jkatz
Contributor

jkatz commented Jan 21, 2024

Left some minor stylistic comments.

I'm still thinking through this one and need to test. I primarily stared at HNSW for a bit. I agree with the initial clause that if there's no limit, then don't use the index. However, I'm still grappling with a few concepts:

  • Selectivity is low but the column in question is not indexed, which could yield a faster pgvector index scan. OTOH, that still could yield results that are not relevant. There's an argument that the other column should be indexed, but we can't necessarily assume that doing so makes sense for that domain. This leads to...
  • Impact on HQANN, where in theory selectivity is low but we want a targeted scan through the index

I think for a bunch of cases, HQANN would help produce a fast result in the absence of an alternative filter. I saw an example with a LIKE '%...%' where that selectivity score would matter more. But as mentioned, I'm still thinking / looking for ideas on the above.

@hlinnaka
Contributor

This approach seems pretty inflexible to me. Off the top of my head, some scenarios where this can go wrong:

  1. There might be a LIMIT at the top of the query, but the index scan is part of a larger plan. Consider a partitioned table, for example. The hnsw.ef_search parameter would apply to each scan separately, so the combined end result can have many more results, which might be fine. Or UNION. Or a join that produces many output rows for each row from the index scan.

  2. There might be joins which filter rows, not just plain quals on the table itself.

  3. LIMIT can be a parameter, e.g. LIMIT $1, not known at planning time.

  4. There is no LIMIT, but the client only fetches the first X rows from the cursor.

  5. Selectivity estimates can be very poor. If the planner estimates that a clause is very selective, but in reality it's not, this would heavily favor a seqscan even though an index scan would be appropriate. This is of course always a problem with planning any query, but this PR introduces a very strong bias on what is considered.
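
Sketches of a couple of these scenarios (table names hypothetical):

```sql
-- Scenario 1: index scan below the LIMIT node; hnsw.ef_search applies to
-- each partition's scan separately, so the combined result can be larger
SELECT * FROM items_partitioned
ORDER BY embedding <-> '[1,2,3]' LIMIT 100;

-- Scenario 3: LIMIT as a bind parameter, unknown at planning time
PREPARE q (int) AS
  SELECT * FROM items ORDER BY embedding <-> '[1,2,3]' LIMIT $1;
EXECUTE q(50);
```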

@ankane
Member Author

ankane commented Jan 22, 2024

Thanks for the feedback.

@jkatz The issue with low selectivity using a vector index is the query will likely return few or no results, which happens now (#263). Also, HQANN should be able to have its own logic if needed.

@hlinnaka Will try to do some testing around 1 - 3. I think it's probably fine to not use an index for 4 (users who want to use an index can add a limit) and 5 (users can run ANALYZE or use extended statistics to try and fix).

@ankane
Member Author

ankane commented Jan 22, 2024

Apart from the specific implementation, I'd be curious to hear thoughts on whether the test cases make sense. For instance, does it make sense to use an index for the following query with the default ef_search of 40?

SELECT * FROM tst ORDER BY v <-> '[1,2,3]' LIMIT 41; -- or 10000

@jkatz
Contributor

jkatz commented Jan 22, 2024

The issue with low selectivity using a vector index is the query will likely return few or no results, which happens now (#263).

@ankane I understand that -- but I'm not sure if measuring on selectivity alone will account for this. For example, a selectivity of 0.05 on a dataset of 100K has a far higher chance of not returning enough tuples vs. a dataset of 100MM.

Apart from the specific implementation, I'd be curious to hear thoughts on whether the test cases make sense. For instance, does it make sense to use an index for the following query with the default ef_search of 40?

SELECT * FROM tst ORDER BY v <-> '[1,2,3]' LIMIT 41; -- or 10000

🤔 I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.

@ankane
Member Author

ankane commented Jan 22, 2024

I'm not sure if measuring on selectivity alone will account for this. For example, a selectivity of 0.05 on a dataset of 100K has a far higher chance of not returning enough tuples vs. a dataset of 100MM.

The odds should be the same with the same ef_search. The index will find ef_search results, and only 5% of them will match.
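
The arithmetic behind this claim can be sketched as follows (assuming candidates come only from the ef_search results returned by the index, and the filter matches them independently of table size):

```python
# The HNSW index scan yields at most ef_search candidates regardless of
# how many rows the table has, so with the same selectivity the expected
# number of surviving rows is the same for both table sizes.
ef_search = 40
selectivity = 0.05

for table_rows in (100_000, 100_000_000):
    expected_matches = ef_search * selectivity
    print(f"{table_rows} rows -> ~{expected_matches:.0f} expected matches")
    # both sizes print ~2 expected matches
```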

I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.

That'd be another option (but not sure I like it as much as the others).

@jkatz
Contributor

jkatz commented Jan 22, 2024

The odds should be the same with the same ef_search. The index will find ef_search results, and only 5% of them will match.

I'm not following. The selectivity in the patch is being measured from the reduction in tuples in the table based on the non-IVFFLAT/HNSW filters. I do agree that the odds are the same if we're only considering selectivity. But if we're considering the actual cardinality, the odds do change. In the example I gave, 5% of 100K is 5K, whereas 5% of 100MM is 5MM, which is a far greater set of tuples to search from.

But -- and as this patch would address -- if the filtering is occurring as a result of the index, then flipping to a different index scan would increase the number of tuples returned, though possibly at a performance cost.

I wonder if another approach would be to error when there's a LIMIT/ef_search mismatch.
That'd be another option (but not sure I like it as much as the others).

This at least gives a deterministic error that an app developer can adapt to. This would be encountered at programming time, and the developer can then adapt the query to use a higher hnsw.ef_search value, which could be adapted into the code. This avoids a "surprise" behavior in a query plan flip that the developer doesn't expect, and as a result has to do additional debugging.
