Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lack of result when selecting data without limit #553

Closed
zhrt123 opened this issue May 10, 2024 · 1 comment
Closed

Lack of result when selecting data without limit #553

zhrt123 opened this issue May 10, 2024 · 1 comment

Comments

@zhrt123
Copy link

zhrt123 commented May 10, 2024

Backgroud

For Perl test 017_hnsw_filtering.pl and 009_ivfflat_filtering.pl, if the number of rows is small (less than 1000), the results of tests are different from the expected results. (expect Seq scan, but get Index scan.)

PostgreSQL version: 16.1
pgvector version: 0.7.0

Problem

I executed the following SQL statements:

CREATE EXTENSION vector;
CREATE TABLE tst (i int4, v vector(3), c int4, t text);
INSERT INTO tst SELECT i, ARRAY[2 + random(),3 + random(),4 + random()], i % 50, 'test ' || i FROM generate_series(1, 1000) i;
CREATE INDEX idx ON tst USING hnsw (v  vector_l2_ops);
ANALYZE tst;

and then analyze this SQL statement:

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                                        QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------  
Gather Motion 3:1  (slice1; segments: 3)  (cost=4.48..133.70 rows=3333 width=12) (actual time=2.017..2.019 rows=0 loops=1)  
  Merge Key: ((v <-> '[1,2,3]'::vector))  
  ->  Index Scan using idx on tst  (cost=4.48..89.26 rows=1111 width=12) (actual time=0.000..1.209 rows=0 loops=1)  
        Order By: (v <-> '[1,2,3]'::vector)  
        Filter: ((v <-> '[1,2,3]'::vector) < '1'::double precision)  
Optimizer: Postgres-based planner  
Planning Time: 1.154 ms  
  (slice0)    Executor memory: 41K bytes.  
  (slice1)    Executor memory: 269K bytes avg x 3 workers, 284K bytes max (seg1).  
Memory used:  128000kB  
Execution Time: 3.202 ms  
(11 rows)

The planner wants to execute the query with Index scan. However, our index hnsw limited the output to 40 candidates, which caused us to lose some results of the select query. In my experiment:

  • return 40 rows with Index scan.
  • return 1000 rows with Seq scan.

Result comparison

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Sort  (cost=3.41..3.50 rows=33 width=12) (actual time=0.218..0.235 rows=100 loops=1)
   Sort Key: ((v <-> '[1,2,3]'::vector))
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on tst  (cost=0.00..2.58 rows=33 width=12) (actual time=0.025..0.122 rows=100 loops=1)
         Filter: ((v <-> '[1,2,3]'::vector) > '1'::double precision)
 Planning Time: 0.215 ms
 Execution Time: 0.294 ms
(7 rows)

postgres=# SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
(1000 rows)

postgres=# set enable_seqscan = off;
SET

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Index Scan using idx on tst  (cost=12.36..17.94 rows=33 width=12) (actual time=0.428..0.462 rows=40 loops=1)
   Order By: (v <-> '[1,2,3]'::vector)
   Filter: ((v <-> '[1,2,3]'::vector) > '1'::double precision)
 Planning Time: 0.173 ms
 Execution Time: 0.508 ms
(5 rows)

postgres=# SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
(40 rows)

Conclusion

We can use SET enable_indexscan = off to avoid using Index Scan for select queries without limit. But the Perl test
may mislead the developer that pgvector won't use Index Scan in this scene.

# Test distance filtering without limit
$explain = $node->safe_psql("postgres", qq(
	EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '$query' < 1 ORDER BY v <-> '$query';
));
like($explain, qr/Seq Scan/);
@ankane
Copy link
Member

ankane commented Jun 1, 2024

Hi @zhrt123, I'm not able to reproduce the results in the "Problem" section, but it appears you're using a modified version of Postgres based on the EXPLAIN output.

@ankane ankane closed this as completed Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants