Lack of result when selecting data without limit #553

zhrt123 · 2024-05-10T07:12:59Z

Backgroud

For Perl test 017_hnsw_filtering.pl and 009_ivfflat_filtering.pl, if the number of rows is small (less than 1000), the results of tests are different from the expected results. (expect Seq scan, but get Index scan.)

PostgreSQL version: 16.1
pgvector version: 0.7.0

Problem

I executed the following SQL statements:

CREATE EXTENSION vector;
CREATE TABLE tst (i int4, v vector(3), c int4, t text);
INSERT INTO tst SELECT i, ARRAY[2 + random(),3 + random(),4 + random()], i % 50, 'test ' || i FROM generate_series(1, 1000) i;
CREATE INDEX idx ON tst USING hnsw (v  vector_l2_ops);
ANALYZE tst;

and then analyze this SQL statement:

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                                        QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------  
Gather Motion 3:1  (slice1; segments: 3)  (cost=4.48..133.70 rows=3333 width=12) (actual time=2.017..2.019 rows=0 loops=1)  
  Merge Key: ((v <-> '[1,2,3]'::vector))  
  ->  Index Scan using idx on tst  (cost=4.48..89.26 rows=1111 width=12) (actual time=0.000..1.209 rows=0 loops=1)  
        Order By: (v <-> '[1,2,3]'::vector)  
        Filter: ((v <-> '[1,2,3]'::vector) < '1'::double precision)  
Optimizer: Postgres-based planner  
Planning Time: 1.154 ms  
  (slice0)    Executor memory: 41K bytes.  
  (slice1)    Executor memory: 269K bytes avg x 3 workers, 284K bytes max (seg1).  
Memory used:  128000kB  
Execution Time: 3.202 ms  
(11 rows)

The planner wants to execute the query with Index scan. However, our index hnsw limited the output to 40 candidates, which caused us to lose some results of the select query. In my experiment:

return 40 rows with Index scan.
return 1000 rows with Seq scan.

Result comparison

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Sort  (cost=3.41..3.50 rows=33 width=12) (actual time=0.218..0.235 rows=100 loops=1)
   Sort Key: ((v <-> '[1,2,3]'::vector))
   Sort Method: quicksort  Memory: 28kB
   ->  Seq Scan on tst  (cost=0.00..2.58 rows=33 width=12) (actual time=0.025..0.122 rows=100 loops=1)
         Filter: ((v <-> '[1,2,3]'::vector) > '1'::double precision)
 Planning Time: 0.215 ms
 Execution Time: 0.294 ms
(7 rows)

postgres=# SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
(1000 rows)

postgres=# set enable_seqscan = off;
SET

postgres=# EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Index Scan using idx on tst  (cost=12.36..17.94 rows=33 width=12) (actual time=0.428..0.462 rows=40 loops=1)
   Order By: (v <-> '[1,2,3]'::vector)
   Filter: ((v <-> '[1,2,3]'::vector) > '1'::double precision)
 Planning Time: 0.173 ms
 Execution Time: 0.508 ms
(5 rows)

postgres=# SELECT i FROM tst WHERE v <-> '[1,2,3]' > 1 order by v <-> '[1,2,3]';  
(40 rows)

Conclusion

We can use SET enable_indexscan = off to avoid using Index Scan for select queries without limit. But the Perl test
may mislead the developer that pgvector won't use Index Scan in this scene.

# Test distance filtering without limit
$explain = $node->safe_psql("postgres", qq(
	EXPLAIN ANALYZE SELECT i FROM tst WHERE v <-> '$query' < 1 ORDER BY v <-> '$query';
));
like($explain, qr/Seq Scan/);

The text was updated successfully, but these errors were encountered:

ankane · 2024-06-01T15:49:37Z

Hi @zhrt123, I'm not able to reproduce the results in the "Problem" section, but it appears you're using a modified version of Postgres based on the EXPLAIN output.

ankane closed this as completed Jun 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lack of result when selecting data without limit #553

Lack of result when selecting data without limit #553

zhrt123 commented May 10, 2024 •

edited

ankane commented Jun 1, 2024

Lack of result when selecting data without limit #553

Lack of result when selecting data without limit #553

Comments

zhrt123 commented May 10, 2024 • edited

Backgroud

Problem

Result comparison

Conclusion

ankane commented Jun 1, 2024

zhrt123 commented May 10, 2024 •

edited