Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying a table with jaccard similarity with 1.6 million records take 12 seconds #32

Open
crbsram opened this issue Sep 2, 2021 · 2 comments
Labels

Comments

@crbsram
Copy link

crbsram commented Sep 2, 2021

Hi Team

We have encountered a problem in our testing environment.
I have a scenario where I am running a similarity match for an address
I have created a table with following number of records
1603423
And I have created GIN index (since i am using pg_similarity) library jaccard similarity method
when I run the Explain analyze
EXPLAIN ANALYZE select complete_address, jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad 201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual time=12101.194..12101.197 rows=6 loops=1)
Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on address (cost=172.43..5770.80 rows=1603 width=93) (actual time=3516.233..12101.172 rows=6 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039186
Heap Blocks: exact=58018
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02 rows=1603 width=0) (actual time=256.037..256.037 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Planning Time: 0.141 ms
Execution Time: 12101.245 ms
(11 rows)
it took 12 seconds
following are my pgconf file settings
shared buffer as 4GB
work_mem 256 MB
maintenence_work_mem 512MB
autovacuum_work_mem 20MB
My index definition is this "address_complete_address_idx1" gin (complete_address gin_similarity_ops)

It is taking the index correctly. But why it took 12 seconds to process I really don't understand.

Please help.

Thanks
C.R.Bala

@eulerto
Copy link
Owner

eulerto commented Sep 5, 2021

Your WHERE clause has low selectivity (1039192 of 1603423 rows), hence, Bitmap Heap Scan has to read and discard more than 1 million rows (actual time=3516.233..12101.172).

If you add BUFFERS as an EXPLAIN option, it shows how many rows pages it is reading.

@crbsram
Copy link
Author

crbsram commented Sep 6, 2021

Hi Eulerto

Thanks for your reply.

Please find the stats as given below:
dedupechecksample=# EXPLAIN (ANALYZE, BUFFERS) select complete_address, jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017') as qsim from address where complete_address~??'raj nagar ext near ajnara integrity up ghaziabad 201017' order by qsim DESC;
QUERY PLAN

Sort (cost=5856.13..5860.14 rows=1603 width=93) (actual time=12023.801..12023.803 rows=0 loops=1)
Sort Key: (jaccard(complete_address, 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=58306
-> Bitmap Heap Scan on address20 (cost=172.43..5770.80 rows=1603 width=93) (actual time=12023.793..12023.794 rows=0 loops=1)
Recheck Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Rows Removed by Index Recheck: 1039192
Heap Blocks: exact=58018
Buffers: shared hit=58306
-> Bitmap Index Scan on address_complete_address_idx1 (cost=0.00..172.02 rows=1603 width=0) (actual time=260.443..260.443 rows=1039192 loops=1)
Index Cond: (complete_address ~?? 'raj nagar ext near ajnara integrity up ghaziabad 201017'::text)
Buffers: shared hit=288
Planning:
Buffers: shared hit=1
Planning Time: 0.129 ms
Execution Time: 12025.412 ms
(16 rows)

Am I missing anything here?
If I need to improve the performance any other settings has to be done.
Please help here.
If this can't be improved they are thinking of alternatives to use faiss library.

Thanks
C.R.Bala

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants