
Support for vectorized/batch inference? #18

Open
Smu-Tan opened this issue Mar 1, 2022 · 6 comments

Comments

@Smu-Tan

Smu-Tan commented Mar 1, 2022

Hi, I'm just wondering: is there any method to speed up the retrieval process, for example vectorized or batch inference (i.e., doing the retrieval for a batch/list of queries at the same time)?

I'm trying to use BM25 to retrieve the top-n docs for large data (over 10k queries against 50k docs), and if I do this by calling bm25.get_top_n() in a for loop, the inference time is unacceptably long.

@dorianbrown
Owner

Have you checked out the get_batch_scores method yet? It sounds like this might be what you're looking for.

@Smu-Tan
Author

Smu-Tan commented Mar 2, 2022

> Have you checked out the get_batch_scores method yet? It sounds like this might be what you're looking for.

I think get_batch_scores computes the BM25 scores between one query and a subset of the corpus. What I need is to compute the BM25 scores between a list of queries and the whole corpus, and because the query list is huge (10k queries), computing them one at a time is very slow.
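One way to get true multi-query batching (not part of rank_bm25's API; a minimal NumPy sketch of the standard Okapi BM25 formula, with a +1 inside the log so idf weights stay non-negative) is to precompute a docs × vocab matrix of per-term BM25 contributions once, after which scoring every query is a single matrix multiplication and top-n extraction is an argpartition:

```python
import numpy as np

def build_bm25_matrix(corpus_tokens, k1=1.5, b=0.75):
    """Precompute a (docs x vocab) matrix of per-term BM25 contributions."""
    vocab = {t: i for i, t in enumerate(sorted({t for doc in corpus_tokens for t in doc}))}
    tf = np.zeros((len(corpus_tokens), len(vocab)))
    for di, doc in enumerate(corpus_tokens):
        for t in doc:
            tf[di, vocab[t]] += 1.0
    doc_len = tf.sum(axis=1)
    avgdl = doc_len.mean()
    df = (tf > 0).sum(axis=0)
    # Okapi idf with +1 smoothing so very common terms don't go negative
    idf = np.log(1.0 + (len(corpus_tokens) - df + 0.5) / (df + 0.5))
    denom = tf + (k1 * (1.0 - b + b * doc_len / avgdl))[:, None]
    return idf * tf * (k1 + 1.0) / denom, vocab

def batch_top_n(queries_tokens, doc_term_scores, vocab, n=10):
    """Score all queries at once; return (queries x n) indices of top docs."""
    Q = np.zeros((len(queries_tokens), len(vocab)))
    for qi, q in enumerate(queries_tokens):
        for t in q:
            if t in vocab:
                Q[qi, vocab[t]] += 1.0
    scores = Q @ doc_term_scores.T                     # (queries x docs)
    n = min(n, scores.shape[1])
    top = np.argpartition(-scores, n - 1, axis=1)[:, :n]
    # order each row's top-n candidates by descending score
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, top], axis=1)
    return top[rows, order]
```

For dense matrices this costs memory proportional to docs × vocab, so at 50k docs a `scipy.sparse` term matrix would be the practical choice; the scoring step stays a single sparse-dense product.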

@puzzlecollector

Has this problem been resolved? I am having the same sort of issue: I have 50k queries and it takes a long time to compute (approx. 150k seconds, almost 42 hours).

@wise-east

@Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?

@Smu-Tan
Author

Smu-Tan commented Sep 2, 2022

> @Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?

Check out Pyserini.
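For anyone staying with an in-process NumPy pipeline instead, a further stopgap is to split the query matrix into chunks and score them on a thread pool; NumPy releases the GIL inside the matmul, so chunks can overlap. An illustrative sketch, where `Q` and `S` stand for a hypothetical query-term matrix and a precomputed document-term score matrix (not any library's API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_scores(Q, S, n_chunks=4):
    """Multiply query chunks against S.T on a thread pool."""
    chunks = np.array_split(Q, n_chunks)  # handles uneven splits
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(lambda chunk: chunk @ S.T, chunks))
    return np.vstack(parts)  # reassemble to (queries x docs)
```

Chunking also bounds peak memory: only one chunk's worth of the (queries × docs) result needs to be materialized at a time if the top-n is extracted per chunk.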

@AmenRa

AmenRa commented Nov 17, 2022

Hi @Smu-Tan, @puzzlecollector, and @wise-east,

I have just released a new Python-based search engine called retriv.
It only takes ~40ms to query 8M documents on my machine, and it can perform multiple searches in parallel.
If you try it, please let me know whether it works for your use case.
