Can I feed 500K documents to rank_bm25? #27

ramsey-coding · 2022-08-25T03:11:52Z

Thanks for this awesome library.

I am curious to know whether rank_bm25 can handle 500K documents. Each document has around 1000 words.

Looking forward to your feedback. I want to use the following functionality with rank_bm25:

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)


query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=1)

print(result)


ramsey-coding · 2022-08-26T09:56:13Z

@Witiko can you please provide any insight?

Witiko · 2022-08-26T10:42:19Z

@ramsey-coding I don't see a reason why it shouldn't. Have you tried?

ramsey-coding · 2022-08-27T07:45:41Z

@Witiko the problem is that each call to bm25.get_top_n is very slow :-(

It takes ~5 seconds per call.

ramsey-coding · 2022-08-27T07:59:53Z

@dorianbrown the library is slow when retrieving from ~350K samples. Can you please advise on what to do here?
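
One likely cause of the per-call latency, as far as can be told from the rank_bm25 source, is that get_scores loops in Python over every document for each query term. A minimal sketch of the usual alternative, an inverted index that only touches documents containing a query term, is shown below. The InvertedBM25 class and its top_n method are hypothetical names made up for this illustration and are not part of rank_bm25; the IDF formula is a common non-negative variant, so absolute scores will differ from BM25Okapi.

```python
import heapq
import math
from collections import Counter, defaultdict


class InvertedBM25:
    """BM25 scoring over an inverted index (hypothetical helper, not part of rank_bm25)."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_len = [len(doc) for doc in tokenized_corpus]
        self.n_docs = len(tokenized_corpus)
        self.avgdl = sum(self.doc_len) / self.n_docs
        # Map each term to its postings list: [(doc_id, term_frequency), ...]
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(tokenized_corpus):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))

    def idf(self, term):
        df = len(self.postings.get(term, ()))
        # Assumption: smoothed, always non-negative IDF; BM25Okapi uses a different variant.
        return math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))

    def top_n(self, tokenized_query, n=5):
        # Only documents that contain at least one query term are ever scored.
        scores = defaultdict(float)
        for term in tokenized_query:
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, ()):
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl)
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        # Partial selection of the n best (doc_id, score) pairs; no full sort needed.
        return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
index = InvertedBM25([doc.split(" ") for doc in corpus])
best = index.top_n("windy London".split(" "), n=1)
print(corpus[best[0][0]])
```

With this layout the work per query scales with the length of the postings lists for the query terms rather than with the corpus size, which is the same idea that dedicated search engines build on. Very frequent terms still have long postings lists, so this is a sketch of the approach rather than a tuned implementation.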

AmenRa · 2022-11-17T16:56:28Z

Hi @ramsey-coding,

I have just released a new Python-based search engine called retriv.
It only takes ~40ms to query 8M documents on my machine.
If you try it, please let me know whether it works for your use case.

nashid · 2022-11-17T21:00:48Z

@AmenRa I am also interested in this feature. I will try out retriv.

nocoolsandwich · 2023-04-19T02:59:18Z

Better to use Elasticsearch. The Python version can be slow enough to drive you crazy.

AmenRa · 2023-04-19T09:36:14Z

@nocoolsandwich

You should try my library retriv.
It takes 10 ms to search 10 million documents with BM25.

Witiko mentioned this issue Aug 29, 2022

Implement Okapi BM25 variants in Gensim piskvorky/gensim#3304

Merged
