BM25 should not consider repeated query tokens #19

Open
Witiko opened this issue Mar 4, 2022 · 2 comments

Comments

@Witiko
Contributor

Witiko commented Mar 4, 2022

@dorianbrown In the seminal paper for this package, the Okapi at TREC-3 paper, and most other places, BM25 is defined over query terms rather than query tokens, which suggests that repeated query tokens should not affect the score. However, that does not seem to be the case in the rank-bm25 library, whose scoring loop iterates over every token of the query:

for q in query:

The user can easily work around this by passing set(query) [1] rather than query to the get_scores() method, but deduplication seems like something the user would expect to happen automatically. At the very least, we may want to document this.

[1] Alternatively, list(dict.fromkeys(query)) for reproducible ordering, since floating-point summation is not always associative.
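To illustrate the footnote, here is a minimal sketch of the order-preserving deduplication and of why the order of summation can matter. The helper name `unique_tokens` and the sample query are made up for this example and are not part of rank-bm25.

```python
def unique_tokens(query):
    """Drop repeated tokens, keeping the order of first occurrence.

    Hypothetical helper for illustration; not part of rank-bm25.
    dict keys preserve insertion order (Python 3.7+), unlike set().
    """
    return list(dict.fromkeys(query))


query = ["quick", "fox", "quick", "dog", "fox"]
deduped = unique_tokens(query)  # ['quick', 'fox', 'dog']

# Floating-point addition is not associative, so summing the same
# per-term scores in a different order can give a different result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left  # True
```

The deduplicated list can then be passed to get_scores() in place of the raw query. Since set() does not guarantee iteration order across runs for strings, the dict.fromkeys() variant is the safer choice when scores need to be bit-for-bit reproducible.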

@dzieciou

dzieciou commented Oct 31, 2023

In the PyTerrier/Terrier implementation of BM25, the number of times a query term is repeated matters. Repetition is often used to upweight certain query terms. Look at how the scoring changes:

image

@Witiko
Contributor Author

Witiko commented Nov 4, 2023

@dzieciou Different implementations of BM25, including (Py)Terrier, take different liberties with the original algorithm. If the rank_bm25 library is to implement the algorithm as it was originally described, then it should treat the query terms as a set, not as a multiset.

However, I am satisfied with the behavior being documented, at least in an open issue on GitHub if not elsewhere.
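For readers who want to see the set-versus-multiset difference concretely, below is a self-contained toy scorer. It is a sketch only, not the rank_bm25 source: the function name `bm25_scores`, the default values of `k1` and `b`, the smoothed IDF variant, and the tiny corpus are all assumptions chosen for illustration. Like the loop quoted above, it iterates over query tokens, so a repeated token contributes once per repetition.

```python
import math
from collections import Counter


def bm25_scores(corpus, query, k1=1.5, b=0.75):
    """Toy BM25 over tokenized documents (illustrative, not rank_bm25).

    Iterates over query *tokens*, so repeats are counted again.
    Uses the smoothed IDF log((N - df + 0.5) / (df + 0.5) + 1),
    which is an assumption made here to keep weights positive.
    """
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for q in query:
            if q not in tf:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1)
            norm = tf[q] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [["quick", "brown", "fox"], ["lazy", "dog"], ["fox", "and", "dog"]]

# Multiset semantics: "fox" appears twice and is counted twice.
as_multiset = bm25_scores(corpus, ["fox", "fox", "dog"])
# Set semantics: deduplicate first, keeping first-occurrence order.
as_set = bm25_scores(corpus, list(dict.fromkeys(["fox", "fox", "dog"])))
```

In this sketch, the first document matches only "fox", so its multiset score is twice its set score, while the second document matches only "dog" and scores the same either way. That is the upweighting effect described for (Py)Terrier, and deduplicating the query before scoring removes it.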
