Debug BM25Okapi #26

LowinLi · 2022-08-04T05:19:48Z

In the "_calc_idf" method of "BM25Okapi", if average_idf is negative, eps is also negative, so the BM25 score comes out negative as well. This commit fixes that error.

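To make the failure mode concrete, here is a minimal sketch of the logic described above. It is an assumption-laden stand-in, not the library's actual code: the function name `calc_idf_with_floor`, the `epsilon=0.25` default, and the toy document frequencies are all hypothetical.

```python
import math


def calc_idf_with_floor(doc_freqs, corpus_size, epsilon=0.25):
    """Hypothetical stand-in for the IDF step described in this PR.

    doc_freqs maps each term to the number of documents containing it.
    Negative IDF values are replaced by eps = epsilon * average_idf.
    """
    idf = {}
    negative_terms = []
    idf_sum = 0.0
    for term, n in doc_freqs.items():
        value = math.log(corpus_size - n + 0.5) - math.log(n + 0.5)
        idf[term] = value
        idf_sum += value
        if value < 0:
            negative_terms.append(term)

    average_idf = idf_sum / len(idf)
    eps = epsilon * average_idf  # negative whenever average_idf is negative

    # The "floor" meant to replace negative values is itself negative here.
    for term in negative_terms:
        idf[term] = eps
    return idf, average_idf, eps


# Toy corpus of 10 documents in which every term is very common (assumed data).
doc_freqs = {"the": 9, "of": 8, "bm25": 6}
idf, average_idf, eps = calc_idf_with_floor(doc_freqs, corpus_size=10)
print(average_idf < 0, eps < 0)  # True True
```

In this toy case every term has a negative IDF, so the average is negative and the replacement value inherits the sign. Any query score built from these values would be negative too.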

dorianbrown · 2024-05-28T13:39:52Z

I think I finally found where this motivation came from, namely this section from here:

Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be $\log\frac{10 - 6 + 0.5}{6 + 0.5} = \log\frac{4.5}{6.5}$.

Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.

Add a 1 to the equation: $\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$
For terms that result in a negative IDF value, swap it with a small positive value, usually denoted as epsilon
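The quoted example and the two adjustments can be checked numerically. This is only an illustrative sketch: the function names and the `floor=1e-3` value are hypothetical and are not taken from any package.

```python
import math


def idf_classic(N, n):
    """Classic BM25 IDF: N documents in the corpus, n of them contain the term."""
    return math.log((N - n + 0.5) / (n + 0.5))


def idf_plus_one(N, n):
    """Variant that adds 1 inside the logarithm."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def idf_floored(N, n, floor=1e-3):
    """Variant that swaps a negative IDF for a small positive value (assumed floor)."""
    value = idf_classic(N, n)
    return floor if value < 0 else value


N, n = 10, 6  # the quoted example: "the" appears in 6 of 10 documents
print(idf_classic(N, n))   # log(4.5 / 6.5), about -0.37
print(idf_plus_one(N, n))  # log(1 + 4.5 / 6.5), about 0.53
print(idf_floored(N, n))   # 0.001
```

Note that with the classic formula a term appearing in exactly half of the documents gets an IDF of zero, for example `idf_classic(10, 5)`, which matches the zero-score behaviour described in the linked issue.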

dorianbrown · 2024-05-28T13:44:11Z

I wonder if it might be simpler to just go with the "smoothed" IDF function $\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$, which ensures that IDFs are always positive. That way we don't have to do all this checking for negative values.

What do you think?
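A short check of why the smoothed form stays positive, assuming only that the document frequency satisfies $0 \le N(q_i) \le N$:

```latex
\frac{N - N(q_i) + 0.5}{N(q_i) + 0.5} \;\ge\; \frac{0.5}{N + 0.5} \;>\; 0
\quad\Longrightarrow\quad
\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)
\;\ge\; \log\left(1 + \frac{0.5}{N + 0.5}\right) \;>\; 0
```

The ratio is smallest when the term appears in every document, and even then the argument of the logarithm is strictly greater than 1, so no separate negativity check is needed.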

Debug BM25Okapi

b150eb9


dorianbrown self-requested a review May 28, 2024 12:19

dorianbrown mentioned this pull request May 28, 2024

fix issue 39: Score is 0 when a token is in exactly 50% of the documents. #40

Open
