
Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric #12997

Merged
merged 6 commits into run-llama:main on May 1, 2024

Conversation

@AgenP (Contributor) commented Apr 21, 2024

Side notes:

  1. Quick thanks to all the devs who've worked on LlamaIndex. It has been instrumental in supercharging my ability to build 💪
  2. I'd love any feedback that I can iterate on, to help get this merged

Description

HitRate edit (HitRate@K)

  • Changed to allow non-binary scoring, to widen the evaluative value of this metric (from my perspective)

  • Example: retrieving 5 of 10 expected docs scores 0.5 (instead of 1.0), and retrieving 2 of 10 scores 0.2 (instead of 1.0), allowing a much more detailed view of how many of the expected documents were retrieved (see the sketch below)
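
A minimal sketch of this granular hit-rate idea (a standalone helper for illustration only, not the actual LlamaIndex method; the denominator follows the example above and is an assumption, not the exact merged code):

from typing import List

def granular_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Return the fraction of expected documents that appear among the retrieved documents."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("expected_ids and retrieved_ids must both be non-empty")
    # A set gives O(1) average-time membership checks (see "Other changes made" below).
    hits = len(set(expected_ids) & set(retrieved_ids))
    return hits / len(expected_ids)

# Example from the description: 5 of 10 expected docs retrieved -> 0.5
print(granular_hit_rate([f"doc_{i}" for i in range(10)], [f"doc_{i}" for i in range(5)]))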

RR edit

  • RR: The original MRR implementation breaks out after computing a single reciprocal rank, so I renamed it RR for clarity

MRR edit (MRR@K)

  • MRR: MRR now calculates the mean reciprocal rank across all relevant retrieved docs within a single call.
  • Example: 2 of 3 retrieved docs are relevant, at the 1st and 3rd positions. This scores (1 + 1/3) / 2 ≈ 0.67 (see the sketch below).
  • Idea - new name? More precisely, it could be called single-query mean reciprocal rank (SQMRR) for clarity
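
A minimal sketch of this per-query MRR calculation (again a standalone helper for illustration, not the exact class method that was merged):

from typing import List

def granular_mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Average the reciprocal ranks of every relevant document in the retrieved list."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("expected_ids and retrieved_ids must both be non-empty")
    expected_set = set(expected_ids)
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_set
    ]
    if not reciprocal_ranks:
        return 0.0  # no relevant document was retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example from the description: relevant docs at ranks 1 and 3 out of 3 retrieved.
print(granular_mrr(["a", "c"], ["a", "b", "c"]))  # (1 + 1/3) / 2 ≈ 0.67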

Other changes made

  • Used a set for the expected IDs to speed up membership checks compared with a list (only in HitRate and MRR). Future implementation note: a set should NOT be used in any metric where the order of the expected IDs matters, since casting a list to a set may change the order of the elements (see the sketch after this list)
  • Added error handling that raises a ValueError if an empty list is passed for the retrieved IDs or expected IDs
  • Removed unused parameters
  • Added RR to metric registry
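
A quick illustration of the set trade-off noted above (illustrative IDs only):

expected_ids = ["doc_3", "doc_1", "doc_2"]

# Membership checks are O(n) on a list but O(1) on average for a set.
expected_set = set(expected_ids)
print("doc_2" in expected_set)  # True, constant-time lookup

# Caveat: a set does not preserve insertion order, so it must not replace the
# list in any metric where the rank/order of the expected IDs matters.
print(list(expected_set))  # element order may not match the original list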

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 21, 2024
@AgenP AgenP marked this pull request as draft April 21, 2024 13:16
@AgenP AgenP changed the title Metrics PR: Updating HitRate and MRR for mueval. Adding RR as separate metric Metrics PR: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics PR: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved and making RR a separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved and making RR a separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Apr 21, 2024
@AgenP AgenP marked this pull request as ready for review April 21, 2024 14:26
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Apr 21, 2024
@@ -12,45 +12,51 @@


class HitRate(BaseRetrievalMetric):
"""Hit rate metric."""
"""Hit rate metric: Compute the proportion of matches between retrieved documents and expected documents."""
Contributor

I like the metric, but I'm not sure I like it as a replacement for HitRate, since "Hit" implies a binary nature.

Perhaps we can call this MeanDocumentMatch or something that represents an average of proportions.

Contributor

To be clear, I am recommending leaving HitRate as is and creating a net new BaseRetrievalMetric

Contributor Author (@AgenP) commented Apr 23, 2024

That’s a good point. For the proposed implementation, would you not say that technically each document gets a binary representation in the score (since each doc is hit, or not hit)?

Then the summation and division allow for the accounting, and comparison, of multiple docs in the score.

Note: I realise this would alter the usage pattern, compared to past HitRate implementations, so that adds a cost to this idea!

However, in one metric, we could get the binary representation and also avoid the information loss caused by having only 1 or 0 as scoring options, regardless of the number of docs used.

Let me know if you think this is worth it

Contributor

Yeah, each doc definitely gets assigned a binary value, i.e. hit or no hit. But yeah, the usage pattern would unfortunately be broken here if we go down to that level of granularity with HitRate.

Perhaps you're suggesting some sort of parameter to control the calculation of the HitRate, in which case I think that would be fine. We could make the default calculation the current computation, and allow the user to specify if they would like to use the more granular calculation.

Contributor Author

Great idea!

I'll work on that

expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
expected_texts: Optional[List[str]] = None,
Contributor

I'm not entirely sure if these can be removed. Did you check if removing these args still satisfies the compute method signature for BaseRetrievalMetric?

Contributor Author (@AgenP) commented Apr 23, 2024

Yep, you are correct. I did some research, and removing them would violate the Liskov substitution principle, which is not ideal, particularly with an ABC and its abstract methods.

When making the changes, I will ensure the method signatures are kept the same

Thanks for highlighting this

Contributor

Thanks!

Comment on lines 70 to 74
class MRR(BaseRetrievalMetric):
"""Mean Reciprocal Rank (MRR): Sums up the reciprocal rank score for each relevant retrieved document.
Then divides by the count of relevant documents.
"""

Contributor

I think this will break the current way we invoke MRR. While I agree that RR and this MRR computation are more technically correct, I feel it's a bit pedantic (and I'm pretty pedantic myself!) and might not be worth the hassle of ensuring backwards compatibility and risking breaking changes.

Contributor Author (@AgenP) commented Apr 23, 2024

Yeah that makes sense. I completely understand.

What do you think about having a net new metric called SQMRR (single query mean reciprocal rank)?

It would boost the flexibility of the evaluation suite for devs.

Contributor

Would SQMRR ultimately compute the MRR, though? i.e., after we take an average of all SQMRR values?

Contributor Author

Yeah, to a certain degree. It would be a more granular version (since it scores based on multiple docs each time).

I could implement it as an option, using your parameter idea, in case the user wants multiple docs scored each time.

Let me know what you think

Contributor

Okay, let's roll with that.

@nerdai (Contributor) left a comment

Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.

@AgenP (Contributor Author) commented Apr 23, 2024

> Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.

Hey @nerdai, thank you very much for the feedback. I’ve sent in my responses to your comments

Here is a quick summary of my proposed action steps

  1. Revert any changes I've made to the compute method signature
  2. Create a net new SQMRR metric for additional flexibility
  3. Keep HitRate as the same metric, but with the new implementation (justification above)
  4. Remove RR from metrics (including metric registry)

Let me know if these are all good to be implemented, or if there is room for improvement.

@nerdai (Contributor) commented Apr 24, 2024

> Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.
>
> Hey @nerdai, thank you very much for the feedback. I’ve sent in my responses to your comments
>
> Here is a quick summary of my proposed action steps
>
>   1. Revert any changes I've made to the compute method signature
>   2. Create a net new SQMRR metric for additional flexibility
>   3. Keep HitRate as the same metric, but with the new implementation (justification above)
>   4. Remove RR from metrics (including metric registry)
>
> Let me know if these are all good to be implemented, or if there is room for improvement.

I think it makes sense :) . I left replies to your latest comments!

@AgenP (Contributor Author) commented Apr 26, 2024

Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented

MRR and HitRate changes

  • compute method signatures are now the same as BaseRetrievalMetric
  • Both have a granular implementation option through usage of a kwarg
  • Detailed docstrings added to enhance explainability with these new changes

RR removed

  • Old proposed, separate RR metric removed (both as a class, and from the metric registry)

Testing & Formatting

  • New unit tests all pass, no additional warnings generated
  • Formatting/Linting handled by pre-commit hooks

Lmk if there are any issues 💪

@nerdai (Contributor) commented Apr 26, 2024

> Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented
>
> MRR and HitRate changes
>
>   • compute method signatures are now the same as BaseRetrievalMetric
>   • Both have a granular implementation option through usage of a kwarg
>   • Detailed docstrings added to enhance explainability with these new changes
>
> RR removed
>
>   • Old proposed, separate RR metric removed (both as a class, and from the metric registry)
>
> Testing & Formatting
>
>   • New unit tests all pass, no additional warnings generated
>   • Formatting/Linting handled by pre-commit hooks
>
> Lmk if there are any issues 💪

Amazing, thanks for the thorough summary @AgenP!

@nerdai (Contributor) left a comment

Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute

)

# Determining which implementation to use based on `use_granular_hit_rate` kwarg
use_granular = kwargs.get("use_granular_hit_rate", False)
Contributor

instead of using kwargs, maybe we should just create a class or instance attribute use_granular?

)

# Determining which implementation to use based on `use_granular_mrr` kwarg
use_granular_mrr = kwargs.get("use_granular_mrr", False)
Contributor

same comment as above, maybe we define this as a class/instance attribute?

Contributor

that way we still satisfy the superclass method signature

@AgenP (Contributor Author) commented Apr 29, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute

Thanks @nerdai!

My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.

What do you think?

@nerdai (Contributor) commented Apr 30, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute
>
> Thanks @nerdai!
>
> My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.
>
> What do you think?

Sorry for the late reply!

Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs. I think it's harder for the user to know about this option. So, a way to not break the superclass's compute signature is to just create a class/instance attribute for these params instead:

class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_mrr: bool = False

    ...

    def compute(...):
          if self.use_granular_mrr:
               ....

How about something like this?
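
For readers following along, a hedged usage sketch of this attribute-based toggle. The attribute name follows the use_granular_hit_rate kwarg shown in the diff above, and the import path is assumed from the repository layout at the time; the exact names in the merged code may differ.

# Assumed import path; adjust to wherever HitRate lives in your llama-index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# Default behaviour: the original binary hit rate, unchanged for existing users.
metric = HitRate()
result = metric.compute(expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"])

# Opt in to the granular, proportion-based calculation via an instance attribute,
# so compute() keeps the exact signature of BaseRetrievalMetric.
granular_metric = HitRate(use_granular_hit_rate=True)  # attribute name assumed from the diff
granular_result = granular_metric.compute(
    expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"]
)

# Assuming compute() returns a RetrievalMetricResult exposing a .score field.
print(result.score, granular_result.score)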

@AgenP (Contributor Author) commented May 1, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute
>
> Thanks @nerdai!
> My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.
> What do you think?
>
> Sorry for the late reply!
>
> Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs. I think it's harder for the user to know about this option. So, a way to not break the superclass's compute signature is to just create a class/instance attribute for these params instead:
>
> class HitRate(BaseRetrievalMetric):
>     """Hit rate metric."""
>
>     metric_name: str = "hit_rate"
>     use_granular_mrr: bool = False
>
>     ...
>
>     def compute(...):
>           if self.use_granular_mrr:
>                ....
>
> How about something like this?

No worries whatsoever

That makes complete sense. Thank you for highlighting this

The new iteration has been committed!

@nerdai (Contributor) left a comment

Thanks @AgenP! LGTM!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 1, 2024
@nerdai (Contributor) commented May 1, 2024

@AgenP thanks again -- merging this for us now :)

@nerdai nerdai merged commit b5a57ca into run-llama:main May 1, 2024
8 checks passed