
Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric #12997

Merged
merged 6 commits into run-llama:main on May 1, 2024

Conversation

@AgenP (Contributor) commented Apr 21, 2024

Side notes:

  1. Quick thanks to all the devs who've worked on LlamaIndex. It has been instrumental in supercharging my ability to build 💪
  2. I'd love any feedback that I can iterate on, to help get this merged

Description

HitRate edit (HitRate@K)

  • Changed to allow non-binary scoring, to widen the evaluative value of this metric (from my perspective)

  • Example: retrieving 5 of 10 expected docs scores 0.5 (instead of 1.0), and retrieving 2 of 10 scores 0.2 (instead of 1.0), allowing a much more detailed view of how many of the expected documents were retrieved (see the sketch below)
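
A minimal sketch of this granular hit-rate idea (a standalone helper for illustration only, not the actual LlamaIndex method; the denominator follows the example above and is an assumption, not the exact merged code):

from typing import List

def granular_hit_rate(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Return the fraction of expected documents that appear among the retrieved documents."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("expected_ids and retrieved_ids must both be non-empty")
    # A set gives O(1) average-time membership checks (see "Other changes made" below).
    hits = len(set(expected_ids) & set(retrieved_ids))
    return hits / len(expected_ids)

# Example from the description: 5 of 10 expected docs retrieved -> 0.5
print(granular_hit_rate([f"doc_{i}" for i in range(10)], [f"doc_{i}" for i in range(5)]))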

RR edit

  • RR: The original MRR implementation breaks out after computing a single reciprocal rank, so I renamed it RR for clarity

MRR edit (MRR@K)

  • MRR: MRR now calculates the mean reciprocal rank across all relevant retrieved docs within a single call.
  • Example: 2 of 3 retrieved docs are relevant, at the 1st and 3rd positions. This scores (1 + 1/3) / 2 ≈ 0.67 (see the sketch below).
  • Idea - new name? More precisely, it could be called single-query mean reciprocal rank (SQMRR) for clarity
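
A minimal sketch of this per-query MRR calculation (again a standalone helper for illustration, not the exact class method that was merged):

from typing import List

def granular_mrr(expected_ids: List[str], retrieved_ids: List[str]) -> float:
    """Average the reciprocal ranks of every relevant document in the retrieved list."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("expected_ids and retrieved_ids must both be non-empty")
    expected_set = set(expected_ids)
    reciprocal_ranks = [
        1.0 / rank
        for rank, doc_id in enumerate(retrieved_ids, start=1)
        if doc_id in expected_set
    ]
    if not reciprocal_ranks:
        return 0.0  # no relevant document was retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example from the description: relevant docs at ranks 1 and 3 out of 3 retrieved.
print(granular_mrr(["a", "c"], ["a", "b", "c"]))  # (1 + 1/3) / 2 ≈ 0.67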

Other changes made

  • Used a set for the expected IDs to speed up membership checks compared with a list (only in HitRate and MRR). Future implementation note: a set should NOT be used in any metric where the order of the expected IDs matters, since casting a list to a set may change the order of the elements (see the sketch after this list)
  • Added error handling that raises a ValueError if an empty list is passed for the retrieved IDs or expected IDs
  • Removed unused parameters
  • Added RR to metric registry
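
A quick illustration of the set trade-off noted above (illustrative IDs only):

expected_ids = ["doc_3", "doc_1", "doc_2"]

# Membership checks are O(n) on a list but O(1) on average for a set.
expected_set = set(expected_ids)
print("doc_2" in expected_set)  # True, constant-time lookup

# Caveat: a set does not preserve insertion order, so it must not replace the
# list in any metric where the rank/order of the expected IDs matters.
print(list(expected_set))  # element order may not match the original list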

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 21, 2024
@AgenP AgenP marked this pull request as draft April 21, 2024 13:16
@AgenP AgenP changed the title Metrics PR: Updating HitRate and MRR for mueval. Adding RR as separate metric Metrics PR: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics PR: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Adding RR as separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved and making RR a separate metric Apr 21, 2024
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved and making RR a separate metric Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Apr 21, 2024
@AgenP AgenP marked this pull request as ready for review April 21, 2024 14:26
@AgenP AgenP changed the title Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Retrieval Metrics: Updating HitRate and MRR for Evaluation@K documents retrieved. Also adding RR as a separate metric Apr 21, 2024
@@ -12,45 +12,51 @@


class HitRate(BaseRetrievalMetric):
"""Hit rate metric."""
"""Hit rate metric: Compute the proportion of matches between retrieved documents and expected documents."""
Contributor

I like the metric, but I'm not sure I like it as a replacement for HitRate, since "Hit" implies a binary nature.

Perhaps we can call this MeanDocumentMatch or something that represents an average of proportions.

Contributor

To be clear, I am recommending leaving HitRate as is and creating a net new BaseRetrievalMetric

Contributor Author (@AgenP) commented Apr 23, 2024

That’s a good point. For the proposed implementation, would you not say that technically each document gets a binary representation in the score (since each doc is hit, or not hit)?

Then the summation and division allow for the accounting, and comparison, of multiple docs in the score.

Note: I realise this would alter the usage pattern, compared to past HitRate implementations, so that adds a cost to this idea!

However, in one metric, we could get the binary representation and also avoid the information loss caused by having only 1 or 0 as scoring options, regardless of the number of docs used.

Let me know if you think this is worth it

Contributor

Yeah, each doc definitely gets assigned a binary value, i.e. hit or no hit. But yeah, the usage pattern would unfortunately be broken here if we go down to that level of granularity with HitRate.

Perhaps you're suggesting some sort of parameter to control the calculation of the HitRate, in which case I think that would be fine. We could make the default calculation the current computation, and allow the user to specify if they would like to use the more granular calculation.

Contributor Author

Great idea!

I'll work on that

expected_ids: Optional[List[str]] = None,
retrieved_ids: Optional[List[str]] = None,
expected_texts: Optional[List[str]] = None,
Contributor

I'm not entirely sure if these can be removed. Did you check if removing these args still satisfies the compute method signature for BaseRetrievalMetric?

Contributor Author (@AgenP) commented Apr 23, 2024

Yep, you are correct. I did some research, and removing them would violate the Liskov substitution principle, which is not ideal, particularly with an ABC and its abstract methods.

When making the changes, I will ensure the method signatures are kept the same

Thanks for highlighting this

Contributor

Thanks!

Comment on lines 70 to 74
class MRR(BaseRetrievalMetric):
"""Mean Reciprocal Rank (MRR): Sums up the reciprocal rank score for each relevant retrieved document.
Then divides by the count of relevant documents.
"""

Contributor

I think this will break the current way we invoke MRR. While I agree that RR and this MRR computation are more technically correct, I feel it's a bit pedantic (and I'm pretty pedantic myself!) and might not be worth the hassle of ensuring backwards compatibility and risking breaking changes.

Contributor Author (@AgenP) commented Apr 23, 2024

Yeah that makes sense. I completely understand.

What do you think about having a net new metric called SQMRR (single query mean reciprocal rank)?

It would boost the flexibility of the evaluation suite for devs.

Contributor

Would SQMRR ultimately compute the MRR, though? i.e., after we take an average of all SQMRR values?

Contributor Author

Yeah, to a certain degree. It would be a more granular version (since it scores based on multiple docs each time).

I could implement it as an option, using your parameter idea, in case the user wants multiple docs scored each time.

Let me know what you think

Contributor

Okay, let's roll with that.

@nerdai (Contributor) left a comment

Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.

@AgenP (Contributor Author) commented Apr 23, 2024

> Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.

Hey @nerdai, thank you very much for the feedback. I’ve sent in my responses to your comments

Here is a quick summary of my proposed action steps

  1. Revert any changes I've made to the compute method signature
  2. Create a net new SQMRR metric for additional flexibility
  3. Keep HitRate as the same metric, but with the new implementation (justification above)
  4. Remove RR from metrics (including metric registry)

Let me know if these are all good to be implemented, or if there is room for improvement.

@nerdai (Contributor) commented Apr 24, 2024

> Thanks for the PR @AgenP. I have left some comments. I agree with you on RR / MRR, but I don't think that change is worth the breaking changes. I like the new "HitRate" metric, but we should leave the original one as is and create a net new one.
>
> Hey @nerdai, thank you very much for the feedback. I’ve sent in my responses to your comments
>
> Here is a quick summary of my proposed action steps
>
>   1. Revert any changes I've made to the compute method signature
>   2. Create a net new SQMRR metric for additional flexibility
>   3. Keep HitRate as the same metric, but with the new implementation (justification above)
>   4. Remove RR from metrics (including metric registry)
>
> Let me know if these are all good to be implemented, or if there is room for improvement.

I think it makes sense :) . I left replies to your latest comments!

@AgenP (Contributor Author) commented Apr 26, 2024

Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented

MRR and HitRate changes

  • compute method signatures are now the same as BaseRetrievalMetric
  • Both have a granular implementation option through usage of a kwarg
  • Detailed docstrings added to enhance explainability with these new changes

RR removed

  • Old proposed, separate RR metric removed (both as a class, and from the metric registry)

Testing & Formatting

  • New unit tests all pass, no additional warnings generated
  • Formatting/Linting handled by pre-commit hooks

Lmk if there are any issues 💪

@nerdai (Contributor) commented Apr 26, 2024

> Hey @nerdai, the iterations have been made! Here is a summary of the changes I've now implemented
>
> MRR and HitRate changes
>
>   • compute method signatures are now the same as BaseRetrievalMetric
>   • Both have a granular implementation option through usage of a kwarg
>   • Detailed docstrings added to enhance explainability with these new changes
>
> RR removed
>
>   • Old proposed, separate RR metric removed (both as a class, and from the metric registry)
>
> Testing & Formatting
>
>   • New unit tests all pass, no additional warnings generated
>   • Formatting/Linting handled by pre-commit hooks
>
> Lmk if there are any issues 💪

Amazing, thanks for the thorough summary @AgenP!

@nerdai (Contributor) left a comment

Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute

)

# Determining which implementation to use based on `use_granular_hit_rate` kwarg
use_granular = kwargs.get("use_granular_hit_rate", False)
Contributor

instead of using kwargs, maybe we should just create a class or instance attribute use_granular?

)

# Determining which implementation to use based on `use_granular_mrr` kwarg
use_granular_mrr = kwargs.get("use_granular_mrr", False)
Contributor

same comment as above, maybe we define this as a class/instance attribute?

Contributor

that way we still satisfy the superclass method signature

@AgenP (Contributor Author) commented Apr 29, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute

Thanks @nerdai!

My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.

What do you think?

@nerdai (Contributor) commented Apr 30, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute
>
> Thanks @nerdai!
>
> My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.
>
> What do you think?

Sorry for the late reply!

Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs. I think it's harder for the user to know about this option. So, a way to not break the superclass's compute signature is to just create a class/instance attribute for these params instead:

class HitRate(BaseRetrievalMetric):
    """Hit rate metric."""

    metric_name: str = "hit_rate"
    use_granular_mrr: bool = False

    ...

    def compute(...):
          if self.use_granular_mrr:
               ....

How about something like this?
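
For readers following along, a hedged usage sketch of this attribute-based toggle. The attribute name follows the use_granular_hit_rate kwarg shown in the diff above, and the import path is assumed from the repository layout at the time; the exact names in the merged code may differ.

# Assumed import path; adjust to wherever HitRate lives in your llama-index version.
from llama_index.core.evaluation.retrieval.metrics import HitRate

# Default behaviour: the original binary hit rate, unchanged for existing users.
metric = HitRate()
result = metric.compute(expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"])

# Opt in to the granular, proportion-based calculation via an instance attribute,
# so compute() keeps the exact signature of BaseRetrievalMetric.
granular_metric = HitRate(use_granular_hit_rate=True)  # attribute name assumed from the diff
granular_result = granular_metric.compute(
    expected_ids=["id1", "id2"], retrieved_ids=["id2", "id3"]
)

# Assuming compute() returns a RetrievalMetricResult exposing a .score field.
print(result.score, granular_result.score)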

@AgenP (Contributor Author) commented May 1, 2024

> Looks very good! I just think that maybe instead of using kwargs on compute to learn if we should use granular compute or not, maybe we just make it an instance attribute
>
> Thanks @nerdai!
> My thinking with this was that since the BaseRetrievalMetric superclass' compute method also has **kwargs as a parameter, I thought that using a kwarg to distinguish the two implementations keeps the method signature the same.
> What do you think?
>
> Sorry for the late reply!
>
> Yeah, I just think it's a bit odd that we have to tuck these args away in kwargs. I think it's harder for the user to know about this option. So, a way to not break the superclass's compute signature is to just create a class/instance attribute for these params instead:
>
> class HitRate(BaseRetrievalMetric):
>     """Hit rate metric."""
>
>     metric_name: str = "hit_rate"
>     use_granular_mrr: bool = False
>
>     ...
>
>     def compute(...):
>           if self.use_granular_mrr:
>                ....
>
> How about something like this?

No worries whatsoever

That makes complete sense. Thank you for highlighting this

The new iteration has been committed!

@nerdai (Contributor) left a comment

Thanks @AgenP! LGTM!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 1, 2024
@nerdai (Contributor) commented May 1, 2024

@AgenP thanks again -- merging this for us now :)

@nerdai nerdai merged commit b5a57ca into run-llama:main May 1, 2024
8 checks passed