Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Add FilterRetriever #6836

Merged
merged 12 commits into from
Feb 8, 2024
Merged

Feat: Add FilterRetriever #6836

merged 12 commits into from
Feb 8, 2024

Conversation

bglearning
Copy link
Contributor

@bglearning bglearning commented Jan 26, 2024

Proposed Changes:

Porting FilterRetriever from v1.

How did you test it?

Unit tests

Notes for the reviewer

Discussion notes for following points attached later on in this PR:

Q1) Should we have top_k as an argument or not?

Option 1: Don't include. Just return all docs

Option 2.1: Include. Simply return the first top_k docs.

Option 2.2: Include. Apply some sort of seeded random sampling on top.

Option 3: Include at the DocumentStore level

  • Thought: Ideally this would be handled at the DocumentStore level as fetching all docs for a filter and then FilterRetriever just taking top_k docs could be much more inefficient.
    • But then top_k isn't part of the DocumentStore.filter_documents protocol. Could add it but maybe would bloat it as perhaps not all DocumentStores support this "sampling with filter"

Leaning towards Option-1, excluding top_k.

Q2) Runtime input metadata value filtering

Idea: For a (init)-specified metadata (e.g. "category"), for convenience we want to allow users to provide a value (e.g. "sports") at runtime. Note: for instance, would be needed as part of FileSimilaritRetriever (#5629)

Option 1: Make FilterRetriever flexible enough for this. An attribute like filter_meta_key at __init__ and then run(filter_meta_value: str,...) which would form the corresponding filter
filters = {"field": self.filter_meta_key, "operator": "==", "value": filter_meta_value} and pass it onto the document_store. Issue: increases complexity in the component.

Option 2: Create another retriever EqualsFilterRetriever inheriting from FilterRetriever and overwriting run

Option 3: Create a EqualsFilterGenerator to create the filter and connect it to FilterRetriever.filters

Slightly leaning towards Option3 though such a component feels too small/specific.

Q3) Lazy run/evaluation of the component

Wondering if there is a way to setup the component to only run if there is a downstream component needing its output.

E.g. Possible usage: the FilterRetriever is in one of many optional branches. And we would only want to run it if the branch is followed (e.g. for a certain type of query). I guess this can be generalized to any "inputless" component.

One such setup could be:

filter_branch

Here FilterRetriever may run even when it's not necessary.

Currently leaning towards letting this pass

Q4) location: haystack.components.retrievers.filter_retriever is fine?

Checklist

@github-actions github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Jan 26, 2024
@bglearning bglearning requested a review from sjrl January 26, 2024 17:01
@coveralls
Copy link
Collaborator

coveralls commented Jan 26, 2024

Pull Request Test Coverage Report for Build 7826222442

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.007%) to 88.226%

Totals Coverage Status
Change from base Build 7817342911: 0.007%
Covered Lines: 4773
Relevant Lines: 5410

💛 - Coveralls

@bglearning
Copy link
Contributor Author

Notes/decisions from discussion with @sjrl

Q1) Should we have top_k as an argument or not?

Decision: exclude.

Reason: Isn't very aligned with the idea of a FilterRetriever. Plus underlying document-store could handle (or not handle) it differently.

Q2) Runtime input metadata value filtering

Decision: leave out.

Reason: This is a convenience feature primarily aimed at supporting the proposed FileSimilarityRetriever.
Not necessary to be part of the FilterRetriever. Plus even the convenience components like EqualsFilterGenerator probably shouldn't be in haystack. We can just setup FilterRetriever to receive the full filter and leave it up to the pipeline invoker (could be the UI/frontend) to handle the filter construction.

Q3) Lazy run/evaluation of the component

Decision: Ignore for now/this PR.

Reason: Broader topic. Can be tackled separately.

Q4) location: haystack.components.retrievers.filter_retriever is fine?

Decision: okay. Probably fine.

@bglearning bglearning marked this pull request as ready for review January 30, 2024 16:09
@bglearning bglearning requested review from a team as code owners January 30, 2024 16:09
@bglearning bglearning requested review from dfokina and anakin87 and removed request for a team January 30, 2024 16:09
@bglearning
Copy link
Contributor Author

Sidenote: Copied over the class type inference of the document_store for from_dict from DocumentWriter.

try:
    module_name, type_ = init_params["document_store"]["type"].rsplit(".", 1)
    logger.debug("Trying to import %s", module_name)
    module = importlib.import_module(module_name)
except (ImportError, DeserializationError) as e:
    raise DeserializationError(
        f"DocumentStore of type '{init_params['document_store']['type']}' not correctly imported"
    ) from e

docstore_class = getattr(module, type_)

Maybe could be abstracted out somehow if this and surrounding blocks in from_dict start proliferating.

@anakin87
Copy link
Member

Hey, @bglearning, thanks for the PR, and sorry for the wait...
I am thinking about some aspects, then I will do the review.

Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @bglearning!

Generally, this PR looks good (I left a comment in tests).

test/components/retrievers/test_filter_retriever.py Outdated Show resolved Hide resolved
@dfokina
Copy link
Contributor

dfokina commented Feb 2, 2024

As @anakin87 mentioned it would be great to include a code example in the docstrings, see here for inspiration:

@anakin87 anakin87 self-requested a review February 8, 2024 07:29
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, @bglearning!

I have taken the liberty to make some final refinements.
Once the tests pass, I will merge this PR...

@anakin87 anakin87 merged commit 74683fe into main Feb 8, 2024
23 checks passed
@anakin87 anakin87 deleted the filter-retriever branch February 8, 2024 07:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants