Feat: Add FilterRetriever #6836

bglearning · 2024-01-26T16:58:54Z

Proposed Changes:

Porting FilterRetriever from v1.

How did you test it?

Unit tests

Notes for the reviewer

Discussion notes for the following points are attached later in this PR:

Q1) Should we have top_k as an argument or not?

Option 1: Don't include. Just return all docs

Option 2.1: Include. Simply return the first top_k docs.

Option 2.2: Include. Apply some sort of seeded random sampling on top.

Option 3: Include at the DocumentStore level

Thought: Ideally this would be handled at the DocumentStore level, as fetching all docs for a filter and then having FilterRetriever just take top_k docs could be much less efficient.
- But then top_k isn't part of the DocumentStore.filter_documents protocol. We could add it, but that might bloat the protocol, as perhaps not all DocumentStores support this "sampling with filter".

Leaning towards Option-1, excluding top_k.
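
For reference, Options 2.1 and 2.2 can be sketched on a plain list of already-fetched documents. This is only an illustration: `take_top_k` is a hypothetical helper, not part of this PR or of Haystack, and strings stand in for Document objects.

```python
import random
from typing import List, Optional


def take_top_k(docs: List[str], top_k: Optional[int] = None, seed: Optional[int] = None) -> List[str]:
    """Hypothetical helper illustrating Options 1, 2.1 and 2.2 (not part of the PR)."""
    # Option 1: no top_k given, return all docs.
    if top_k is None or top_k >= len(docs):
        return list(docs)
    if top_k < 0:
        raise ValueError("top_k must be non-negative")
    # Option 2.1: simply return the first top_k docs.
    if seed is None:
        return docs[:top_k]
    # Option 2.2: seeded random sampling, reproducible for a given seed.
    return random.Random(seed).sample(docs, top_k)


docs = ["doc-a", "doc-b", "doc-c", "doc-d"]
first_two = take_top_k(docs, top_k=2)
sampled_two = take_top_k(docs, top_k=2, seed=42)
```

Note that both variants still fetch every matching document first, which is the inefficiency mentioned above.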

Q2) Runtime input metadata value filtering

Idea: For an (init)-specified metadata field (e.g. "category"), we want to allow users to conveniently provide a value (e.g. "sports") at runtime. Note: this would be needed, for instance, as part of FileSimilarityRetriever (#5629)

Option 1: Make FilterRetriever flexible enough for this: an attribute like filter_meta_key at __init__, and then run(filter_meta_value: str, ...), which would form the corresponding filter
filters = {"field": self.filter_meta_key, "operator": "==", "value": filter_meta_value} and pass it on to the document_store. Issue: increases complexity in the component.

Option 2: Create another retriever EqualsFilterRetriever inheriting from FilterRetriever and overriding run

Option 3: Create an EqualsFilterGenerator to create the filter and connect it to FilterRetriever.filters

Slightly leaning towards Option 3, though such a component feels too small/specific.
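
A minimal sketch of the filter construction that all three options share. `build_equals_filter` is a hypothetical name used only for illustration; the dictionary shape is the one quoted in Option 1, and the exact field naming expected by a given document store is an assumption.

```python
from typing import Any, Dict


def build_equals_filter(meta_key: str, meta_value: Any) -> Dict[str, Any]:
    """Hypothetical helper: build an equality filter for one metadata field."""
    # Same shape as the filter quoted in Option 1 above.
    return {"field": meta_key, "operator": "==", "value": meta_value}


# The key would be fixed at init time, the value supplied at runtime.
filters = build_equals_filter("category", "sports")
```

Whichever option is chosen, the resulting dictionary is what would be passed on as `filters`.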

Q3) Lazy run/evaluation of the component

Wondering if there is a way to set up the component so that it only runs if a downstream component needs its output.

E.g. Possible usage: the FilterRetriever is in one of many optional branches. And we would only want to run it if the branch is followed (e.g. for a certain type of query). I guess this can be generalized to any "inputless" component.

One such setup could be:

Here FilterRetriever may run even when it's not necessary.

Currently leaning towards letting this pass
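
To make the idea concrete, here is a rough sketch of lazy evaluation in plain Python. It does not use the Haystack pipeline API: `LazyResult`, `fetch_filtered_docs` and `answer` are hypothetical names, and the "branch" is just an `if` on the query.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyResult(Generic[T]):
    """Hypothetical wrapper: runs the producer only when the output is requested."""

    def __init__(self, producer: Callable[[], T]) -> None:
        self._producer = producer
        self._computed = False
        self._value: Optional[T] = None

    def get(self) -> T:
        # Run the producer at most once, on first access.
        if not self._computed:
            self._value = self._producer()
            self._computed = True
        return self._value


def fetch_filtered_docs() -> List[str]:
    # Stand-in for an "inputless" component such as a filter-based retriever.
    return ["doc-1", "doc-2"]


def answer(query: str, lazy_docs: LazyResult) -> List[str]:
    # Only this branch needs the documents; other queries never trigger the fetch.
    if query.startswith("docs:"):
        return lazy_docs.get()
    return []


result = answer("docs: sports", LazyResult(fetch_filtered_docs))
```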

Q4) location: `haystack.components.retrievers.filter_retriever` is fine?

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-01-26T17:06:01Z

Pull Request Test Coverage Report for Build 7826222442

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.007%) to 88.226%

Totals
Change from base Build 7817342911:	0.007%
Covered Lines:	4773
Relevant Lines:	5410

💛 - Coveralls

bglearning · 2024-01-29T11:15:31Z

Notes/decisions from discussion with @sjrl

Q1) Should we have top_k as an argument or not?

Decision: exclude.

Reason: Isn't very aligned with the idea of a FilterRetriever. Plus underlying document-store could handle (or not handle) it differently.

Q2) Runtime input metadata value filtering

Decision: leave out.

Reason: This is a convenience feature primarily aimed at supporting the proposed FileSimilarityRetriever.
It doesn't need to be part of the FilterRetriever. Plus, even convenience components like EqualsFilterGenerator probably shouldn't be in haystack. We can just set up FilterRetriever to receive the full filter and leave it up to the pipeline invoker (could be the UI/frontend) to handle the filter construction.

Q3) Lazy run/evaluation of the component

Decision: Ignore for now/this PR.

Reason: Broader topic. Can be tackled separately.

Q4) location: `haystack.components.retrievers.filter_retriever` is fine?

Decision: okay. Probably fine.

bglearning · 2024-01-30T16:12:10Z

Sidenote: Copied over the class type inference of the document_store for from_dict from DocumentWriter.

try:
    module_name, type_ = init_params["document_store"]["type"].rsplit(".", 1)
    logger.debug("Trying to import %s", module_name)
    module = importlib.import_module(module_name)
except (ImportError, DeserializationError) as e:
    raise DeserializationError(
        f"DocumentStore of type '{init_params['document_store']['type']}' not correctly imported"
    ) from e

docstore_class = getattr(module, type_)

Maybe could be abstracted out somehow if this and surrounding blocks in from_dict start proliferating.
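
One possible shape for such an abstraction, as a sketch only. `import_class_by_name` is a hypothetical helper name, and `DeserializationError` is redefined locally as a stand-in so the snippet is self-contained; in the codebase the existing exception would be used instead.

```python
import importlib


class DeserializationError(Exception):
    """Local stand-in for the library's exception, for illustration only."""


def import_class_by_name(fully_qualified_name: str) -> type:
    """Hypothetical utility: resolve 'package.module.ClassName' to the class object."""
    try:
        # Split "package.module.ClassName" into module path and class name.
        module_name, type_ = fully_qualified_name.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, type_)
    except (ValueError, ImportError, AttributeError) as e:
        # ValueError: no dot in the name; AttributeError: class missing from module.
        raise DeserializationError(
            f"Class '{fully_qualified_name}' not correctly imported"
        ) from e


# Example with a standard-library class in place of a DocumentStore.
ordered_dict_class = import_class_by_name("collections.OrderedDict")
```

Unlike the copied block, this sketch also wraps the `getattr` lookup, so a missing class name surfaces as the same error type.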

anakin87 · 2024-01-31T09:37:22Z

Hey, @bglearning, thanks for the PR, and sorry for the wait...
I am thinking about some aspects, then I will do the review.

anakin87

Hey, @bglearning!

Generally, this PR looks good (I left a comment in tests).

- The deserialization logic is OK. (At some point, we could isolate it in a utility method.)
- Can you also include this module in the API reference docs configuration? https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml
  For a similar example, refer to Generators.
- Having an example of usage in the docstrings would be great.
  (I also tag @dfokina, who can better review the docstrings.)

test/components/retrievers/test_filter_retriever.py

haystack/components/retrievers/filter_retriever.py

dfokina · 2024-02-02T11:13:02Z

As @anakin87 mentioned it would be great to include a code example in the docstrings, see here for inspiration:

haystack/haystack/components/samplers/top_p.py

Line 23 in 27d0b28

Usage example:

anakin87

Great job, @bglearning!

I have taken the liberty to make some final refinements.
Once the tests pass, I will merge this PR...

Add FilterRetriever draft

6dac324

github-actions bot added the 2.x and type:documentation labels Jan 26, 2024

bglearning requested a review from sjrl January 26, 2024 17:01

bglearning mentioned this pull request Jan 29, 2024

Lazy Run/Evaluation of components based on downstream necessity #6843

Closed

Merge branch 'main' into filter-retriever

922aa0b

github-actions bot added the topic:tests label Jan 30, 2024

Implement FilterRetriever and add tests

d7ad9b9

bglearning force-pushed the filter-retriever branch from 7af8661 to d7ad9b9 Compare January 30, 2024 16:04

bglearning marked this pull request as ready for review January 30, 2024 16:09

bglearning requested review from a team as code owners January 30, 2024 16:09

bglearning requested review from dfokina and anakin87 and removed request for a team January 30, 2024 16:09

Merge branch 'main' into filter-retriever

17d13f3

anakin87 reviewed Jan 31, 2024

test/components/retrievers/test_filter_retriever.py (outdated)

dfokina reviewed Feb 2, 2024

haystack/components/retrievers/filter_retriever.py (outdated)

bglearning added 6 commits February 7, 2024 20:46

Merge branch 'main' into filter-retriever

167df3e

Update comparison to compare whole docs instead of just contents

db789ba

Expose FilterRetriever at the retrievers level

8af7d4c

Update docstring (add example usage)

79774cd

Add filter_retriever in the API reference docs config

1c150d0

Update retriever search path to start one dir level higher

Merge branch 'filter-retriever' of https://github.com/deepset-ai/haystack into filter-retriever

2851704

anakin87 added 2 commits February 8, 2024 08:17

simplify _documents_equal

5bad89a

improve usage example

a2e61fd

anakin87 self-requested a review February 8, 2024 07:29

anakin87 approved these changes Feb 8, 2024

anakin87 merged commit 74683fe into main Feb 8, 2024
23 checks passed

anakin87 deleted the filter-retriever branch February 8, 2024 07:48

dfokina mentioned this pull request Feb 8, 2024

docs: FilterRetriever in 2.x #6959

Closed

bglearning mentioned this pull request Feb 16, 2024

Proposal to add file similarity retriever to haystack #5629

Closed
