Feat: Add FilterRetriever #6836
Pull Request Test Coverage Report for Build 7826222442
💛 - Coveralls |
Notes/decisions from discussion with @sjrl
Q1) Should we have top_k as an argument or not?
Decision: exclude. Reason: Isn't very aligned with the idea of a FilterRetriever. Plus, the underlying document store could handle (or not handle) it differently.
Q2) Runtime input metadata value filtering
Decision: leave out. Reason: This is a convenience feature primarily aimed at supporting the proposed FileSimilarityRetriever.
Q3) Lazy run/evaluation of the component
Decision: Ignore for now/this PR. Reason: Broader topic. Can be tackled separately.
Q4) location:
7af8661 to d7ad9b9
Sidenote: Copied over the class type inference of the document_store:
    try:
        module_name, type_ = init_params["document_store"]["type"].rsplit(".", 1)
        logger.debug("Trying to import %s", module_name)
        module = importlib.import_module(module_name)
    except (ImportError, DeserializationError) as e:
        raise DeserializationError(
            f"DocumentStore of type '{init_params['document_store']['type']}' not correctly imported"
        ) from e
    docstore_class = getattr(module, type_)
Maybe this could be abstracted out somehow if this and the surrounding blocks in from_dict start proliferating.
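One way such an abstraction could look is sketched below. This is only an illustration, not code from this PR: the helper name `import_class_by_name` is hypothetical, and the local `DeserializationError` class is a stand-in for the library's own error type so that the snippet runs on its own.

```python
import importlib
from typing import Any


class DeserializationError(Exception):
    """Stand-in for the library's error type (assumption made for this sketch)."""


def import_class_by_name(fully_qualified_name: str) -> Any:
    """Resolve a dotted path such as 'package.module.ClassName' to the class object.

    Hypothetical helper: it wraps the import-and-getattr steps from the snippet
    above so that each from_dict does not have to repeat them.
    """
    try:
        # A name without any dot makes the unpacking fail with ValueError.
        module_name, type_ = fully_qualified_name.rsplit(".", 1)
        module = importlib.import_module(module_name)
        return getattr(module, type_)
    except (ValueError, ImportError, AttributeError) as e:
        raise DeserializationError(
            f"Class '{fully_qualified_name}' not correctly imported"
        ) from e


# Usage with a standard-library class, so no third-party package is needed.
ordered_dict_cls = import_class_by_name("collections.OrderedDict")
```

Unlike the snippet above, this sketch also converts a missing attribute (AttributeError) into a DeserializationError; whether that is desirable is a design choice left open here.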
Hey, @bglearning, thanks for the PR, and sorry for the wait...
Hey, @bglearning!
Generally, this PR looks good (I left a comment in tests).
- The deserialization logic is OK. (At some point, we could isolate it in a utility method.)
- Can you also include this module in the API reference docs configuration? https://github.com/deepset-ai/haystack/blob/main/docs/pydoc/config/retriever.yml
  For a similar example, refer to Generators.
- Having an example of usage in the docstrings would be great.
(I also tag @dfokina who can better review the docstrings.)
As @anakin87 mentioned, it would be great to include a code example in the docstrings. See here for inspiration:
Update retriever search path to start one dir level higher
…tack into filter-retriever
Great job, @bglearning!
I have taken the liberty to make some final refinements.
Once the tests pass, I will merge this PR...
Proposed Changes:
Porting FilterRetriever from v1.
How did you test it?
Unit tests
Notes for the reviewer
Discussion notes for following points attached later on in this PR:
Q1) Should we have top_k as an argument or not?
Option 1: Don't include. Just return all docs
Option 2.1: Include. Simply return the first top_k docs.
Option 2.2: Include. Apply some sort of seeded random sampling on top.
Option 3: Include at the DocumentStore level, in the DocumentStore.filter_documents protocol. We could add it, but it might bloat the protocol, as perhaps not all DocumentStores support this "sampling with filter".
Leaning towards Option 1, excluding top_k.
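To make the difference between Options 1, 2.1 and 2.2 concrete, here is a minimal sketch. It does not use Haystack: documents are plain dicts, `filter_documents` is a toy stand-in that only understands the "==" operator, and the three `retrieve_*` function names are made up for illustration.

```python
import random
from typing import Any, Dict, List

Doc = Dict[str, Any]


def filter_documents(docs: List[Doc], filters: Dict[str, Any]) -> List[Doc]:
    """Toy stand-in for a document store lookup; only the "==" operator is handled."""
    if filters["operator"] != "==":
        raise ValueError("this sketch only supports the '==' operator")
    return [d for d in docs if d["meta"].get(filters["field"]) == filters["value"]]


def retrieve_all(docs: List[Doc], filters: Dict[str, Any]) -> List[Doc]:
    """Option 1: no top_k, return every matching document."""
    return filter_documents(docs, filters)


def retrieve_first_k(docs: List[Doc], filters: Dict[str, Any], top_k: int) -> List[Doc]:
    """Option 2.1: return the first top_k matches, in store order."""
    return filter_documents(docs, filters)[:top_k]


def retrieve_sampled(
    docs: List[Doc], filters: Dict[str, Any], top_k: int, seed: int = 0
) -> List[Doc]:
    """Option 2.2: seeded random sample of at most top_k matches."""
    matches = filter_documents(docs, filters)
    rng = random.Random(seed)  # a local generator keeps the sampling reproducible
    return rng.sample(matches, min(top_k, len(matches)))


# Ten toy documents; the even-numbered ones are tagged "sports".
DOCS = [
    {"content": f"doc {i}", "meta": {"category": "sports" if i % 2 == 0 else "news"}}
    for i in range(10)
]
SPORTS = {"field": "category", "operator": "==", "value": "sports"}
```

Note that Option 2.1 depends on the order in which the store returns documents, which is one way the behaviour could differ between document stores.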
Q2) Runtime input metadata value filtering
Idea: For an (init-)specified metadata key (e.g. "category"), for convenience we want to allow users to provide a value (e.g. "sports") at runtime. Note: this would be needed, for instance, as part of FileSimilarityRetriever (#5629).
Option 1: Make FilterRetriever flexible enough for this: an attribute like filter_meta_key at __init__ and then run(filter_meta_value: str, ...), which would form the corresponding filter filters = {"field": self.filter_meta_key, "operator": "==", "value": filter_meta_value} and pass it on to the document_store. Issue: increases complexity in the component.
Option 2: Create another retriever, EqualsFilterRetriever, inheriting from FilterRetriever and overriding run.
Option 3: Create an EqualsFilterGenerator to create the filter and connect it to FilterRetriever.filters.
Slightly leaning towards Option 3, though such a component feels too small/specific.
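A rough sketch of what Option 3 might look like follows. It is a plain Python class, not a registered Haystack component, and the helper `build_equals_filter` is a hypothetical name; only the filter dictionary shape is taken from Option 1 above.

```python
from typing import Any, Dict


def build_equals_filter(meta_key: str, meta_value: Any) -> Dict[str, Any]:
    """Build the equality filter described in Option 1 for one metadata key."""
    return {"field": meta_key, "operator": "==", "value": meta_value}


class EqualsFilterGenerator:
    """Sketch of Option 3: the key is fixed at init, the value arrives at runtime."""

    def __init__(self, filter_meta_key: str):
        self.filter_meta_key = filter_meta_key

    def run(self, filter_meta_value: Any) -> Dict[str, Dict[str, Any]]:
        # The "filters" output is what would be connected to FilterRetriever.filters.
        return {"filters": build_equals_filter(self.filter_meta_key, filter_meta_value)}


generator = EqualsFilterGenerator(filter_meta_key="category")
output = generator.run(filter_meta_value="sports")
```

The sketch also shows why the component feels small: all of its logic is the one-line dictionary construction.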
Q3) Lazy run/evaluation of the component
Wondering if there is a way to set up the component so that it only runs if a downstream component needs its output.
E.g. Possible usage: the FilterRetriever is in one of many optional branches. And we would only want to run it if the branch is followed (e.g. for a certain type of query). I guess this can be generalized to any "inputless" component.
One such setup could be:
Here FilterRetriever may run even when it's not necessary.
Currently leaning towards letting this pass.
Q4) location: haystack.components.retrievers.filter_retriever is fine?
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.