filter group on with_lookup points #3970

Open

JosuaKrause opened this issue Apr 4, 2024 · 2 comments

Is your feature request related to a problem? Please describe.

Currently there is no way (that I know of) to filter a group query by the payload of the points linked through with_lookup.
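For illustration, here is a minimal sketch of the current situation in Python with qdrant-client, assuming a hypothetical `snippets` collection whose points carry a `doc_id` payload field that points into a `documents` collection (all names and values here are made up):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Group snippet hits by their parent document and join the document payload.
result = client.search_groups(
    collection_name="snippets",         # hypothetical collection with vectors
    query_vector=[0.1, 0.2, 0.3, 0.4],  # placeholder query vector
    group_by="doc_id",                  # payload field linking snippets to documents
    limit=10,
    group_size=3,
    with_lookup=models.WithLookup(
        collection="documents",         # hypothetical collection with payloads
        with_payload=True,
    ),
)

# query_filter would only filter the "snippets" points; there is no argument
# that filters groups by the payload of the joined "documents" records.
```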

Describe the solution you'd like

Add a separate filter that operates on the with_lookup points.
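One possible shape for this, purely as a sketch of the requested behaviour (the `lookup_filter` argument below does not exist in the current API):

```python
# HYPOTHETICAL: a separate filter applied to the joined "documents" payload.
result = client.search_groups(
    collection_name="snippets",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    group_by="doc_id",
    limit=10,
    group_size=3,
    with_lookup=models.WithLookup(collection="documents", with_payload=True),
    lookup_filter=models.Filter(        # not a real argument, feature request only
        must=[
            models.FieldCondition(key="lang", match=models.MatchValue(value="en"))
        ]
    ),
)
```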

Describe alternatives you've considered

My current workaround is to duplicate the relevant fields on the main points, which somewhat defeats the purpose of using the with_lookup feature.
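As a rough sketch of that workaround (same hypothetical collections as above), the filterable document fields get copied onto every snippet point so the regular query_filter can see them:

```python
# Copy a document payload field (here "lang") onto all of its snippet points
# so the group query's regular query_filter can match on it.
snippet_ids_of_doc = [101, 102, 103]  # hypothetical snippet point ids of one document
client.set_payload(
    collection_name="snippets",
    payload={"lang": "en"},           # duplicated from the parent document
    points=snippet_ids_of_doc,
)

result = client.search_groups(
    collection_name="snippets",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    group_by="doc_id",
    query_filter=models.Filter(
        must=[models.FieldCondition(key="lang", match=models.MatchValue(value="en"))]
    ),
    limit=10,
    group_size=3,
    with_lookup="documents",
)
```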

Additional context

Happy to provide more info / code examples if needed.

generall (Member) commented Apr 4, 2024

Hey @JosuaKrause, this is a reasonable request, but it is very hard to implement in a distributed setup. You would basically need to do a distributed join, which has proved to have very poor performance. It is unlikely that we will implement exactly this option any time soon.

JosuaKrause (Author) commented Apr 4, 2024

That is sad to hear. Besides the workaround I mentioned above (i.e., duplicating payloads), I experimented with two additional workaround strategies:

a) Retrieve groups as-is and manually filter the joined payload afterwards. If the result is now shorter than the required length, repeat the groups query with a larger limit until you have enough results or have exhausted all points.

b) Perform the filter on the linked points first and retrieve the keys/ids used for joining. Then add a filter to the group query that matches those keys/ids.

Both approaches work okay (i.e., some queries take multiple minutes, but they avoid duplicating payload data). If the number of matches is low, a) performs badly, and if the number of matches is high, b) performs badly (in both worst cases we have to scan through all points). For simple filters I can roughly estimate how many points might match, so I use that to decide which strategy to pick. Rough sketches of both strategies are included below.
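The sketches assume the same hypothetical setup as above: a `snippets` collection grouped by a `doc_id` payload field, with a `documents` lookup collection (collection and field names are made up, this is not my exact code):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")


def strategy_a(query_vector, doc_filter_fn, needed=10):
    """a) Over-fetch groups, filter the joined document payload client-side,
    and retry with a doubled limit until enough groups survive."""
    limit = needed
    while True:
        result = client.search_groups(
            collection_name="snippets",
            query_vector=query_vector,
            group_by="doc_id",
            limit=limit,
            group_size=3,
            with_lookup=models.WithLookup(collection="documents", with_payload=True),
        )
        groups = [
            g for g in result.groups
            if g.lookup is not None and doc_filter_fn(g.lookup.payload)
        ]
        # Stop when enough groups pass the filter or no more groups exist.
        if len(groups) >= needed or len(result.groups) < limit:
            return groups[:needed]
        limit *= 2  # doubling keeps the total work roughly linear


def strategy_b(query_vector, doc_filter, needed=10):
    """b) Filter the documents first, collect their ids, then restrict the
    group query to snippets whose doc_id matches one of them."""
    doc_ids = []
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name="documents",
            scroll_filter=doc_filter,
            limit=1000,
            with_payload=False,
            offset=offset,
        )
        doc_ids.extend(p.id for p in points)
        if offset is None:
            break
    return client.search_groups(
        collection_name="snippets",
        query_vector=query_vector,
        group_by="doc_id",
        query_filter=models.Filter(
            must=[models.FieldCondition(key="doc_id", match=models.MatchAny(any=doc_ids))]
        ),
        limit=needed,
        group_size=3,
        with_lookup=models.WithLookup(collection="documents", with_payload=True),
    )
```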

Even though those strategies are not optimal, their speed would improve significantly if they were implemented inside Qdrant:

a) could continue the query if needed without recomputing previous results for each subsequent query (right now I double the limit each time, which gives roughly an O(2n) runtime instead of the quadratic runtime you would get from increasing the limit linearly).

b) could collect the keys/ids internally and use them immediately, without first sending all of them to the client and having the client send them all back. In a distributed setting, each node could compute and use its own keys/ids without sharing them with other nodes, provided the corresponding points of the two collections are sharded the same way.

This could be opt-in, with prominent warnings about performance.

Some info about my DB (not too big, but hopefully it illustrates why duplicating payloads is not ideal): document collection ~9,000 points (no vectors, but payloads), snippet collection ~210,000 points (no payloads, but vectors).
