Wrong facet counts and returned hits when using hybrid search #4494

Open
bb opened this issue Mar 14, 2024 · 2 comments
Labels
needs more info: This issue needs a minimal complete and verifiable example

Comments


bb commented Mar 14, 2024

Describe the bug

I tried vector search for the first time, using the huggingface embedder (BAAI/bge-base-en-v1.5) with v1.7.0 (from the official Docker image).

When doing a hybrid query with any semantic ratio greater than 0 (e.g. 0.1), and as long as I don't filter, I receive all facets with their total absolute counts and all documents instead of a filtered list.

When I also filter, the facets seem to work (though I'm not 100% sure yet), but the returned results are still not what I expected.

To Reproduce

Test Data

I have a data set with 8217 documents.

Facet counts without filtering or searching (excerpt)
Czechia 33
Denmark 15
Estonia 18
Finland 47
France 488
Georgia 1
Germany 592
Greece 202
Hungary 340
Iceland 4
Ireland 24

searching for paris, no filtering

semanticRatio=0
959 total results

Facet counts, without filtering (excerpt)
Czechia 2
Denmark 3
Estonia 3
Finland 4
France 153
Germany 21
Greece 20
Hungary 15
Iceland 1
Ireland 1

semanticRatio>0, e.g. 0.1, 0.2, 0.5, 0.9 etc. (same results as if nothing is searched/filtered)
8217 total results

Facet counts, without filtering (excerpt)
Czechia 33
Denmark 15
Estonia 18
Finland 47
France 488
Georgia 1
Germany 592
Greece 202
Hungary 340
Iceland 4
Ireland 24

Now when I start selecting a facet, I actually see a smaller number of result hits -- this number is the same as the facet value clicked.
E.g. when clicking (filtering) "Germany 592", I get 592 results. All results are from Germany which is good. Overall, this is better, but not the expected behavior.

Interestingly, now the facet counts are changed:

Czechia 2
Denmark 3
Estonia 3
Finland 4
France 153
Germany 21
Greece 20
Hungary 15
Iceland 1
Ireland 1

This is basically the same as with semanticRatio=0.
It's definitely not what I expected but it's better than no filtering of facet values.

It's inconsistent to have Germany 21 here but Germany 592 in the list. Those should be the same.

Not sure if it's relevant here, but this is using the multi-search endpoint.
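For reference, a request of the kind described above could be built as in the following sketch. The index name `places`, the facet attribute `country`, and the embedder name `default` are assumptions made for illustration and are not taken from the report; `hybrid.semanticRatio`, `facets`, and `filter` are the search parameters under discussion.

```python
import json


def build_multi_search_body(query, semantic_ratio, country=None):
    """Build a body for POST /multi-search containing one hybrid query.

    Assumed names (not from the report): index "places",
    facet attribute "country", embedder "default".
    """
    search = {
        "indexUid": "places",
        "q": query,
        "facets": ["country"],
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": "default"},
    }
    if country is not None:
        # Clicking a facet value in the UI turns into a filter expression.
        search["filter"] = f'country = "{country}"'
    return {"queries": [search]}


if __name__ == "__main__":
    # Unfiltered hybrid query, then the same query with a facet selected.
    print(json.dumps(build_multi_search_body("paris", 0.1), indent=2))
    print(json.dumps(build_multi_search_body("paris", 0.1, country="Germany"), indent=2))
```

Posting both bodies and comparing `facetDistribution` and the total hit count would give the minimal example the label asks for.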

Expected behavior

With hybrid search, facets should work basically the same as without hybrid search. Of course I expect a few more or fewer results depending on the query, and slightly different facet counts, but not all results and the total facet counts.

Meilisearch version:
v1.7.0

Additional context
Docker for Desktop, macOS

@dureuill
Contributor

Hello @bb 👋

The behavior is expected here. As all your documents have an embedding, they are all candidates for the query.

Another way of seeing semantic search is that it is sorting the documents that have an embedding according to their similarity with the embedding of the query.
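To make the "sorting" view concrete, here is a toy sketch, not the engine's implementation: the documents, the 2-dimensional vectors, and the function names are all made up for illustration. It shows that ranking by similarity reorders documents but never removes any, so facet counts computed over the candidates equal the index totals.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(docs, query_vec):
    """Return every document that has an embedding, most similar first."""
    with_vec = [d for d in docs if d.get("vec") is not None]
    return sorted(with_vec, key=lambda d: cosine(d["vec"], query_vec), reverse=True)


def facet_counts(hits, attr):
    """Count how many hits carry each value of the given attribute."""
    counts = {}
    for d in hits:
        counts[d[attr]] = counts.get(d[attr], 0) + 1
    return counts


# Toy index: every document has an embedding (illustrative values only).
DOCS = [
    {"id": 1, "country": "France", "vec": [1.0, 0.0]},
    {"id": 2, "country": "Germany", "vec": [0.0, 1.0]},
    {"id": 3, "country": "Germany", "vec": [0.6, 0.8]},
]

hits = semantic_search(DOCS, [1.0, 0.1])
print([d["id"] for d in hits])        # order changes with the query
print(facet_counts(hits, "country"))  # counts stay at the index totals
```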

We have plans to add a score threshold such that documents under the threshold don't appear in the results or facet counts, would that help your use case?

> Now, when I start selecting a facet, I actually see a smaller number of result hits -- this number is the same as the facet value clicked.

Yes, selecting a facet adds a filter, which removes documents that don't match the filter from the results.

> Interestingly, now the facet counts are changed: [...] This is basically the same as with semanticRatio=0.

Now I'm not sure I understand the behavior here. Could you indicate the search request that resulted in the facet behavior being the same as with semanticRatio=0?

When doing a hybrid search, we take the candidates to be the union of the candidates of the semantic search and the candidates of the keyword search. This should result in all the documents matching the filter, or all the documents of the index when there is no filter.
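The union can be pictured with plain sets. This is a simplified model with made-up document ids and a hypothetical function name, not the actual code path:

```python
def hybrid_candidates(keyword_ids, embedded_ids, filter_ids=None):
    """Union of keyword and semantic candidates, restricted by an optional filter."""
    candidates = set(keyword_ids) | set(embedded_ids)
    if filter_ids is not None:
        candidates &= set(filter_ids)
    return candidates


# Toy index of 10 documents where every document has an embedding.
all_ids = set(range(1, 11))
keyword_matches = {2, 5}   # documents matching the query keywords
germany = {5, 6, 7}        # documents matching the filter

print(len(hybrid_candidates(keyword_matches, all_ids)))        # whole index
print(sorted(hybrid_candidates(keyword_matches, all_ids, germany)))  # exactly the filtered set
```

In this model the candidate count equals the index size without a filter and the size of the filtered set with one, which matches the 8217 and 592 totals reported above.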

That said, in the hybrid search, we don't run the semantic search if we get a sufficient number of results from the keyword search that are relevant enough (ranking score is high enough). With a low semantic ratio, this can easily happen, so that might have been the case with your query, explaining the shift in behavior you noticed.
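As a rough illustration of that shortcut, consider the sketch below. The exact criterion is not spelled out in this thread, so `limit` and `good_score` are hypothetical parameters chosen only to show the shape of the decision.

```python
def needs_semantic_search(keyword_scores, limit, good_score):
    """Decide whether the semantic pass is still needed.

    Hypothetical rule: skip it when the keyword search already produced
    at least `limit` results with a ranking score >= `good_score`.
    """
    good = [s for s in keyword_scores if s >= good_score]
    return len(good) < limit


# Plenty of strong keyword hits: the semantic pass is skipped, so the
# candidates (and facet counts) look like a pure keyword search.
print(needs_semantic_search([0.95] * 20, limit=20, good_score=0.9))

# Few strong keyword hits: the semantic pass runs and every embedded
# document becomes a candidate.
print(needs_semantic_search([0.95, 0.4], limit=20, good_score=0.9))
```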

I hope that clarifies the behavior you've been seeing!

@curquiza added the "needs more info" label on Mar 19, 2024
Contributor

dureuill commented May 7, 2024

Hello,

I released a prototype to add the rankingScoreThreshold to search queries, which allows removing results whose _rankingScore is under the specified threshold: #4609 (comment)

Unfortunately, its effect on the facet counts is not as good as planned. See the linked thread for details. We're looking for feedback on this prototype and, in particular, whether it serves the issue here.
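A small sketch of what the threshold does to a list of hits follows. The hit data and the threshold value 0.5 are made up, and the prototype applies the cut server-side; this only models the described semantics of dropping hits whose `_rankingScore` is under the threshold.

```python
def apply_ranking_score_threshold(hits, threshold):
    """Keep only hits whose _rankingScore is at or above the threshold."""
    return [h for h in hits if h["_rankingScore"] >= threshold]


# Illustrative request fragment: rankingScoreThreshold is the prototype's
# parameter; 0.5 is an arbitrary example value.
request = {"q": "paris", "showRankingScore": True, "rankingScoreThreshold": 0.5}

# Made-up hits, as they might come back with showRankingScore enabled.
hits = [
    {"id": 1, "country": "France", "_rankingScore": 0.92},
    {"id": 2, "country": "Germany", "_rankingScore": 0.48},
    {"id": 3, "country": "Germany", "_rankingScore": 0.61},
]

kept = apply_ranking_score_threshold(hits, request["rankingScoreThreshold"])
print([h["id"] for h in kept])
```

Whether facet counts follow the same cut is the open question mentioned above; this sketch says nothing about it.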


3 participants