Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get better cost estimate on MultiTermQuery over few terms #13201

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

msfroh
Copy link
Contributor

@msfroh msfroh commented Mar 22, 2024

Description

The existing cost estimate for multi-term queries was taking the worst-case scenario (e.g the query matches every term in the field), to provide an upper bound on the cost.

In cases where the MultiTermQuery matches a small number of terms, we can provide a tighter estimate by summing the doc frequencies over those terms.

Fixes #13029

The existing cost estimate for multi-term queries was taking the
worst-case scenario (e.g the query matches every term in the field),
to provide an upper bound on the cost.

In cases where the MultiTermQuery matches a small number of terms, we
can provide a tighter estimate by summing the doc frequencies over
those terms.
private long estimateCost(Terms terms, long queryTermsCount) throws IOException {

// Try to explicitly compute the cost, but only if the term count is sufficiently small.
if (queryTermsCount < MAX_TERMS_TO_COUNT) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If queryTermsCount is unknown (i.e. -1), we'll also enter here, I guess this is intended, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct -- I thought about making that more explicit, but opted to take advantage of the "fall through" logic.

@@ -292,7 +292,21 @@ public long cost() {
};
}

private static long estimateCost(Terms terms, long queryTermsCount) throws IOException {
private static final int MAX_TERMS_TO_COUNT = 128;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stupid question: would it be feasible / desirable to make this value somehow configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great question! Maybe we could pass it as a constructor parameter, and the constant RewriteMethods in MultiTermQuery could specify the default?

I host a weekly OpenSearch Lucene Study Group on Zoom and we spent this week's meetup talking about this PR (and your issue): https://forum.opensearch.org/t/opensearch-lucene-study-group-meeting-monday-march-25th-2024/18547/3

There were some nice ideas that came up there around how to pick something better than an arbitrary limit -- like maybe using a time threshold instead. (But would that be a threshold per-clause?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was indeed an interesting discussion. Thanks for sharing @msfroh !

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation
2 participants