
Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation #13029

Open · rquesada-tibco opened this issue Jan 24, 2024 · 2 comments · May be fixed by #13201


rquesada-tibco commented Jan 24, 2024

Description

We recently discovered a performance degradation in our project when upgrading from Lucene 9.4 to 9.9. The cause seems to be a side effect of c6667e7 and 3809106.

The situation is as follows: we have a WildcardQuery and a TermInSetQuery that are AND-combined within a BooleanQuery. This structure gets executed repeatedly, rather like a nested loop where the WildcardQuery stays the same but the TermInSetQuery keeps changing its terms. In the old version this was fast because the WildcardQuery was cached in the LRUQueryCache. In the new version it no longer is, so the execution time of our scenario has increased.

Why is our WildcardQuery no longer cached? It boils down to this line in LRUQueryCache, where the cache operation is skipped if the cost estimate is too high:

final ScorerSupplier supplier = in.scorerSupplier(context);
...
final long cost = supplier.cost();
...
// skip cache operation which would slow query down too much
if (cost / skipCacheFactor > leadCost) {
...
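To make the skip-cache check above concrete, here is a minimal, self-contained sketch of it. The numbers and the skipCacheFactor value are illustrative assumptions, not taken from Lucene's defaults or from this issue:

```java
public class SkipCacheCheck {
    // Mirrors the shape of LRUQueryCache's check: the query is not cached
    // when its estimated cost, scaled down by skipCacheFactor, still
    // exceeds the lead cost of the other clauses.
    static boolean skipsCache(long cost, long skipCacheFactor, long leadCost) {
        return cost / skipCacheFactor > leadCost;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a WildcardQuery whose cost is overestimated
        // as sumDocFreq (say 1,000,000) next to a TermInSetQuery lead cost of 10.
        System.out.println(skipsCache(1_000_000L, 250L, 10L)); // true: caching skipped
        // With the true cost of 1 (the wildcard matches one doc), it would be cached.
        System.out.println(skipsCache(1L, 250L, 10L));         // false: cached
    }
}
```

This is why an overestimated cost directly translates into the query never entering the cache.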

Before the upgrade to 9.9, that cost was provided by a ConstantScoreWeight returned by the old MultiTermQueryConstantScoreWrapper (which the default RewriteMethod returned). That weight relied on the default Weight#scorerSupplier implementation, so the cost was simply scorer.iterator().cost(); in our case the WildcardQuery matches just one document, so the cost was 1.

After the upgrade, the default RewriteMethod has changed, and this cost is now provided by AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier here. For that purpose a private estimateCost method was introduced, which bases the estimation on the MultiTermQuery#getTermsCount value. The problem is that for our WildcardQuery (in fact for any subclass of AutomatonQuery) this value is unknown, i.e. -1, so estimateCost just returns terms.getSumDocFreq(). That is a clear overestimation in our case, so it prevents the caching and leads to the performance degradation.

I understand that I can work around this by writing a customized RewriteMethod.
The question is: could we improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost so that, if the MTQ cannot provide a term count (getTermsCount() == -1), we return scorer.iterator().cost() instead?
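A minimal sketch of what that fallback could look like. The interfaces below are stand-ins for the Lucene types involved (MultiTermQuery, Terms, DocIdSetIterator), and the method is a simplification of the real estimateCost logic, which is more nuanced:

```java
public class CostFallbackSketch {
    // Stand-ins for the Lucene objects involved; names are illustrative.
    interface Query { int getTermsCount(); }       // -1 when unknown, as for AutomatonQuery
    interface Terms { long getSumDocFreq(); }
    interface CostIterator { long cost(); }

    // Sketched behavior: when the term count is unknown (-1), fall back to
    // the iterator's cost, as the pre-9.9 code effectively did, instead of
    // the pessimistic sumDocFreq.
    static long estimateCost(Query q, Terms t, CostIterator it) {
        if (q.getTermsCount() == -1) {
            return it.cost();
        }
        return t.getSumDocFreq(); // simplified: the real method does more here
    }

    public static void main(String[] args) {
        Query wildcard = () -> -1;         // AutomatonQuery subclasses report -1
        Terms terms = () -> 1_000_000L;    // large sumDocFreq over the field
        CostIterator iter = () -> 1L;      // the wildcard actually matches one doc
        System.out.println(estimateCost(wildcard, terms, iter)); // 1
    }
}
```

With a cost of 1 instead of sumDocFreq, the skip-cache check in LRUQueryCache would no longer reject the query.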

msfroh (Contributor) commented Mar 21, 2024

I think the else clause for the cost estimate is also not great.

I came across this same problem where a user was essentially running a single-term TermInSetQuery (that actually matched a single doc) AND a numeric range query that matches millions of docs. I was shocked when profiler output showed a lot of time spent in the BKDReader -- the IndexOrDocValuesQuery wrapping the range query should clearly have taken the doc-values path.

Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this estimateCost() function will say that a TermInSetQuery over a single term has cost 10,000,001 (i.e. 20 million sumDocFreq minus 10 million for the terms size, plus 1 for the query term count).

I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single TermQuery with cost <= 3.
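The arithmetic in the scenario above, under my reading of estimateCost() when the query term count is known (estimate = sumDocFreq - termsSize + queryTermCount), works out like this:

```java
public class EstimateExample {
    // My reading of estimateCost() when the query term count is known:
    // sumDocFreq - termsSize + queryTermCount.
    static long estimate(long sumDocFreq, long termsSize, long queryTermCount) {
        return sumDocFreq - termsSize + queryTermCount;
    }

    public static void main(String[] args) {
        // 10M distinct terms matching 20M docs, single-term TermInSetQuery:
        System.out.println(estimate(20_000_000L, 10_000_000L, 1L)); // 10000001
    }
}
```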


msfroh (Contributor) commented Mar 22, 2024

I get that part of the point of this cost estimate is to avoid the (potentially expensive) rewrite when, e.g., we can do a doc-values rewrite instead, but I'm thinking we could do something a little more term-aware.

How expensive is MultiTermQuery#getTermsEnum()? I think it's cheap, but I'm not positive. Maybe we could get it, walk through it summing doc freqs, and give up if the number of terms gets large.

We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
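The term-aware idea described above could be sketched roughly like this. The list of doc freqs stands in for walking the real TermsEnum, and the threshold and names are illustrative, not a proposed API:

```java
import java.util.List;

public class TermAwareEstimate {
    // Sketch: sum the doc freqs of the terms the query actually matches,
    // bailing out to a pessimistic estimate once we have seen too many terms.
    static long estimate(List<Long> matchedTermDocFreqs,
                         int maxTermsToVisit,
                         long pessimisticCost) {
        long sum = 0;
        int visited = 0;
        for (long docFreq : matchedTermDocFreqs) {
            if (++visited > maxTermsToVisit) {
                return pessimisticCost; // too many terms: give up and fall back
            }
            sum += docFreq;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single-term query over a term with doc freq 3 gets a tiny, accurate cost...
        System.out.println(estimate(List.of(3L), 128, 10_000_001L));         // 3
        // ...while a query expanding past the visit budget falls back.
        System.out.println(estimate(List.of(1L, 1L, 1L), 2, 10_000_001L));   // 10000001
    }
}
```

The bail-out keeps the estimation itself from becoming expensive for queries that expand to many terms.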
