
Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation #13029

Open · rquesada-tibco opened this issue Jan 24, 2024 · 2 comments · May be fixed by #13201


rquesada-tibco commented Jan 24, 2024

Description

We recently discovered a performance degradation in our project when upgrading from Lucene 9.4 to 9.9. The cause seems to be a side effect of c6667e7 and 3809106.

The situation is as follows: we have a WildcardQuery and a TermInSetQuery that are AND-combined within a BooleanQuery. This structure gets executed repeatedly, rather like a nested loop where the WildcardQuery stays the same but the TermInSetQuery keeps changing its terms. In the old version this was fast because the WildcardQuery was cached in the LRUQueryCache. In the new version it no longer is, so the execution time of our scenario has increased.

Why is our WildcardQuery no longer cached? It boils down to this line in LRUQueryCache, where the cache operation is skipped if the cost estimate is too high:

final ScorerSupplier supplier = in.scorerSupplier(context);
...
final long cost = supplier.cost();
...
// skip cache operation which would slow query down too much
if (cost / skipCacheFactor > leadCost) {
...
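To make the skip-cache check above concrete, here is a minimal, self-contained sketch of it. The numbers and the skipCacheFactor value are illustrative assumptions, not taken from Lucene's defaults or from this issue:

```java
public class SkipCacheCheck {
    // Mirrors the shape of LRUQueryCache's check: the query is not cached
    // when its estimated cost, scaled down by skipCacheFactor, still
    // exceeds the lead cost of the other clauses.
    static boolean skipsCache(long cost, long skipCacheFactor, long leadCost) {
        return cost / skipCacheFactor > leadCost;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a WildcardQuery whose cost is overestimated
        // as sumDocFreq (say 1,000,000) next to a TermInSetQuery lead cost of 10.
        System.out.println(skipsCache(1_000_000L, 250L, 10L)); // true: caching skipped
        // With the true cost of 1 (the wildcard matches one doc), it would be cached.
        System.out.println(skipsCache(1L, 250L, 10L));         // false: cached
    }
}
```

This is why an overestimated cost directly translates into the query never entering the cache.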

Before the upgrade to 9.9, that cost was provided by a ConstantScoreWeight returned by the old MultiTermQueryConstantScoreWrapper (which the default RewriteMethod returned). That weight relied on the default Weight#scorerSupplier implementation, so the cost was simply scorer.iterator().cost(); in our case the WildcardQuery matches just one document, so the cost was 1.

After the upgrade, the default RewriteMethod has changed, and this cost is now provided by AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier here. For that purpose a private estimateCost method was introduced, which bases the estimation on the MultiTermQuery#getTermsCount value. The problem is that for our WildcardQuery (in fact for any subclass of AutomatonQuery) this value is unknown, i.e. -1, so estimateCost just returns terms.getSumDocFreq(). That is a clear overestimation in our case, so it prevents the caching and leads to the performance degradation.

I understand that I can work around this by writing a customized RewriteMethod.
The question is: could we improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost so that, if the MTQ cannot provide a term count (getTermsCount() == -1), we return scorer.iterator().cost() instead?
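A minimal sketch of what that fallback could look like. The interfaces below are stand-ins for the Lucene types involved (MultiTermQuery, Terms, DocIdSetIterator), and the method is a simplification of the real estimateCost logic, which is more nuanced:

```java
public class CostFallbackSketch {
    // Stand-ins for the Lucene objects involved; names are illustrative.
    interface Query { int getTermsCount(); }       // -1 when unknown, as for AutomatonQuery
    interface Terms { long getSumDocFreq(); }
    interface CostIterator { long cost(); }

    // Sketched behavior: when the term count is unknown (-1), fall back to
    // the iterator's cost, as the pre-9.9 code effectively did, instead of
    // the pessimistic sumDocFreq.
    static long estimateCost(Query q, Terms t, CostIterator it) {
        if (q.getTermsCount() == -1) {
            return it.cost();
        }
        return t.getSumDocFreq(); // simplified: the real method does more here
    }

    public static void main(String[] args) {
        Query wildcard = () -> -1;         // AutomatonQuery subclasses report -1
        Terms terms = () -> 1_000_000L;    // large sumDocFreq over the field
        CostIterator iter = () -> 1L;      // the wildcard actually matches one doc
        System.out.println(estimateCost(wildcard, terms, iter)); // 1
    }
}
```

With a cost of 1 instead of sumDocFreq, the skip-cache check in LRUQueryCache would no longer reject the query.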

msfroh (Contributor) commented Mar 21, 2024

I think the else clause for the cost estimate is also not great.

I came across this same problem where a user was essentially running a single-term TermInSetQuery (that actually matched a single doc) AND a numeric range query that matches millions of docs. I was shocked when profiler output showed a lot of time spent in the BKDReader -- the IndexOrDocValuesQuery wrapping the range query should clearly have taken the doc-values path.

Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). My read is that this estimateCost() function will say that a TermInSetQuery over a single term has cost 10,000,001 (i.e. 20 million sumDocFreq minus 10 million for the terms size, plus 1 for the query term count).

I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but this feels silly as a cost estimate for a query that is just going to rewrite to a single TermQuery with cost <= 3.
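The arithmetic in the scenario above, under my reading of estimateCost() when the query term count is known (estimate = sumDocFreq - termsSize + queryTermCount), works out like this:

```java
public class EstimateExample {
    // My reading of estimateCost() when the query term count is known:
    // sumDocFreq - termsSize + queryTermCount.
    static long estimate(long sumDocFreq, long termsSize, long queryTermCount) {
        return sumDocFreq - termsSize + queryTermCount;
    }

    public static void main(String[] args) {
        // 10M distinct terms matching 20M docs, single-term TermInSetQuery:
        System.out.println(estimate(20_000_000L, 10_000_000L, 1L)); // 10000001
    }
}
```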


msfroh (Contributor) commented Mar 22, 2024

I get that part of the point of this cost estimate is to avoid the (potentially expensive) rewrite when, e.g., we can do a doc-values rewrite instead, but I'm thinking we could do something a little more term-aware.

How expensive is MultiTermQuery#getTermsEnum()? I think it's cheap, but I'm not positive. Maybe we could get it, walk through it summing doc freqs, and give up if the number of terms gets large.

We could go down that path only if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
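The term-aware idea described above could be sketched roughly like this. The list of doc freqs stands in for walking the real TermsEnum, and the threshold and names are illustrative, not a proposed API:

```java
import java.util.List;

public class TermAwareEstimate {
    // Sketch: sum the doc freqs of the terms the query actually matches,
    // bailing out to a pessimistic estimate once we have seen too many terms.
    static long estimate(List<Long> matchedTermDocFreqs,
                         int maxTermsToVisit,
                         long pessimisticCost) {
        long sum = 0;
        int visited = 0;
        for (long docFreq : matchedTermDocFreqs) {
            if (++visited > maxTermsToVisit) {
                return pessimisticCost; // too many terms: give up and fall back
            }
            sum += docFreq;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A single-term query over a term with doc freq 3 gets a tiny, accurate cost...
        System.out.println(estimate(List.of(3L), 128, 10_000_001L));         // 3
        // ...while a query expanding past the visit budget falls back.
        System.out.println(estimate(List.of(1L, 1L, 1L), 2, 10_000_001L));   // 10000001
    }
}
```

The bail-out keeps the estimation itself from becoming expensive for queries that expand to many terms.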
