Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation #13029
Comments
I came across this same problem, where a user was essentially running a single-term wildcard query. Let's say you have a string field with 10 million distinct values (so 10 million terms), and they match 20 million documents (with individual terms matching 1-3 docs, say). I get that the absolute worst case is that 9,999,999 terms each have doc freq 1 and the remaining term has doc freq 10,000,001, but that feels silly as a cost estimate for a query that is just going to rewrite to a single term.
I get that part of the point of this cost estimate is to avoid the (potentially expensive) rewrite if, e.g., we can do a doc-values rewrite instead, but I'm thinking we could do something a little bit more term-aware, and only go down that path if the cost estimate from the existing logic is very high. I can try sketching out a PR with a test for it.
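The term-aware idea suggested above can be sketched as a small, self-contained model (plain Java, not actual Lucene code; the class name, method signature, and the visit limit are made up for illustration): visit at most a handful of the terms the query actually matches and sum their doc freqs, falling back to the pessimistic field-wide `sumDocFreq` bound only when the query expands to more terms than we are willing to visit.

```java
import java.util.Arrays;

/**
 * Self-contained sketch (not actual Lucene code) of a more term-aware cost
 * estimate for a multi-term query. Instead of always returning the field-wide
 * sumDocFreq when the term count is unknown, visit at most a few of the
 * matched terms and sum their doc freqs.
 */
public class TermAwareCostSketch {

    /**
     * @param matchedDocFreqs doc freq of each term the query matches
     *                        (a stand-in for walking the terms enumeration)
     * @param fieldSumDocFreq field-wide sum of doc freqs (pessimistic bound)
     * @param maxVisitedTerms how many terms we are willing to visit up front
     */
    public static long estimateCost(long[] matchedDocFreqs,
                                    long fieldSumDocFreq,
                                    int maxVisitedTerms) {
        if (matchedDocFreqs.length > maxVisitedTerms) {
            // Too many terms to visit cheaply: keep the pessimistic bound.
            return fieldSumDocFreq;
        }
        // Few terms: the exact sum of their postings lengths is a much
        // better estimate than the field-wide bound.
        return Arrays.stream(matchedDocFreqs).sum();
    }

    public static void main(String[] args) {
        // A wildcard query that happens to match a single rare term on a
        // field with 20M postings: the estimate drops from 20,000,000 to 1.
        System.out.println(estimateCost(new long[] {1L}, 20_000_000L, 16));
    }
}
```

The visit limit keeps the estimate cheap: only queries that expand to very few terms pay the cost of enumerating them, and everything else keeps the current behavior.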
Description
We recently discovered a performance degradation in our project when going from Lucene 9.4 to 9.9. The cause seems to be a side effect of c6667e7 and 3809106.

The situation is as follows: we have a `WildcardQuery` and a `TermInSetQuery` which are AND-combined (within a `BooleanQuery`). This structure gets executed repeatedly, kind of like a nested loop where the `WildcardQuery` remains the same but the `TermInSetQuery` keeps changing its terms. In the old version this was fast because the `WildcardQuery` was cached within the `LRUQueryCache`. In the new version, however, this is no longer the case, so the execution time of our scenario has increased.

Why is our `WildcardQuery` not cached any more? It boils down to this line in `LRUQueryCache`, where the cache operation won't happen if the cost estimation is too high.

Before the upgrade to 9.9, that cost was provided by a `ConstantScoreWeight` returned by the old `MultiTermQueryConstantScoreWrapper` (which was returned by the default `RewriteMethod`), and in the end it was just based on the default `Weight#scorerSupplier` implementation: basically the cost was `scorer.iterator().cost()`, and in our case the `WildcardQuery` matches just one document, so the cost was 1.

After the upgrade, the default `RewriteMethod` has changed, and this cost is now provided by `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier`, where a private `estimateCost` method was introduced that bases the estimation on the `MultiTermQuery#getTermsCount` value. The problem is that for our `WildcardQuery` (in fact, for any subclass of `AutomatonQuery`) this value is unknown, i.e. `-1`, so the `estimateCost` method just returns `terms.getSumDocFreq()`. In our case this is a gross overestimation, so it prevents caching and leads to the performance degradation.

I understand that I can fix this situation by writing my own customized `RewriteMethod`. The question is: could we improve `AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight#scorerSupplier#cost` so that, if the MTQ cannot provide a term count (`getTermsCount() == -1`), it returns `scorer.iterator().cost()` instead?
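The fallback proposed in this issue can be sketched as a small decision function. This is a self-contained model, not the actual Lucene implementation: the parameter names are illustrative, and the arithmetic used in the known-term-count branch is a simplification of whatever the real `estimateCost` does.

```java
/**
 * Simplified model (not actual Lucene code) of the cost estimate discussed in
 * this issue, plus the proposed fallback for an unknown term count. The
 * known-term-count arithmetic below is illustrative only.
 */
public class CostEstimateSketch {

    /**
     * @param queryTermsCount    number of terms the MTQ expands to, or -1 if
     *                           unknown (as for WildcardQuery / AutomatonQuery)
     * @param fieldSumDocFreq    field-wide sum of doc freqs (pessimistic bound)
     * @param avgPostingsPerTerm average postings-list length for the field
     * @param iteratorCost       cost of the rewritten query's iterator, i.e.
     *                           what scorer.iterator().cost() would report
     */
    public static long estimateCost(long queryTermsCount,
                                    long fieldSumDocFreq,
                                    long avgPostingsPerTerm,
                                    long iteratorCost) {
        if (queryTermsCount == -1) {
            // Proposed change: instead of returning fieldSumDocFreq (which
            // blocks caching for any selective WildcardQuery), fall back to
            // the exact iterator cost of the rewritten query.
            return iteratorCost;
        }
        // Known term count: bound the estimate by term count times the
        // average postings length, never exceeding the field-wide sum.
        return Math.min(fieldSumDocFreq, queryTermsCount * avgPostingsPerTerm);
    }

    public static void main(String[] args) {
        // WildcardQuery matching one document on a field with 20M postings:
        // the estimate becomes 1 instead of 20,000,000, so the query is
        // cheap enough for the LRUQueryCache again.
        System.out.println(estimateCost(-1, 20_000_000L, 2L, 1L));
    }
}
```

The trade-off, as noted in the comment above, is that obtaining the iterator cost requires performing the rewrite, which the current estimate deliberately avoids; the fallback would therefore only pay off when the pessimistic estimate is high enough to disable caching.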