Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter by unordered matches #13315

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

scampi
Copy link
Contributor

@scampi scampi commented Apr 17, 2024

  • test: add unit tests to reproduce the IndexOutOfBoundsException in DefaultPassageFormatter
  • doc: clarify javadoc of MatchesIterator
  • fix: sort passages by offset if positions are missing

Fix #12431

Description

DefaultPassageFormatter may cause an IndexOutOfBoundsException in case matches of a passage are out of order.
This happens when term vectors are stored but not the positions.

The tentative solution consists in ordering matches according to offsets in DisjunctionMatchesIterator in case positions don't exist.
However, I wonder if it is correct to create such an iterator when positions are not available. Matches with no terms are explicitly removed, and the javadoc of MATCH_WITH_NO_TERMS mentions that it indicates a match with no term positions. Therefore, from that doc it shouldn't be possible to highlight text without positions. Is that javadoc correct ? Should it be updated ?

List<Matches> sm = subMatches.stream().filter(m -> m != MATCH_WITH_NO_TERMS).toList();

/**
* Indicates a match with no term positions, for example on a Point or DocValues field, or a field
* indexed as docs and freqs only
*/
public static final Matches MATCH_WITH_NO_TERMS =

@scampi scampi marked this pull request as ready for review May 6, 2024 13:13
Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 21, 2024
} catch (IOException e) {
throw new IllegalArgumentException("Failed to retrieve term offset", e);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that start/endOffset can throw IOException but start/endPosition do not :) Nothing to fix in this PR, just noting the inconsistency ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the work of loading positions is always done in next(), so there's no IO involved when calling startPosition or endPosition. But we only load offsets when specifically asked, so IO may be involved when calling those methods. It does look a bit funky though, I agree!

@mikemccand
Copy link
Member

What a fun and tricky corner case -- thank you @scampi for uncovering this, showing the bug with the added unit tests, and the tentative fix.

I think it is actually technically allowed to index term vectors with offsets only (no positions) and then hilite from there: Lucene is recording the character offsets of every occurence of each unique term, per document, and that's all that is needed to hilite. So I like your fix -- I'll merge it soon unless anyone objects.

Could you maybe add a lucene/CHANGES.txt entry?

I think the intent of MATCH_WITH_NO_TERMS is different: that is for queries against Lucene index structures that fundamenally do not track positional information (doc values, points). With term vectors, the positional information indeed exists, it's just that the user may choose to record only offsets.

} catch (IOException e) {
throw new IllegalArgumentException("Failed to retrieve term offset", e);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the work of loading positions is always done in next(), so there's no IO involved when calling startPosition or endPosition. But we only load offsets when specifically asked, so IO may be involved when calling those methods. It does look a bit funky though, I agree!

@github-actions github-actions bot removed the Stale label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants