Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

Open
benwtrent opened this issue Apr 4, 2024 · 15 comments

Comments

@benwtrent
Copy link
Member

Description

Haven't finished bisection to figure out the origin of the bug.

org.apache.lucene.analysis.tests.TestRandomChains > testRandomChains FAILED
    java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=1: 8 vs 11; token=ڎbdb觸 Ǧ
        at __randomizedtesting.SeedInfo.seed([3AEBBB09EA70DCE:3E4F92D1D9B5100E]:0)
        at [email protected]/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:164)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1289)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1187)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1081)
        at [email protected]/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains(TestRandomChains.java:945)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  2> stage 0: ڎ<[1-2] +1> d<[4-5] +1> 觸<[7-8] +1> Ǧ<[10-11] +1>
  2> stage 1: ڎ<[1-2] +1> d<[4-5] +1> 觸<[7-8] +1> Ǧ<[10-11] +1>
  2> stage 2: ڎ<[1-2] +1> ڎ d<[1-5] +0> d<[4-5] +1> d 觸<[4-8] +0> 觸<[7-8] +1> 觸 Ǧ<[7-11] +0>
  2> last stage: ڎbdb觸<[1-8] +1> ڎbdb觸 Ǧ<[1-11] +0>
  2> TEST FAIL: useCharFilter=true text='\u0003\ufb87\udacd\uddf6d\uf1f4\u02e6\u89f8\uda06\udfd2\u01e6'
  2> Exception from random analyzer:
  2> charfilters=
  2>   org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter(java.io.StringReader@24c09603, com.ibm.icu.impl.Norm2AllModes$ComposeNormalizer2@43b7067e)
  2> tokenizer=
  2>   org.apache.lucene.analysis.th.ThaiTokenizer(org.apache.lucene.util.AttributeFactory$1@a0c154ba)
  2> filters=
  2>   org.apache.lucene.analysis.no.NorwegianMinimalStemFilter(ValidatingTokenFilter@215f24e9 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false)
  2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@47f9add2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false, fg)
  2>   org.apache.lucene.analysis.shingle.FixedShingleFilter(ValidatingTokenFilter@71cd79a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false, 3, b, )
   >     java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=1: 8 vs 11; token=ڎbdb觸 Ǧ

Gradle command to reproduce

./gradlew test --tests TestRandomChains.testRandomChains -Dtests.seed=3AEBBB09EA70DCE -Dtests.nightly=true -Dtests.locale=en-CA -Dtests.timezone=America/Cordoba -Dtests.asserts=true -Dtests.file.encoding=UTF-8
@benwtrent
Copy link
Member Author

git bisect says: 6cba773

It isn't immediately obvious why to me. @original-brownbear what do you think?

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

@benwtrent how long ago did you see this test failure?

I can't reproduce it with the latest main branch which has the ICU upgrade.
If i go back one commit to before the ICU upgrade to 342c814, then the test fails.

So I think ICU fixed the bug... i'll try to dig it up, I'm fairly certain we reported a bug related to this.

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

I can't find the specific bug, but we've reported several such issues to them in the past. I'm having trouble with their JIRA...

Either way, fails on 342c814 but succeeds on 54a2e11, so I think the bug is fixed?

@benwtrent
Copy link
Member Author

@rmuir

I can't reproduce it with the latest main branch which has the ICU upgrade.

Oh dang, I might not have fetched latest 🤦 Let me try again.

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

I just reproduced it on main too :(

So I think the problem is that it doesn't always reproduce: this likely caused your git-bisect confusion. But at least it isn't NEWLY introduced by the icu upgrade, it is a pre-existing condition...

@benwtrent
Copy link
Member Author

benwtrent commented Apr 4, 2024

OK, I simply commented out @original-brownbear 's change and always returned the meta and the test passes with the seed and settings. When I return the cached instance, it fails.

I noticed that the offsets are not 0. I don't know if that is causing some weirdness. I don't fully grok why offsets were not accounted for in the original change.

@original-brownbear
Copy link
Contributor

Looking into this now. Maybe it's the fact that for the all zeros case we now always have a single block meta ... let me check

@benwtrent
Copy link
Member Author

So, if I switch to just return new Meta(numValues, blockShift) when its all zero, the test passes :/. So it might be the single block meta deal...

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

I don't understand what that change has to do with analysis chain... inconsistent offsets has to do with what TokenStream is doing not the index. Be sure, that you aren't getting confused by the fact this failure does not reproduce 100% of the time.

as far as the ICU charfilter, but i added some prints so we can see what's happening:

  2> TEST FAIL: useCharFilter=true text='\u0003\ufb87\udacd\uddf6d\uf1f4\u02e6\u89f8\uda06\udfd2\u01e6'
  1> Normalizer2: NFKC
  2> Exception from random analyzer:
  ...

So ultimately on this string, the ICU charfilter will only change one character (the arabic presentation form FB87 to 068E). it won't change the length of the string in UTF-16 nor impact any offsets:

- \u0003\uFB87\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036
+ \u0003\u068E\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036

But that charfilter tries to do this incrementally, so it could have some bugs based on how data is being "spoon-fed" to the charfilter (spoonfeeding is happening: that's the useCharFilter=true). I suspect the bug still be in the charfilter logic...

@original-brownbear
Copy link
Contributor

original-brownbear commented Apr 4, 2024

++ to @rmuir I can reproduce this after reverting my changes, this doesn't seem to be related. A failure rate of maybe ~20% for me just means it needs a couple iterations to show at times.

@benwtrent
Copy link
Member Author

OK, apologies for the noise, this test keeps failing weirdly for me. Git bisect has failed me :)

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

i'll try to dig into it to at least find the offending component. If we can narrow it down to the problematic charfilter, tokenizer, or tokenfilter, we can make an easier-to-reproduce case.

In the past I've done this by creating a manual test (think, its a custom analyzer of the exact components printed out) that consumes the exact string and added it to "TestBugInSomething", until I can whittle it down.

Gonna need to move TestBugInSomething.java to the integration tests (analysis.tests) alongside TestRandomChains, so we can do this with failures that involve multiple analysis modules.

The fact that it only reproduces some of the time is also annoying and possibly a separate bug in the test of its own...

asfgit pushed a commit that referenced this issue Apr 4, 2024
This test is setup to reproduce complex failures from TestRandomChains,
e.g. it has SopFilter and other tools for debugging. But it still
resides in the analysis/common module and currently can't be used to
debug any TestRandomChains failures that use other modules (e.g. icu).

relates to #13271
asfgit pushed a commit that referenced this issue Apr 4, 2024
This test is setup to reproduce complex failures from TestRandomChains,
e.g. it has SopFilter and other tools for debugging. But it still
resides in the analysis/common module and currently can't be used to
debug any TestRandomChains failures that use other modules (e.g. icu).

relates to #13271
@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

Attached is a reproducer in TestBugInSomething that seems to work. It is ugly due to bugs in javac, but we don't care as we won't keep it and will just whittle it down to a proper test that goes somewhere else. :)

./gradlew -p lucene/analysis.tests test --tests TestBugInSomething seems to fail every time with this, so the lack-of-reproducibility issue must be some more complex problem involving random seeds.

repro.patch.txt

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

I narrowed the fail down so far a bit to just this chain:

        new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer t = new ThaiTokenizer();
            TokenStream stream = new ShingleFilter(t, "fg");
            stream = new FixedShingleFilter(stream, 3, "b", " ");
            return new TokenStreamComponents(t, stream);
          }
        };

This is nice as it involves no charfilter at all so we know ICU isn't involved.

@rmuir
Copy link
Member

rmuir commented Apr 4, 2024

I'll debug it some more... just need a break. Mainly I wanted to make sure I didn't introduce this with the ICU upgrade...

the two shinglefilters are suspect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants