org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

benwtrent · 2024-04-04T19:30:53Z

Description

Haven't finished bisection to figure out the origin of the bug.

org.apache.lucene.analysis.tests.TestRandomChains > testRandomChains FAILED
    java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=1: 8 vs 11; token=ڎbdb觸 Ǧ
        at __randomizedtesting.SeedInfo.seed([3AEBBB09EA70DCE:3E4F92D1D9B5100E]:0)
        at [email protected]/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:164)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1289)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1187)
        at [email protected]/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1081)
        at [email protected]/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains(TestRandomChains.java:945)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)

  2> stage 0: ڎ<[1-2] +1> d<[4-5] +1> 觸<[7-8] +1> Ǧ<[10-11] +1>
  2> stage 1: ڎ<[1-2] +1> d<[4-5] +1> 觸<[7-8] +1> Ǧ<[10-11] +1>
  2> stage 2: ڎ<[1-2] +1> ڎ d<[1-5] +0> d<[4-5] +1> d 觸<[4-8] +0> 觸<[7-8] +1> 觸 Ǧ<[7-11] +0>
  2> last stage: ڎbdb觸<[1-8] +1> ڎbdb觸 Ǧ<[1-11] +0>
  2> TEST FAIL: useCharFilter=true text='\u0003\ufb87\udacd\uddf6d\uf1f4\u02e6\u89f8\uda06\udfd2\u01e6'
  2> Exception from random analyzer:
  2> charfilters=
  2>   org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter(java.io.StringReader@24c09603, com.ibm.icu.impl.Norm2AllModes$ComposeNormalizer2@43b7067e)
  2> tokenizer=
  2>   org.apache.lucene.analysis.th.ThaiTokenizer(org.apache.lucene.util.AttributeFactory$1@a0c154ba)
  2> filters=
  2>   org.apache.lucene.analysis.no.NorwegianMinimalStemFilter(ValidatingTokenFilter@215f24e9 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false)
  2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@47f9add2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false, fg)
  2>   org.apache.lucene.analysis.shingle.FixedShingleFilter(ValidatingTokenFilter@71cd79a term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,keyword=false, 3, b, )
   >     java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=1: 8 vs 11; token=ڎbdb觸 Ǧ

Gradle command to reproduce

./gradlew test --tests TestRandomChains.testRandomChains -Dtests.seed=3AEBBB09EA70DCE -Dtests.nightly=true -Dtests.locale=en-CA -Dtests.timezone=America/Cordoba -Dtests.asserts=true -Dtests.file.encoding=UTF-8

The text was updated successfully, but these errors were encountered:

benwtrent · 2024-04-04T19:39:42Z

git bisect says: 6cba773

It isn't immediately obvious why to me. @original-brownbear what do you think?

rmuir · 2024-04-04T19:47:42Z

@benwtrent how long ago did you see this test failure?

I can't reproduce it with the latest main branch which has the ICU upgrade.
If i go back one commit to before the ICU upgrade to 342c814, then the test fails.

So I think ICU fixed the bug... i'll try to dig it up, I'm fairly certain we reported a bug related to this.

rmuir · 2024-04-04T19:51:30Z

I can't find the specific bug, but we've reported several such issues to them in the past. I'm having trouble with their JIRA...

Either way, fails on 342c814 but succeeds on 54a2e11, so I think the bug is fixed?

benwtrent · 2024-04-04T19:52:07Z

@rmuir

I can't reproduce it with the latest main branch which has the ICU upgrade.

Oh dang, I might not have fetched latest 🤦 Let me try again.

rmuir · 2024-04-04T19:54:44Z

I just reproduced it on main too :(

So I think the problem is that it doesn't always reproduce: this likely caused your git-bisect confusion. But at least it isn't NEWLY introduced by the icu upgrade, it is a pre-existing condition...

benwtrent · 2024-04-04T20:12:04Z

OK, I simply commented out @original-brownbear 's change and always returned the meta and the test passes with the seed and settings. When I return the cached instance, it fails.

I noticed that the offsets are not 0. I don't know if that is causing some weirdness. I don't fully grok why offsets were not accounted for in the original change.

original-brownbear · 2024-04-04T20:14:28Z

Looking into this now. Maybe it's the fact that for the all zeros case we now always have a single block meta ... let me check

benwtrent · 2024-04-04T20:19:38Z

So, if I switch to just return new Meta(numValues, blockShift) when its all zero, the test passes :/. So it might be the single block meta deal...

rmuir · 2024-04-04T20:22:57Z

I don't understand what that change has to do with analysis chain... inconsistent offsets has to do with what TokenStream is doing not the index. Be sure, that you aren't getting confused by the fact this failure does not reproduce 100% of the time.

as far as the ICU charfilter, but i added some prints so we can see what's happening:

  2> TEST FAIL: useCharFilter=true text='\u0003\ufb87\udacd\uddf6d\uf1f4\u02e6\u89f8\uda06\udfd2\u01e6'
  1> Normalizer2: NFKC
  2> Exception from random analyzer:
  ...

So ultimately on this string, the ICU charfilter will only change one character (the arabic presentation form FB87 to 068E). it won't change the length of the string in UTF-16 nor impact any offsets:

- \u0003\uFB87\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036
+ \u0003\u068E\uDACD\uDDF6\u0064\uF1F4\u02E6\u89F8\uDA06\uDFD2\u005C\u0075\u0030\u0031\u0065\u0036

But that charfilter tries to do this incrementally, so it could have some bugs based on how data is being "spoon-fed" to the charfilter (spoonfeeding is happening: that's the useCharFilter=true). I suspect the bug still be in the charfilter logic...

original-brownbear · 2024-04-04T20:24:55Z

++ to @rmuir I can reproduce this after reverting my changes, this doesn't seem to be related. A failure rate of maybe ~20% for me just means it needs a couple iterations to show at times.

benwtrent · 2024-04-04T20:29:28Z

OK, apologies for the noise, this test keeps failing weirdly for me. Git bisect has failed me :)

rmuir · 2024-04-04T20:38:35Z

i'll try to dig into it to at least find the offending component. If we can narrow it down to the problematic charfilter, tokenizer, or tokenfilter, we can make an easier-to-reproduce case.

In the past I've done this by creating a manual test (think, its a custom analyzer of the exact components printed out) that consumes the exact string and added it to "TestBugInSomething", until I can whittle it down.

Gonna need to move TestBugInSomething.java to the integration tests (analysis.tests) alongside TestRandomChains, so we can do this with failures that involve multiple analysis modules.

The fact that it only reproduces some of the time is also annoying and possibly a separate bug in the test of its own...

This test is setup to reproduce complex failures from TestRandomChains, e.g. it has SopFilter and other tools for debugging. But it still resides in the analysis/common module and currently can't be used to debug any TestRandomChains failures that use other modules (e.g. icu). relates to #13271

rmuir · 2024-04-04T21:18:33Z

Attached is a reproducer in TestBugInSomething that seems to work. It is ugly due to bugs in javac, but we don't care as we won't keep it and will just whittle it down to a proper test that goes somewhere else. :)

./gradlew -p lucene/analysis.tests test --tests TestBugInSomething seems to fail every time with this, so the lack-of-reproducibility issue must be some more complex problem involving random seeds.

repro.patch.txt

rmuir · 2024-04-04T21:28:11Z

I narrowed the fail down so far a bit to just this chain:

        new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer t = new ThaiTokenizer();
            TokenStream stream = new ShingleFilter(t, "fg");
            stream = new FixedShingleFilter(stream, 3, "b", " ");
            return new TokenStreamComponents(t, stream);
          }
        };

This is nice as it involves no charfilter at all so we know ICU isn't involved.

rmuir · 2024-04-04T21:35:00Z

I'll debug it some more... just need a break. Mainly I wanted to make sure I didn't introduce this with the ICU upgrade...

the two shinglefilters are suspect.

benwtrent added the type:test label Apr 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

benwtrent commented Apr 4, 2024

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

benwtrent commented Apr 4, 2024 •

edited

original-brownbear commented Apr 4, 2024

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

original-brownbear commented Apr 4, 2024 •

edited

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains test failure #13271

Comments

benwtrent commented Apr 4, 2024

Description

Gradle command to reproduce

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

benwtrent commented Apr 4, 2024 • edited

original-brownbear commented Apr 4, 2024

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

original-brownbear commented Apr 4, 2024 • edited

benwtrent commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

rmuir commented Apr 4, 2024

benwtrent commented Apr 4, 2024 •

edited

original-brownbear commented Apr 4, 2024 •

edited