`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

expenses · 2023-12-23T10:31:12Z

What version of regex are you using?

v1.10.2

Describe the bug at a high level.

I'm trying to match 2D patterns inside 2D flattened array of u8s. For example, to match

0
1

on a 1024x1024 array I'm this pattern: "\\u{0}.{1023}\\u{1}". I'm compiling with

regex::bytes::RegexBuilder::new(&string)
    .unicode(false)
    .dot_matches_new_line(true)
    .build()

My expectation is that this kind of pattern would be at-worst slightly slower than the naïve alternative that follows:

enum RegexOp {
    Match(Vec<u8>),
    Skip(usize)
}

fn regex_match(ops: &[RegexOp], mut slice: &[u8]) -> bool {
    for op in ops {
        match op {
            RegexOp::Skip(skip) => slice = &slice[*skip..],
            RegexOp::Match(m) => {
                if !slice.starts_with(m) {
                    return false;
                }
                slice = &slice[m.len()..];
            }
        }
    }

    true
}

In this case ops looks like [Match([1]), Skip(1023), Match([0])].

Let me know if this is too much to ask.

Currently the regex solution takes 27.70 seconds for a 256x256 workload according to time:

________________________________________________________
Executed in   27.70 secs    fish           external
   usr time   27.65 secs  827.00 micros   27.65 secs
   sys time    0.60 secs   60.00 micros    0.60 secs

Whereas a 2-line change from is_match to my regex_match function drops this down to 147.17 ms (or 517ms if you're taking the longer value, doesn't really matter).

________________________________________________________
Executed in  147.17 millis    fish           external
   usr time  154.46 millis  663.00 micros  153.80 millis
   sys time  517.86 millis  183.00 micros  517.68 millis

I can provide proper benchmarks if you want. I am of-course using release mode and the standard set of regex crate features.

Please let me know if I've made some mistake in RegexBuilder or if a seperate set of feature flags might improve things.

What are the steps to reproduce the behavior?

I'm happy to write a test program if that's required for this issue to get attention, but I believe it should be trivial to write one given what I've written above.

What is the actual behavior?

N/A

What is the expected behavior?

N/A

What do you expect the output to be?

N/A

Related: #590 (comment).

The text was updated successfully, but these errors were encountered:

expenses · 2023-12-23T11:28:02Z

Oh sorry, I just stepped back for a second and realized that this is an slightly unfair comparison. regex_match checks if there is a match directly at the start of the slice, while regex::bytes::Regex::is_match checks for a match anywhere in the slice. The slices I'm passing as input are exactly the same size as the pattern and either have a matching pattern or not, so in this particular case there shouldn't really be a difference.

Tomorrow I'll try and write an equivalent of regex::bytes::Regex::find as I'm not super happy with the performance of that with these patterns either. I'll use that to generate another comparison.

BurntSushi · 2023-12-23T13:14:38Z

I'm happy to write a test program if that's required for this issue to get attention

Yes, please provide a reproduction program. While you did provide a good amount of details, it's no substitute for the real thing. Otherwise, I have to spend time deciphering exactly what you meant in your prose and possibly make guesses on things. But if you give me the code, the inputs and the specific commands you're running, then it takes all of the guesswork out of the process. I can then be reasonably sure that I'm looking at the same thing you are.

With that said...

My expectation is that this kind of pattern would be at-worst slightly slower than the naïve alternative that follows:

I think your expectation is way off personally. I wouldn't call your code naive in any way. It's a bespoke targeted solution to a specific problem. There are a variety of reasons why it might be faster here:

My guess is that your benchmark is latency oriented. Your bespoke code is small and likely has little to almost no overhead at all. Compare that with a regex engine which has to execute a pile code for every search before it even gets to executing the search itself.
Regex engines support a very general pattern language, and it just doesn't take advantage of every possible optimization opportunity. Sometimes it's because it's difficult to do. Sometimes it's because it might not be worth it. And in other less interesting cases, it's just because nobody has attempt to implement it. In this case, from what I can see, your bespoke matcher is doing something very critical to performance here that the regex engine is likely not doing: it's recognizing the bounded repeat as a simple skip instruction and literally jumping over an entire portion of the haystack. The regex engine, by contrast, is still going to match . 1023 times one byte at a time. That's going to lose---and lose big---to the simple pointer offset you've implemented.

With respect to (2), it's possible the regex engine could implement this optimization, but it would be limited to specific cases. And I haven't thought through all of it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

expenses commented Dec 23, 2023 •

edited

expenses commented Dec 23, 2023

BurntSushi commented Dec 23, 2023

regex::bytes::Regex::is_match with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

regex::bytes::Regex::is_match with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

Comments

expenses commented Dec 23, 2023 • edited

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

expenses commented Dec 23, 2023

BurntSushi commented Dec 23, 2023

`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141

expenses commented Dec 23, 2023 •

edited