`regex::bytes::Regex::is_match` with a simple pattern with long sequences of wildcards is significantly slower than a naïve alternative #1141
Comments
Oh sorry, I just stepped back for a second and realized that this is a slightly unfair comparison. Tomorrow I'll try and write an equivalent of
Yes, please provide a reproduction program. While you did provide a good amount of details, it's no substitute for the real thing. Otherwise, I have to spend time deciphering exactly what you meant in your prose and possibly make guesses on things. But if you give me the code, the inputs and the specific commands you're running, then it takes all of the guesswork out of the process. I can then be reasonably sure that I'm looking at the same thing you are. With that said...
I think your expectation is way off personally. I wouldn't call your code naive in any way. It's a bespoke targeted solution to a specific problem. There are a variety of reasons why it might be faster here:
With respect to (2), it's possible the regex engine could implement this optimization, but it would be limited to specific cases. And I haven't thought through all of it.
What version of regex are you using?
v1.10.2
Describe the bug at a high level.
I'm trying to match 2D patterns inside a flattened 2D array of `u8`s. For example, to match on a 1024x1024 array I'm using this pattern: `"\\u{0}.{1023}\\u{1}"`. I'm compiling with

My expectation is that this kind of pattern would be at worst slightly slower than the naïve alternative that follows:

In this case `ops` looks like `[Match([1]), Skip(1023), Match([0])]`.

Let me know if this is too much to ask.
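For illustration, a matcher of the kind described, driven by an `ops` list such as `[Match([1]), Skip(1023), Match([0])]`, might look like the sketch below. It is not the code from this issue: the `Op` enum and the functions `program_len`, `matches_at`, and `naive_is_match` are hypothetical names, and only the shape of the `ops` list is taken from the text above. Note that `Skip` never inspects the skipped bytes, whereas `.` in a regex can exclude `\n` depending on the flags in use, so the two are not necessarily equivalent.

```rust
/// One step of a hypothetical fixed-shape matcher program.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    /// The byte at the current offset must be one of these values.
    Match(Vec<u8>),
    /// Advance by this many bytes without inspecting them.
    Skip(usize),
}

/// Total number of bytes the program consumes from a candidate start.
fn program_len(ops: &[Op]) -> usize {
    ops.iter()
        .map(|op| match op {
            Op::Match(_) => 1,
            Op::Skip(n) => *n,
        })
        .sum()
}

/// Checks whether the program matches when it begins at `start`.
fn matches_at(ops: &[Op], haystack: &[u8], start: usize) -> bool {
    let mut pos = start;
    for op in ops {
        match op {
            Op::Match(allowed) => match haystack.get(pos) {
                Some(b) if allowed.contains(b) => pos += 1,
                _ => return false,
            },
            Op::Skip(n) => pos += n,
        }
    }
    // A trailing Skip must still fit inside the haystack.
    pos <= haystack.len()
}

/// Tries every start offset at which the whole program could fit.
fn naive_is_match(ops: &[Op], haystack: &[u8]) -> bool {
    let len = program_len(ops);
    if len > haystack.len() {
        return false;
    }
    (0..=haystack.len() - len).any(|start| matches_at(ops, haystack, start))
}

fn main() {
    // Small 8x8 grid, flattened row-major; 9 is an arbitrary filler value.
    let width = 8;
    let mut grid = vec![9u8; width * width];
    grid[2 * width + 3] = 1; // row 2, column 3
    grid[3 * width + 3] = 0; // row 3, column 3 (directly below)

    // Two vertically adjacent cells are `width` bytes apart, so the gap
    // between them is `width - 1` bytes.
    let ops = vec![Op::Match(vec![1]), Op::Skip(width - 1), Op::Match(vec![0])];
    println!("{}", naive_is_match(&ops, &grid));
}
```

The skip length follows from the row-major layout: a cell at offset `i` has the cell directly below it at offset `i + width`, which is where the `1023` in both the regex and the `ops` list comes from for a 1024-wide array.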
Currently the regex solution takes 27.70 seconds for a 256x256 workload according to `time`:

Whereas a 2-line change from `is_match` to my `regex_match` function drops this down to 147.17 ms (or 517 ms if you're taking the longer value; it doesn't really matter).

I can provide proper benchmarks if you want. I am of course using release mode and the standard set of `regex` crate features.

Please let me know if I've made some mistake in `RegexBuilder` or if a separate set of feature flags might improve things.

What are the steps to reproduce the behavior?
I'm happy to write a test program if that's required for this issue to get attention, but I believe it should be trivial to write one given what I've written above.
What is the actual behavior?
N/A
What is the expected behavior?
N/A
What do you expect the output to be?
N/A
Related: #590 (comment).