PCRE "single-line mode" not properly represented in CTRE #282

Open
Minty-Meeo opened this issue Apr 27, 2023 · 6 comments

@Minty-Meeo

Minty-Meeo commented Apr 27, 2023

I want to preface this with the fact that I am quite inexperienced with regular expressions, so I may be wrong about some things.

When I created issue #281, the CTRE example I linked used ctre::multiline_starts_with because it was a simplified snippet from a personal project I am attempting to convert to CTRE. I intended to use ctre::starts_with, as that is the direct analogue of the std::regex mode I was using before. However, ctre::starts_with consistently caused stack overflow crashes. I have now discovered, through trial and error, why this was.

STL: https://godbolt.org/z/vP9YqGP3v
CTRE: https://godbolt.org/z/bedTY8jxo

I do not know how to describe it, but it seems regular expressions of various flavors (when not in multi-line mode) have special rules for the '\n' and '\r' characters that CTRE does not follow. I found a website that helps support this claim: https://regex101.com/r/Syt781/1. Notice that the regex behaves identically in ECMAScript, PCRE, and PCRE2 modes. I say it is a special rule for these characters in particular because other characters, including escape sequences like '\a', still let the greedy capture run too far with std::regex: https://godbolt.org/z/1cj3KqMas.

@Minty-Meeo

I think there is code that tries to achieve this in ctre::evaluate, but it is hidden behind multi-line mode.

@Minty-Meeo

Minty-Meeo commented Apr 28, 2023

Here is a simplified example of the std::regex behavior on a string containing '\r' or '\n'.
https://godbolt.org/z/q555G3hdo
Even when not part of the expression, '\r' or '\n' halts any capture. I had no clue my project relied on this behavior until just today.
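
For readers who cannot open the Godbolt link, the behavior described above can be reproduced with a minimal sketch like the following. It is not the code behind the link; the helper name `greedy_prefix` is hypothetical, and the sketch assumes the default ECMAScript grammar of std::regex.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the linked example): returns whatever a
// greedy ".*" matches at the very start of `input`, using std::regex's
// default ECMAScript grammar.
std::string greedy_prefix(const std::string& input) {
    static const std::regex any_run(".*");
    std::smatch match;
    // match_continuous anchors the search at the first character,
    // roughly what a "starts with" style match does.
    if (std::regex_search(input, match, any_run,
                          std::regex_constants::match_continuous)) {
        return match.str();
    }
    return {};
}
```

With this helper, an input such as "abc\ndef" or "abc\rdef" yields only the part before the line break, while "abc\adef" is consumed whole, because '\a' is not a line terminator.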

@Minty-Meeo

Minty-Meeo commented Apr 28, 2023

It seems like ECMAScript is the only flavor available to std::regex with this special rule for '\n' and '\r'. I don't know enough about PCRE to know if the same is true, or if this is the nature of "multi-line" mode for PCRE and it is simply on by default in any online examples I can find.

@iulian-rusu

> Even when not part of the expression, '\r' or '\n' halts any capture. I had no clue my project relied on this behavior until just today.

In most regex flavors, including the ECMAScript grammar that std::regex uses, the . metacharacter does not match line breaks (\r or \n) by default. As far as I know, CTRE's . matches anything by default, including line breaks. That differs from std::regex, which is why the capture halted once it reached a line break.

Here is a useful website which explains how the dot character works. In short, there is this flag called "single-line" (or sometimes "dotall") which makes the dot actually match line breaks.

I usually use something like [\d\D] or sometimes [^] (if this syntax is supported) when I want to be absolutely sure the pattern will match anything.
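
As a rough illustration of the [\d\D] workaround with std::regex (a sketch only; the helper name `greedy_prefix_any` is hypothetical, and [^] is left out because support for it varies between implementations):

```cpp
#include <regex>
#include <string>

// Hypothetical helper: same idea as a greedy ".*" prefix match, but the
// class [\d\D] (digit or non-digit) accepts every character, so line
// breaks no longer stop the match.
std::string greedy_prefix_any(const std::string& input) {
    static const std::regex any_run(R"([\d\D]*)");
    std::smatch match;
    // Anchor the search at the first character of the input.
    if (std::regex_search(input, match, any_run,
                          std::regex_constants::match_continuous)) {
        return match.str();
    }
    return {};
}
```

Here an input like "abc\ndef" is matched in full, which is the behavior CTRE's . is described as having by default.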

@Minty-Meeo

I see, so this is a quirk exclusive to Perl-Compatible Regular Expressions. I think CTRE makes the mistake of assuming multi-line mode is the opposite of single-line mode, like this website says, as I found in the source code while making PR #283 that multi-line mode is what enables the behavior of never matching '\r' or '\n' for CTRE.

@Minty-Meeo

Oh dear, the documentation you linked says PCRE is supposed to allow configuring which characters count as line endings. So my PR isn't really PCRE-valid; it just matches the std::regex behavior now. This is a complicated topic.

Minty-Meeo changed the title on Apr 30, 2023: Special rule for '\n' and '\r' not found in CTRE → PCRE "single-line mode" not properly represented in CTRE