[API Proposal]: Regex Streaming Mode #102221
Comments
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
PTAL @stephentoub
While I understand the desirability of the feature, this is basically impossible for most of the engines today (e.g. the backtracking interpreter engine). In theory it could be done for the NonBacktracking engine, and answering "is there a match" would be feasible there with some work. But anything more than that, such as needing to know the bounds of the match or the captures in the match, would still require multiple passes over the relevant portion of the data, which could mean needing to have access to all of it. If this were processing a seekable stream, then I expect it could be made to work, but if it's just being handed individual pieces of data, matches could straddle those segments in a way that could force the implementation to copy all of it.
Closing based on comments above; this feature would support a limited set of scenarios when used with a Stream.
Background and motivation
Currently, the Regex engine can only match on the passed-in input parameter, which is either a String or ReadOnlySpan<char>. When running a Regex match against a file of indeterminate length, this can cause issues given that you have to load the entire file into memory before passing it into the Regex. If a user attempts to stream the file through by just calling Match on each block of ReadOnlySpan, then matches that would span across blocks could potentially be missed since the match state is reset on each invocation. It would be beneficial to preserve state between calls to Match such that the ending state from the previous invocation becomes the initial state for the next invocation. This would allow a user to stream the file through the Regex engine without blowing up memory or missing matches. This would function similarly to streaming mode in hyperscan.
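To make the missed-match problem concrete, here is a minimal sketch, not taken from the proposal. The pattern, the input string, and the block size are all hypothetical and chosen so that the match straddles a block boundary. It assumes .NET 7 or later, where Regex.IsMatch accepts a ReadOnlySpan<char>. The last loop shows a common workaround (carrying the tail of the previous block forward), which only works under the stated assumption that the longest possible match length is known in advance.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern, data, and block size for illustration only.
var regex = new Regex("needle");
string data = "hayneedlehay";
const int BlockSize = 6; // splits the input into "haynee" and "dlehay"

// Whole input in memory: the match is found.
bool foundWhole = regex.IsMatch(data);

// Block by block: every call starts from a fresh match state,
// so the occurrence that straddles the boundary is missed.
bool foundChunked = false;
for (int i = 0; i < data.Length; i += BlockSize)
{
    ReadOnlySpan<char> block = data.AsSpan(i, Math.Min(BlockSize, data.Length - i));
    foundChunked |= regex.IsMatch(block);
}

// Workaround sketch: prepend the tail of the previous window to the next block.
// ASSUMPTION: no match is longer than MaxMatchLength characters.
const int MaxMatchLength = 6;
bool foundWithCarry = false;
string carry = "";
for (int i = 0; i < data.Length; i += BlockSize)
{
    string window = carry + data.Substring(i, Math.Min(BlockSize, data.Length - i));
    foundWithCarry |= regex.IsMatch(window);

    // Keep just enough characters to complete a match that started in this window.
    carry = window.Length > MaxMatchLength - 1
        ? window[^(MaxMatchLength - 1)..]
        : window;
}

Console.WriteLine($"{foundWhole} {foundChunked} {foundWithCarry}");
```

The carry-over approach copies data between blocks and breaks down for patterns with no upper bound on match length (for example, anything containing `.*`), where the carried tail could grow to the whole input.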
After (naively) reviewing the way that source-generated Regex instances work, this seems like it would be well in line with how RegexRunner works already. The match state seems to be captured in the active RegexRunner instance, which is recreated on each invocation. It would be beneficial if the lifecycle of the runner were controlled by the caller rather than being an internal implementation detail.
API Proposal
API Usage
Alternative Designs
No response
Risks
No response