[API Proposal]: Regex Streaming Mode #102221

Closed
smbecker opened this issue May 14, 2024 · 4 comments
Labels
api-suggestion (Early API idea and discussion, it is NOT ready for implementation), area-System.Text.RegularExpressions

Comments

@smbecker

Background and motivation

Currently, the Regex engine can only match against the input passed to it, which is either a string or a ReadOnlySpan<char>. To run a Regex against a file of indeterminate length, you have to load the entire file into memory first. If you instead stream the file through by matching each block separately as a ReadOnlySpan<char>, matches that span block boundaries can be missed, because the match state is reset on each invocation. It would be beneficial to preserve state between calls, so that the ending state of one invocation becomes the initial state of the next. This would let a user stream the file through the Regex engine without blowing up memory or missing matches.
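
To make the failure mode concrete, here is a minimal sketch using only existing APIs (Regex.Count, available since .NET 7). The input, pattern, and block size are hypothetical, chosen so that the only match straddles a block boundary.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input and pattern: the only match straddles the block boundary.
string text = "xxxxfoobarxxxx";
var regex = new Regex("foobar");
const int blockSize = 7; // yields the blocks "xxxxfoo" and "barxxxx"

// Baseline: match against the whole input at once.
int CountWhole(string s) => regex.Count(s);

// Streaming attempt: match each block independently.
int CountPerBlock(string s, int size)
{
    int total = 0;
    for (int offset = 0; offset < s.Length; offset += size)
    {
        int length = Math.Min(size, s.Length - offset);
        // Each call starts from a fresh match state; nothing carries over
        // from the previous block.
        total += regex.Count(s.AsSpan(offset, length));
    }
    return total;
}

Console.WriteLine($"whole input: {CountWhole(text)} match(es)");
Console.WriteLine($"per block:   {CountPerBlock(text, blockSize)} match(es)");
```

Matching the whole input finds one match, while matching block by block finds none, because neither "xxxxfoo" nor "barxxxx" contains "foobar" on its own.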

This would function similarly to streaming mode in Hyperscan.

After (naively) reviewing how source-generated Regex instances work, this seems well in line with how RegexRunner already works. The match state appears to be captured in the active RegexRunner instance, which is recreated on each invocation. It would be beneficial if the lifecycle of the runner were controlled by the caller rather than being an internal implementation detail.

API Proposal

namespace System.Text.RegularExpressions;

public class Regex
{
    // Change visibility on current private method 
    public RegexRunner CreateRunner();

    // Add overloads that accept the current runner
    public bool IsMatch(ReadOnlySpan<char> input, RegexRunner runner);
    public Regex.ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, RegexRunner runner);
}

API Usage

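using System.Buffers;
using System.Text;
using System.Text.RegularExpressions;

// GeneratedRegex must decorate a partial method (or, from .NET 9, a partial property), not a field.
[GeneratedRegex("...")]
private static partial Regex StreamingRegex();

var regex = StreamingRegex();
var rentedBuffer = ArrayPool<char>.Shared.Rent(4096);
try {
	var buffer = rentedBuffer.AsSpan();
	using var file = File.OpenRead("/file.txt");
	using var reader = new StreamReader(file, Encoding.UTF8);
	// Proposed API: a single runner carries match state across every block of the file.
	var runner = regex.CreateRunner();
	while (!reader.EndOfStream) {
		var read = reader.ReadBlock(buffer);
		if (read == 0) {
			break;
		}

		// Proposed overload: matching resumes from the state left by the previous block.
		foreach (var match in regex.EnumerateMatches(buffer[..read], runner)) {
			// Index and Length are applied to the current block here.
			Console.WriteLine(buffer.Slice(match.Index, match.Length).ToString());
		}
	}
} finally {
	ArrayPool<char>.Shared.Return(rentedBuffer);
}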

Alternative Designs

No response

Risks

No response

smbecker added the api-suggestion label May 14, 2024
dotnet-policy-service bot added the untriaged label May 14, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@steveharter
Member

PTAL @stephentoub

@stephentoub
Member

stephentoub commented May 15, 2024

While I understand the desirability of the feature, this is basically impossible for most of the engines today (e.g. the backtracking interpreter engine). In theory it could be done for the NonBacktracking engine, and answering only "is there a match" would be feasible there with some work. Anything more, such as the bounds of the match or its captures, would still require multiple passes over the relevant portion of the data, which could mean needing access to all of it. If this were processing a seekable stream, I expect it could be made to work. But if the engine is just handed individual pieces of data, matches could straddle those segments in a way that forces the implementation to copy all of it.
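
As an illustration of why the "is there a match" question can be streamed by an automaton-based matcher, here is a minimal sketch of a hand-written DFA whose single integer state survives from one block to the next. The pattern `ab+c`, the state numbering, and the helper names are all hypothetical, and this is not how the NonBacktracking engine is implemented. Note that it reports only whether a match exists, not its bounds or captures.

```csharp
using System;
using System.Collections.Generic;

// Hand-written DFA for the hypothetical pattern "ab+c", searched anywhere in the input.
// States: 0 = nothing useful seen, 1 = seen 'a', 2 = seen 'a' plus one or more 'b',
// 3 = match found (sticky).
int Step(int state, char c) => state switch
{
    3 => 3,                    // once matched, stay matched
    _ when c == 'a' => 1,      // an 'a' always starts a fresh candidate
    1 or 2 when c == 'b' => 2, // extend the run of 'b'
    2 when c == 'c' => 3,      // 'c' after at least one 'b' completes the match
    _ => 0,                    // anything else discards the candidate
};

// Feeds blocks one at a time; the only data kept between blocks is the state.
bool ContainsMatch(IEnumerable<string> blocks)
{
    int state = 0;
    foreach (string block in blocks)
    {
        foreach (char c in block)
        {
            state = Step(state, c);
            if (state == 3)
            {
                return true;
            }
        }
    }
    return false;
}

// "abbbc" straddles the boundary between the two blocks.
Console.WriteLine(ContainsMatch(new[] { "xxab", "bbcx" }));
```

Because the state is carried across the boundary, the straddling match is detected without retaining any earlier block. Recovering where the match starts would require revisiting earlier input, which is the difficulty described above.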

@steveharter
Member

Closing based on comments above; this feature would support a limited set of scenarios when used with a Stream.

steveharter closed this as not planned May 29, 2024
dotnet-policy-service bot removed the untriaged label May 29, 2024