Generating Rust code from `regex-automata` types #1192

AshleySchaeffer · 2024-05-08T20:28:34Z

On Reddit I asked about using the Automaton trait to generate Rust code. Specifically to build a Regex search implementation that leverages generated code that can be statically compiled into a binary.

I understand there's a lot of other optimisation in play within the Regex crate, which I'd be keen to explore as well. E.g. literal optimisations.

I'd like to debate an implementation. In addition, I'd be keen to understand code generation for extremely large and complex patterns that currently cannot be compiled into a DFA due to an excessive number of states.

Given this, I don't know if it would be possible to explore other types within this crate for use in code generation.

BurntSushi · 2024-05-08T21:07:27Z

Maintainer

> I'd be keen to understand code generation for extremely large and complex patterns that currently cannot be compiled into a DFA due to an excessive number of states.

I think I already responded to this on reddit right? I said:

> You can't build a DFA with more than 2^32 states using regex-automata. There is no getting around it. A DFA bigger than that is likely impractical to build anyway. DFA building isn't a scalable operation. It's just going to peg your CPU for ~forever. All other limits (except for the number of patterns, also limited to 2^32) in the library are optional and disabled by default.

Is there a question you had about this response? What kind of patterns are you trying to compile? Are you actually able to compile a pattern that is under but near the limit?

AshleySchaeffer May 8, 2024
Author

Ah yes. Basically, I'm more looking for guidance on how to approach this, and space to debate how code generation may differ in its overall approach from the runtime types you've baked into this crate.

I feel like generated code that forms a DFA could be infinitely large (within reason) and may not have to rely on transition tables depending on how it was implemented. Therefore, state IDs may not be required at all.

I'm still a little confused on how to use the trait exactly too.

BurntSushi May 8, 2024
Maintainer

I'm sorry, I really just don't understand what you're asking here. Like, what is the debate you want to have? I just don't get it. I don't understand the parameters. I'm happy to engage to an extent, I just don't know where to start. My mind works better with concrete examples. I don't really deal with things like "support infinitely large DFAs." Because that isn't a concrete problem. It's a solution to a problem I don't understand.

You're right that state IDs might not be required at all. But a state ID is just one implementation strategy. It's an implementation of a pointer. I could have used pointers instead of IDs, for example, and there wouldn't be what you might call an artificial limitation on the state ID. The implementation would just let you use as many pointers as your underlying system was willing to let you allocate.

But you can't look at this in a vacuum. If you built a DFA that would otherwise blow the state ID limit, then how much Rust code would that be? Could rustc even compile it? What limits does it have? This is why I can't really meet you in this hypothetical world without constraining the space of discussion because it isn't clear to me how you solve other problems caused by a large DFA. Like, yes, true, the state ID limit is somewhat artificial, but the limit is so large as to be not much different than other more natural limitations (like available memory or time/CPU). This means that focusing on the state ID limit to the exclusion of everything is not something I really know how to do.

Have you tried building DFAs before? Try building some and see how long it takes. Here, I'll get you started:

$ regex-cli debug dense dfa -q '\w{1000}'
      parse time:  13.092µs
  translate time:  6.661µs
compile nfa time:  74.356378ms
compile dfa time:  124.928920666s
          memory:  337499712
     pattern len:  1
      start kind:  Both
    alphabet len:  113
          stride:  128
      has empty?:  false
        is utf8?:  true

A simple regex consisting of 1000 (Unicode-aware) word characters takes over 2 minutes to build and uses ~337MB of heap.

> I'm still a little confused on how to use the trait exactly too.

OK... I also said this on reddit:

> You get the start state and then use the transition function from there. The alphabet is all distinct values of u8, plus the special EOI symbol.
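
For readers following along, here is a minimal sketch of that recipe. It assumes a hand-written toy DFA for an anchored, whole-haystack match of `ab+`, not a real regex-automata DFA. The function names mirror the trait methods mentioned in this thread (`next_state`, `next_eoi_state`, `is_match_state`), but the state numbers and the `is_full_match` helper are hypothetical and std-only:

```rust
// Hypothetical toy DFA for `ab+`; these state numbers are made up.
type StateId = usize;
const DEAD: StateId = 0;
const START: StateId = 1;
const SEEN_A: StateId = 2;
const SEEN_AB: StateId = 3;
const MATCH: StateId = 4;

/// Transition on one haystack byte (the alphabet is every `u8` value).
fn next_state(sid: StateId, byte: u8) -> StateId {
    match (sid, byte) {
        (START, b'a') => SEEN_A,
        (SEEN_A, b'b') | (SEEN_AB, b'b') => SEEN_AB,
        _ => DEAD,
    }
}

/// Transition on the special end-of-input symbol.
fn next_eoi_state(sid: StateId) -> StateId {
    if sid == SEEN_AB { MATCH } else { DEAD }
}

fn is_match_state(sid: StateId) -> bool {
    sid == MATCH
}

/// Start state, then one transition per byte, then the EOI transition.
fn is_full_match(haystack: &[u8]) -> bool {
    let mut sid = START;
    for &byte in haystack {
        sid = next_state(sid, byte);
    }
    sid = next_eoi_state(sid);
    is_match_state(sid)
}
```

In this toy, a match is only confirmed after the EOI transition, which is why the end-of-input symbol is treated as part of the alphabet.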

Can you say more than just "I'm confused"? You've got to help me help you. :-) If I don't know what you're confused about, it's hard to target advice to you. What have you tried? Can you show me a program that you've tried to write but doesn't work how you expect/want? Or is there a conceptual gap here? If conceptual, maybe one way to help is to suggest that a finite state machine is "just" a special kind of graph. And you can use graph traversal algorithms to explore the finite state machine.

AshleySchaeffer May 9, 2024
Author

Before anything else, I just want to be clear that I appreciate your time on this. I also appreciate that patience can likely wear thin when things are not explained in a way you're expecting, and by the sounds of it we're on two very different wavelengths with regards to this (i.e. you know a lot about Regex, Rust, and this crate and I know comparatively very little). I'm coming from an almost zero-knowledge perspective with regards to Regex theory. Ergo whatever my contribution to this will be, it'll likely not be backed by the same knowledge of this domain as yours. However, I was hoping that you'd be open to guiding me in as specific or general a direction as you'd be willing. So please, feel free to not spend time on this at all. I mean no offence! I really don't want to put you out in any way. To reiterate, I really do not expect you to have or feel obliged to provide all the answers. I merely thought you'd be ideally placed to provide valuable insight and direction given how much work you've done in this area.

> I'm sorry, I really just don't understand what you're asking here. Like, what is the debate you want to have? I just don't get it. I don't understand the parameters. I'm happy to engage to an extent, I just don't know where to start. My mind works better with concrete examples. I don't really deal with things like "support infinitely large DFAs." Because that isn't a concrete problem. It's a solution to a problem I don't understand.
>
> You're right that state IDs might not be required at all. But a state ID is just one implementation strategy. It's an implementation of a pointer. I could have used pointers instead of IDs, for example, and there wouldn't be what you might call an artificial limitation on the state ID. The implementation would just let you use as many pointers as your underlying system was willing to let you allocate.
>
> But you can't look at this in a vacuum. If you built a DFA that would otherwise blow the state ID limit, then how much Rust code would that be? Could rustc even compile it? What limits does it have? This is why I can't really meet you in this hypothetical world without constraining the space of discussion because it isn't clear to me how you solve other problems caused by a large DFA. Like, yes, true, the state ID limit is somewhat artificial, but the limit is so large as to be not much different than other more natural limitations (like available memory or time/CPU). This means that focusing on the state ID limit to the exclusion of everything is not something I really know how to do.

Essentially, I'm interested in understanding the practicalities of generating Rust code that performs the various search functions and provides the various match types supported by this crate, in a very similar way to re2c but with broader Regex support. As this would be a very different implementation to the one in this crate, I'd like to understand what other limitations may come into play, and what considerations might need to be taken around other search-related optimisations. Also, I believe this may introduce benefits that aren't currently possible in the Regex crate (the obvious one being compiler optimisations of static code). In any case, I'm not 100% sure about any of this and I don't know if it's a worthy endeavour.

I specifically mentioned the "too many states" problem because I thought logically it may be possible to generate code in a way that simply is not hindered by this limitation. I have use cases that hit this limit for DFAs such that they will not compile, but I work around this by using lazy DFAs. In any case, that is not what I wish to pursue here. I understand the trade-offs presented within the various types within the Regex crate and I'm happy enough with the way things are.

I guess you could boil all this down to me being curious as to whether a Regex implementation that leverages code generation would provide different trade-offs that would be more desirable in certain use cases. Specifically, where Regexes may be changed at runtime but performance be of greater importance. E.g. could one compile code generated from a Regex into a dynamic library and swap them during runtime to get the best possible performance, at the cost of the logistics of changing the "built Regexes" requiring more careful orchestration than mutable memory. Is all that actually worth it?

> Have you tried building DFAs before?

Yes, I've played with (I think) all the Regex types within regex-automata. I mostly use hybrid (lazy) DFAs. I've also previously run a modified fork (not published) that provides a meta::Regex type that only leverages DFA types and/or uses literal optimisation. I never got round to benchmarking this; I did it more out of interest and exploration than for anything particularly practical. That being said, IIRC DFAs generally perform better than NFAs (don't quote me on that 😄).

> Can you say more than just "I'm confused"? You've got to help me help you. :-) If I don't know what you're confused about, it's hard to target advice to you. What have you tried? Can you show me a program that you've tried to write but doesn't work how you expect/want? Or is there a conceptual gap here? If conceptual, maybe one way to help is to suggest that a finite state machine is "just" a special kind of graph. And you can use graph traversal algorithms to explore the finite state machine.

I think essentially what you're suggesting in your answer on reddit, is that in order to traverse a DFA one would essentially brute-force the inputs by feeding every possible u8 variant to the transition function, then "generate code" as you go? In this case, I really do not know where to start, as I don't understand how you ensure that you've navigated every edge. Maybe this is a fundamental gap in my knowledge of DFAs. Just reading the [description of the trait](https://docs.rs/regex-automata/latest/regex_automata/dfa/trait.Automaton.html), there are several points that confuse me with regards to your answer and what a potential complete implementation would look like:

A DFA can have multiple start states - how would this work if my suggested approach is the right way of doing this? How would you identify multiple start states?
Computing the start state also depends on whether you’re doing a forward or a reverse search - I'm guessing this would mean one would need to generate code for a forward and reverse search? Otherwise find search functions wouldn't work as IIRC, a forward DFA can only find the end of a match, and a reverse DFA can only find the start. One must use both to find the needle.
Use of the accelerator states sounds particularly useful and yet complex - I'm struggling to wrap my head around how accelerator states work in general?

My current theory (I am yet to write a single line of code) is that you would do what I describe above and leverage the various is_X_state methods of the Automaton trait to identify the various noteworthy DFA states that would dictate code generation ends.

Then thinking more widely, you have optimisations outside of the DFAs that improve search performance based on the provided regex pattern, that bring in other crates (e.g. AhoC). Establishing how these would translate to a code gen implementation is something I'd like to explore.

Finally, as always, I appreciate your time and patience. Sometimes written comms doesn't translate properly.

BurntSushi May 9, 2024
Maintainer

> I guess you could boil all this down to me being curious as to whether a Regex implementation that leverages code generation would provide different trade-offs that would be more desirable in certain use cases. Specifically, where Regexes may be changed at runtime but performance be of greater importance. E.g. could one compile code generated from a Regex into a dynamic library and swap them during runtime to get the best possible performance, at the cost of the logistics of changing the "built Regexes" requiring more careful orchestration than mutable memory. Is all that actually worth it?

Oh I have no idea. You'd have to try it. I'm not really sure "compiler optimizations" will really matter much here. And I'm not even convinced that generated Rust code will lead to better overall performance than a table oriented DFA like the ones in this crate. And on top of that, Rust doesn't have goto, which makes at least some part of code generation potentially tricky. You'll likely need to rely on compiler optimizations to get the right Assembly codegen. But beyond that, I don't really see how generic compiler optimizations are going to help much here. And like I said, if you're really building DFAs that blow the state ID limit, then I gotta imagine that that will correspond to megabytes of source code right? Who knows what rustc is going to do with it...

You can likely do some quick experiments by comparing re2c perf with the DFAs in this crate. Or even just spend an afternoon writing the code to generate Rust code from a DFA.
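
As a starting point for that afternoon experiment, here is a rough sketch of a generator. It assumes a hypothetical `ToyDfa` table (not a regex-automata type) and emits a `loop`-free `for` + `match` function, which is one common stand-in for `goto` in Rust. Whether rustc lowers this to tight jumps is exactly the open question raised above; the sketch makes no claim about the resulting performance, and it ignores the EOI transition for brevity:

```rust
use std::fmt::Write;

/// Hypothetical toy DFA, not a regex-automata type. `edges[s]` lists the
/// `(byte, target)` transitions out of state `s`; all other bytes go to `dead`.
pub struct ToyDfa {
    pub edges: Vec<Vec<(u8, usize)>>,
    pub start: usize,
    pub dead: usize,
    pub matches: Vec<usize>,
}

/// Emits the source of a Rust function that runs `dfa` over a whole haystack.
pub fn generate(dfa: &ToyDfa, fn_name: &str) -> String {
    let mut out = String::new();
    writeln!(out, "pub fn {}(haystack: &[u8]) -> bool {{", fn_name).unwrap();
    writeln!(out, "    let mut state: usize = {};", dfa.start).unwrap();
    writeln!(out, "    for &byte in haystack {{").unwrap();
    writeln!(out, "        state = match (state, byte) {{").unwrap();
    // One match arm per explicit transition.
    for (sid, edges) in dfa.edges.iter().enumerate() {
        for &(byte, target) in edges {
            writeln!(out, "            ({}, {}) => {},", sid, byte, target).unwrap();
        }
    }
    // Everything else falls through to the dead state.
    writeln!(out, "            _ => {},", dfa.dead).unwrap();
    writeln!(out, "        }};").unwrap();
    writeln!(out, "    }}").unwrap();
    // One `state == N` test per match state; `false` if there are none.
    let accept: Vec<String> =
        dfa.matches.iter().map(|m| format!("state == {}", m)).collect();
    let accept = if accept.is_empty() {
        "false".to_string()
    } else {
        accept.join(" || ")
    };
    writeln!(out, "    {}", accept).unwrap();
    writeln!(out, "}}").unwrap();
    out
}

/// Toy DFA for an anchored, whole-haystack match of `ab+`.
pub fn ab_plus() -> ToyDfa {
    ToyDfa {
        edges: vec![vec![], vec![(b'a', 2)], vec![(b'b', 3)], vec![(b'b', 3)]],
        start: 1,
        dead: 0,
        matches: vec![3],
    }
}
```

Calling `generate(&ab_plus(), "is_match")` returns the text of a small `is_match` function. The number of emitted arms grows with the number of transitions, so a DFA near the state ID limit would indeed produce a very large source file.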

> I specifically mentioned the "too many states" problem because I thought logically it may be possible to generate code in a way that simply is not hindered by this limitation.

The size of DFAs will always be limited by available resources. There's no way around that.

The StateID limit is technically artificial, but I already covered that. If you specifically needed to avoid that, my recommendation would be to compile a thompson::NFA and then do your own DFA determinization. You could start by copying the code from this crate that does determinization and maybe using usize instead of StateID. It would be a thorny change, but doable.

> I think essentially what you're suggesting in your answer on reddit, is that in order to traverse a DFA one would essentially brute-force the inputs by feeding every possible u8 variant to the transition function, then "generate code" as you go? In this case, I really do not know where to start, as I don't understand how you ensure that you've navigated every edge.

I'll try this in pseudo code. Given a dfa that implements the Automaton trait:

1. Initialize an empty VecDeque<StateID> called queue.
2. Initialize an empty HashSet<StateID> called visited.
3. Define a function add(sid: StateID) that adds sid to the back of queue and to visited if and only if it wasn't previously in visited.
4. Initialize queue with all unique start states:
   a. for look_behind in (0..=255).map(Some).chain([None]):
      1. config := Config::new().look_behind(look_behind)
      2. add(dfa.start_state(&config.clone().anchored(Anchored::No)))
      3. add(dfa.start_state(&config.clone().anchored(Anchored::Yes)))
      4. for pid in 0..dfa.pattern_len():
         a. add(dfa.start_state(&config.clone().anchored(Anchored::Pattern(pid))))
5. while let Some(sid) = queue.pop_front():
   a. for byte in 0..=255:
      1. add(dfa.next_state(sid, byte))
   b. add(dfa.next_eoi_state(sid))
   c. VISIT (this is where you might generate Rust code for sid and where you might use the Automaton::is_XXX_state methods)

This is basically taken straight from breadth-first graph traversal. I promise you that there is nothing "regex specific" about the code above. Really, just think of a finite state machine as a graph.

There are some shortcuts one can take in the above code, but that's the bones of it. In order to generate Rust code for a DFA, you need to visit every state and all of its transitions.
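
The traversal can be sketched in plain Rust as follows. This is an illustration under stated assumptions: `ToyAutomaton` is a hypothetical, std-only stand-in whose method names echo the ones used in the pseudo code, and `AbPlus` is a made-up five-state DFA for `ab+` with a single start state. It is not the regex-automata API.

```rust
use std::collections::{HashSet, VecDeque};

type StateId = u32;

/// Hypothetical minimal interface, loosely shaped like the trait discussed here.
pub trait ToyAutomaton {
    fn start_states(&self) -> Vec<StateId>;
    fn next_state(&self, sid: StateId, byte: u8) -> StateId;
    fn next_eoi_state(&self, sid: StateId) -> StateId;
}

/// Enqueue `sid` only if it has not been seen before.
fn add(sid: StateId, queue: &mut VecDeque<StateId>, visited: &mut HashSet<StateId>) {
    if visited.insert(sid) {
        queue.push_back(sid);
    }
}

/// Breadth-first traversal; returns every reachable state in visit order.
pub fn reachable_states<A: ToyAutomaton>(dfa: &A) -> Vec<StateId> {
    let mut queue = VecDeque::new();
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    for sid in dfa.start_states() {
        add(sid, &mut queue, &mut visited);
    }
    while let Some(sid) = queue.pop_front() {
        for byte in 0..=255u8 {
            add(dfa.next_state(sid, byte), &mut queue, &mut visited);
        }
        add(dfa.next_eoi_state(sid), &mut queue, &mut visited);
        // VISIT: a code generator would emit the code for `sid` here.
        order.push(sid);
    }
    order
}

/// Toy DFA for `ab+`: 0 = dead, 1 = start, 2 = seen "a", 3 = seen "ab+", 4 = match.
pub struct AbPlus;

impl ToyAutomaton for AbPlus {
    fn start_states(&self) -> Vec<StateId> {
        vec![1]
    }
    fn next_state(&self, sid: StateId, byte: u8) -> StateId {
        match (sid, byte) {
            (1, b'a') => 2,
            (2, b'b') | (3, b'b') => 3,
            _ => 0,
        }
    }
    fn next_eoi_state(&self, sid: StateId) -> StateId {
        if sid == 3 { 4 } else { 0 }
    }
}
```

Because each state is enqueued at most once, the loop terminates after visiting every reachable state exactly once, which answers the "how do I know I've navigated every edge" question: all 257 outgoing edges of each visited state are followed.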

> A DFA can have multiple start states - how would this work if my suggested approach is the right way of doing this? How would you identify multiple start states?

How much of the docs have you read? I can't quite tell, but you should read Automaton::start_state. And then read the docs for regex_automata::util::start::Config.

Since a DFA can have multiple start states, the actual generated DFA in Rust code will also have multiple start states. So you'll need a preamble that computes the start state based on the look-behind byte and the anchored configuration. You could start simple by limiting yourself to only anchored searches and no support for look-around. Then there's only one start state. See Automaton::universal_start_state.
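
To make the "preamble" idea concrete, here is a small sketch of what generated code might contain for a pattern like `(?m)^a`. Everything in it is an assumption for illustration: the state numbers are hypothetical values a generator could have assigned, and the two-way split on "at line start or not" is a simplification of how look-behind is handled in the real crate.

```rust
// Hypothetical state numbers a generator might have assigned for `(?m)^a`.
pub const DEAD: usize = 0;
pub const ANCHORED_LINE_START: usize = 1;
pub const UNANCHORED_LINE_START: usize = 2;
pub const UNANCHORED_MID_LINE: usize = 3;

/// Picks the start state from the search configuration.
/// `look_behind` is the byte just before the search position, if any.
pub fn start_state(anchored: bool, look_behind: Option<u8>) -> usize {
    // `^` in multi-line mode holds at the start of the haystack or after "\n".
    let at_line_start = matches!(look_behind, None | Some(b'\n'));
    match (anchored, at_line_start) {
        (true, true) => ANCHORED_LINE_START,
        // An anchored search that begins mid-line can never satisfy `^`.
        (true, false) => DEAD,
        (false, true) => UNANCHORED_LINE_START,
        (false, false) => UNANCHORED_MID_LINE,
    }
}
```

Restricting the generator to anchored searches with no look-around, as suggested above, collapses this whole function into a single constant.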

> Computing the start state also depends on whether you’re doing a forward or a reverse search - I'm guessing this would mean one would need to generate code for a forward and reverse search? Otherwise find search functions wouldn't work as IIRC, a forward DFA can only find the end of a match, and a reverse DFA can only find the start. One must use both to find the needle.

That's correct. The forward search would need to move forward through the haystack, whereas the reverse search would need to move backwards.

But focus on just the forward search first. That will at least tell you the end of the match. Once you have forward search working, it shouldn't be too hard to essentially copy what you've done for the reverse case.
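
A toy illustration of the two passes, assuming hand-written state machines for `ab+` (forward, unanchored) and its reversal `b+a` (reverse, anchored at the end offset). The enums and function names are hypothetical, and the match handling is simplified compared to what the crate does:

```rust
#[derive(Clone, Copy)]
enum Fwd { Scan, SeenA, Match, Dead }

#[derive(Clone, Copy)]
enum Rev { Start, SeenB, Match, Dead }

fn fwd_next(state: Fwd, byte: u8) -> Fwd {
    match (state, byte) {
        (Fwd::Scan, b'a') | (Fwd::SeenA, b'a') => Fwd::SeenA,
        (Fwd::SeenA, b'b') | (Fwd::Match, b'b') => Fwd::Match,
        (Fwd::Scan, _) | (Fwd::SeenA, _) => Fwd::Scan,
        (Fwd::Match, _) | (Fwd::Dead, _) => Fwd::Dead,
    }
}

fn rev_next(state: Rev, byte: u8) -> Rev {
    match (state, byte) {
        (Rev::Start, b'b') | (Rev::SeenB, b'b') => Rev::SeenB,
        (Rev::SeenB, b'a') => Rev::Match,
        _ => Rev::Dead,
    }
}

/// Forward pass: walks left to right and reports only where the match ends.
fn find_end(haystack: &[u8]) -> Option<usize> {
    let mut state = Fwd::Scan;
    let mut end = None;
    for (i, &byte) in haystack.iter().enumerate() {
        state = fwd_next(state, byte);
        match state {
            Fwd::Match => end = Some(i + 1),
            Fwd::Dead => break,
            _ => {}
        }
    }
    end
}

/// Reverse pass: walks right to left from `end` and reports the match start.
fn find_start(haystack: &[u8], end: usize) -> Option<usize> {
    let mut state = Rev::Start;
    let mut start = None;
    for i in (0..end).rev() {
        state = rev_next(state, haystack[i]);
        match state {
            Rev::Match => start = Some(i),
            Rev::Dead => break,
            _ => {}
        }
    }
    start
}

/// Combines both passes into a `(start, end)` span.
pub fn find(haystack: &[u8]) -> Option<(usize, usize)> {
    let end = find_end(haystack)?;
    let start = find_start(haystack, end)?;
    Some((start, end))
}
```

The reverse machine only runs once the forward machine has found an end offset, so getting `find_end` right first is a sensible order of work.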

> Use of the accelerator states sounds particularly useful and yet complex - I'm struggling to wrap my head around how accelerator states work in general?

I would skip these on an initial implementation. It is always correct to do so.

The main idea is that some states consist mostly of transitions that just loop back to the state itself. For example, (?-u:[^a]+a). When only one, two or three bytes lead to some other state, we can "accelerate" matching when we enter that state by running memchr on those bytes. Because we know that we'll never leave that state unless we see one of those bytes. So there's no point in doing byte-at-a-time DFA traversal over and over again. By contrast, memchr uses SIMD to be quite fast.
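
A small sketch of the idea for the anchored pattern `(?-u:[^a]+a)`. Both functions are hypothetical illustrations; to stay std-only, `Iterator::position` stands in for the `memchr` call, so this shows the shape of the optimisation and not its speed.

```rust
/// Byte-at-a-time traversal: state 0 is the start, state 1 loops on every
/// byte except `a`. Returns the end offset of the match, if any.
pub fn find_end_plain(haystack: &[u8]) -> Option<usize> {
    let mut state = 0;
    for (i, &byte) in haystack.iter().enumerate() {
        state = match (state, byte) {
            (0, b'a') => return None,        // `[^a]+` needs at least one non-`a`
            (0, _) => 1,
            (1, b'a') => return Some(i + 1), // the only byte that leaves state 1
            (1, _) => 1,                     // self-loop
            _ => unreachable!(),
        };
    }
    None
}

/// "Accelerated" traversal: once in the looping state, skip straight to the
/// next `a` instead of stepping through the self-loop one byte at a time.
pub fn find_end_accel(haystack: &[u8]) -> Option<usize> {
    match haystack.first() {
        None | Some(&b'a') => return None,
        Some(_) => {}
    }
    // Stand-in for memchr(b'a', &haystack[1..]).
    let offset = haystack[1..].iter().position(|&b| b == b'a')?;
    Some(1 + offset + 1)
}
```

Both functions return the same answer for every input; the accelerated one simply replaces the self-loop with a single search for the exit byte.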

> (I am yet to write a single line of code)

This should be your next step. You've thought a lot about this and gotten advice from me. It's time to put pencil to paper. :-)

> My current theory (I am yet to write a single line of code) is that you would do what I describe above and leverage the various is_X_state methods of the Automaton trait to identify the various noteworthy DFA states that would dictate code generation ends.

Sounds right I think.

> Then thinking more widely, you have optimisations outside of the DFAs that improve search performance based on the provided regex pattern, that bring in other crates (e.g. AhoC). Establishing how these would translate to a code gen implementation is something I'd like to explore.

It's one thing to do codegen for a DFA. But to do codegen for everything else regex-automata does is a mammoth undertaking. If you want to go that route, I suggest one piece at a time.

> Finally, as always, I appreciate your time and patience. Sometimes written comms doesn't translate properly.

Np :)
