Generating Rust code from regex-automata
types
#1192
Replies: 1 comment · 4 replies
-
I think I already responded to this on reddit right? I said:
Is there a question you had about this response? What kind of patterns are you trying to compile? Are you actually able to compile a pattern that is under but near the limit? |
Beta Was this translation helpful? Give feedback.
All reactions
-
Ah yes. Basically, I'm more looking for guidance on how to approach this, and space to debate how code generation may differ in it's overall approach from the runtime types you've baked into this crate. I feel like generated code that forms a DFA could be infinitely large (within reason) and may not have to rely on transition tables depending on how it was implemented. Therefore, state IDs may not be required at all. I'm still a little confused on how to use the trait exactly too. |
Beta Was this translation helpful? Give feedback.
All reactions
-
I'm sorry, I really just don't understand what you're asking here. Like, what is the debate you want to have? I just don't get it. I don't understand the parameters. I'm happy to engage to an extent, I just don't know where to start. My mind works better with concrete examples. I don't really deal with things like "support infinitely large DFAs." Because that isn't a concrete problem. It's a solution to a problem I don't understand. You're right that state IDs might not be required at all. But a state ID is just one implementation strategy. It's an implementation of a pointer. I could have used pointers instead of IDs, for example, and there wouldn't be what you might call an artificial limitation on the state ID. The implementation would just let you use as many pointers as your underlying system was willing to let you allocate. But you can't look at this in a vacuum. If you built a DFA that would otherwise blow the state ID limit, then how much Rust code would that be? Could Have you tried building DFAs before? Try building some and see how long it takes. Here, I'll get you started:
A simple regex consisting of 1000 word (Unicode-aware) word characters takes over 2 minutes to build and ~337MB of heap.
OK... I also said this on reddit:
Can you say more than just "I'm confused"? You've got to help me help you. :-) If I don't know what you're confused about, it's hard to target advice to you. What have you tried? Can you show me a program that you've tried to write but doesn't work how you expect/want? Or is there a conceptual gap here? If conceptual, maybe one way to help is to suggest that a finite state machine is "just" a special kind of graph. And you can use graph traversal algorithms to explore the finite state machine. |
Beta Was this translation helpful? Give feedback.
All reactions
-
Before anything else, I just want to be clear that I appreciate your time on this. I also appreciate that patience can likely ware thin when things are not explained in a way you're expecting, and by the sounds of it we're on two very different wave lengths with regards to this (i.e. you know a lot about Regex, Rust, and this crate and I know comparatively very little). I'm coming from an almost zero-knowledge perspective with regards to Regex theory. Ergo whatever my contribution to this will be, it'll likely not be backed by the same knowledge of this domain as you. However, I was hoping that you'd be open to guiding me in as specific or general direction as you'd be willing. So please, feel free to not spend time on this at all. I mean no offence! I really don't want to put you out in any way. To reiterate, I really do not expect you to have or feel obliged to provide all the answers. I merely thought you'd be ideally placed to provide valuable insight and direction given how much work you've done in this area.
Essentially, I'm interested in understanding the practicalities of generating Rust code that performs the various search functions and provides the various match types, supported by this crate. In a very similar way to re2c but with broader Regex support. As this would be a very different implementation to that which you've implemented in this crate, I'd like to understand what other limitations may come into play, and what considerations might need to be taken around other search related optimisations. Also, I believe this may introduce benefits that aren't currently possible in the Regex crate (the obvious one being compiler optimisations of static code). In any case, I'm not 100% about any of this and I don't know if it's worthy endeavour. I specifically mentioned the "too many states" problem because I thought logically it may be possible to generate code in a way that simply is not hindered by this limitation. I have use cases that hit this limit for DFAs such that they will not compile, but I work around this by using lazy DFAs. In any case, that is not what I wish to pursue here. I understand the trade-offs presented within the various types within the Regex crate and I'm happy enough with the way things are. I guess you could boil all this down to me being curious as to whether a Regex implementation that leverages code generation would provide different trade-offs that would be more desirable in certain use cases. Specifically, where Regexes may be changed at runtime but performance be of greater importance. E.g. could one compile code generated from a Regex into a dynamic library and swap them during runtime to get the best possible performance, at the cost of the logistics of changing the "built Regexes" requiring more careful orchestration than mutable memory. Is all that actually worth it?
Yes, I've played with (I think) all the Regex types within regex-automata. I mostly use hybrid (lazy) DFAs. I've also previously ran a modified fork (not published) that provides a
I think essentially what you're suggesting in your answer on reddit, is that in order to traverse a DFA one would essentially brute-force the inputs by feeding every possible u8 variant to the transition function, then "generate code" as you go? In this case, I really do not know where to start, as I don't understand how you ensure that you've navigated every edge. Maybe this is a fundamental gap in my knowledge of DFAs. Just reading the (description of the trait)[https://docs.rs/regex-automata/latest/regex_automata/dfa/trait.Automaton.html], there are several points that confuse me with regards to your answer and what a potential complete implementation would look like:
My current theory (I am yet to write a single line of code) is that you would do what I describe above and leverage the various Then thinking more widely, you have optimisations outside of the DFAs that improve search performance based on the provided regex pattern, that bring in other crates (e.g. AhoC). Establishing how these would translate to a code gen implementation is something I'd like to explore. Finally, as always, I appreciate your time and patience. Sometimes written comms doesn't translate properly. |
Beta Was this translation helpful? Give feedback.
All reactions
-
Oh I have no idea. You'd have to try it. I'm not really sure "compiler optimizations" will really matter much here. And I'm not even convinced that generated Rust code will lead to better overall performance than a table oriented DFA like the ones in this crate. And on top of that, Rust doesn't have You can likely do some quick experiments by comparing
The size of DFAs will always be limited by available resources. There's no way around that. The
I'll try this in pseudo code. Given a
This is basically taken straight from breadth-first graph traversal. I promise you that there is nothing "regex specific" about the code above. Really, just think of a finite state machine as a graph. There are some shortcuts one can take in the above code, but that's the bones of it. In order to generate Rust code for a DFA, you need to visit every state and all of its transitions.
How much docs have you read? I can't quite tell, but you should read Since a DFA can have multiple start states, the actual generated DFA in Rust code will also have multiple start states. So you'll need a preamble that computes the start state based on the look-behind byte and the anchored configuration. You could start simple by limiting yourself to only anchored searches and no support for look-around. Then there's only one start state. See
That's correct. The forward search would need to move forward through the haystack where as the reverse search would need to move backwards. But focus on just the forward search first. That will at least tell you the end of the match. Once you have forward search working, it shouldn't be too hard to essentially copy what you've done for the reverse case.
I would skip these on an initial implementation. It is always correct to do so. The main idea is that some states mostly consist of transitions that just loop back to itself. For example,
This should be your next step. You've thought a lot about this and gotten advice from me. It's time to put pencil to paper. :-)
Sounds right I think.
It's one thing to do codegen for a DFA. But to do codegen for everything else
Np :) |
Beta Was this translation helpful? Give feedback.
-
Hello @BurntSushi,
On Reddit I asked about using the Automaton trait to generate Rust code. Specifically to build a Regex search implementation that leverages generated code that can be statically compiled into a binary.
I understand there's a lot of other optimisation in play within the Regex crate, which I'd be keen to explore as well. E.g. literal optimisations.
I'd like to debate an implementation. In addition, I'd be keen to understand code generation for extremely large and complex patterns that currently cannot be compiled into a DFA due to an excessive number of states.
Given this, I don't know if it would be possible to explore other types within this crate for use in code generation.
Beta Was this translation helpful? Give feedback.
All reactions