stdlib: add support regexp match and replace for strings #107

Open
mikedanese opened this issue Feb 18, 2016 · 22 comments · May be fixed by #665

Comments

@mikedanese
Contributor

Would be easy to implement as a builtin.

https://github.com/google/re2

@sparkprime
Member

Hard in Jsonnet, easy as a builtin, but the trick is to make sure every language that someone might want to implement Jsonnet in (and therefore have to provide an implementation for each of the builtins) has a native regular expression library with exactly the same regex syntax and semantics.

@mikedanese
Contributor Author

PCRE seems to be implemented in many languages. Perhaps it would be better to implement this as a native extension (#108) to the language and not part of the core.

@sparkprime
Member

Does it support unicode typically?

@nand0p

nand0p commented May 20, 2016

+1

@benley
Contributor

benley commented May 20, 2016

It appears that PCRE does typically support unicode: http://man7.org/linux/man-pages/man3/pcreunicode.3.html

@sparkprime
Member

Would be great if the 3 of you could offer some real use cases for this functionality so I can figure out how to prioritize it.

@nand0p

nand0p commented May 20, 2016

For our use case, we need to strip all non-alphanumeric chars from a string variable. I think this can be done currently by splitting the string into an array of chars, checking each char, and then rejoining, but that seems very ugly.

This might be done more easily with a new function like std.toAlpha(x), but I would think full-on regex capabilities would be a more complete solution.

@sparkprime
Member

It's not too bad but I can see why you'd rather write it with 0-9A-Za-z type ranges and on one line.

local is_alpha(x) =
    std.setMember(x,"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
local to_alpha(str) = std.join("", std.filter(is_alpha, std.stringChars(str)));
to_alpha("He.llo World123")
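
For comparison, here is a minimal sketch of the one-line, range-based version nand0p has in mind, written in Go against the standard regexp package (RE2 syntax). Jsonnet has no regex builtin at this point in the thread, so the helper name toAlpha is hypothetical and only mirrors to_alpha above.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonAlnum matches any run of characters outside 0-9, A-Z and a-z.
var nonAlnum = regexp.MustCompile(`[^0-9A-Za-z]+`)

// toAlpha is a hypothetical helper mirroring the Jsonnet to_alpha above:
// it deletes every non-alphanumeric character from s.
func toAlpha(s string) string {
	return nonAlnum.ReplaceAllString(s, "")
}

func main() {
	fmt.Println(toAlpha("He.llo World123")) // HelloWorld123
}
```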

@emmanuel

I have a case where I'd like to be able to replace all instances of - with _. That seems even more cumbersome than the filter-to-alphanumeric use-case discussed above. Regex is overkill for my use-case, but either regex-based or tr-style replace functionality would be helpful.

On a nearer-term note, can the stdlib/built-ins be composed to produce a 'replace each instance of character x with an instance of character y' behavior? I'm not coming up with it, though I'm quite new to Jsonnet.

@sbarzowski
Collaborator

sbarzowski commented Dec 13, 2017

Here's an example implementation of the proposed 'replace each instance of character x with an instance of character y':

local replaceChars(str, mapping) =
    std.join("", std.map(function(c) if c in mapping then mapping[c] else c, std.stringChars(str)));
replaceChars("abcd", {"b": "!", "d": "?"})

produces:
"a!c?"

The implementation above also supports deleting characters or replacing them with multiple characters:
replaceChars("abcd", {"b": "xx", "d": ""})
produces:
"axxc"

It may be useful generally enough to add it to stdlib. @sparkprime what do you think?
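
If the same tr-like behavior were supplied from a Go host (for example as a native extension) rather than written in Jsonnet, it could be sketched roughly as below. The function name replaceChars and its map-based signature are assumptions copied from the Jsonnet example, not an existing API, and the mapping keys are assumed to be single characters.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// replaceChars is a hypothetical Go counterpart of the Jsonnet replaceChars:
// every occurrence of a key in mapping is replaced by its value.
// Keys are assumed to be single, non-empty characters.
func replaceChars(s string, mapping map[string]string) string {
	keys := make([]string, 0, len(mapping))
	for k := range mapping {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order so the replacer is built deterministically

	pairs := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		pairs = append(pairs, k, mapping[k])
	}
	return strings.NewReplacer(pairs...).Replace(s)
}

func main() {
	fmt.Println(replaceChars("abcd", map[string]string{"b": "!", "d": "?"}))  // a!c?
	fmt.Println(replaceChars("abcd", map[string]string{"b": "xx", "d": ""})) // axxc
	fmt.Println(replaceChars("my-field-name", map[string]string{"-": "_"}))  // my_field_name
}
```

The last call covers the `-` to `_` case raised earlier in the thread.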

@emmanuel

@sbarzowski thank you, that is fantastic. Not only a clean interface to accomplish what I'm looking to get done, but also a good bit of insight about how to approach programming jsonnet. I'd be in favor of adding this to stdlib, but I'm not on the hook for maintenance, so perhaps merely adding to the documentation would be sufficient to help future seekers like myself.

Whatever the decision about adding to stdlib or docs, thank you for the help @sbarzowski.

@sparkprime
Member

tr-like functionality is definitely a good candidate for stdlib.

@sparkprime
Member

Since this has come up again - do we have compatible implementations of PCRE in Go and C++ that will work with unicode?

@sbarzowski
Collaborator

Well, there is this thing: https://github.com/glenn-brown/golang-pkg-pcre. This is an interface to libpcre. It seems to hardcode assumptions about where libpcre is installed, though... I couldn't find anything else. Probably using libpcre directly with cgo would be a better option.

@kamalmarhubi

My guess is that that package defeats part of the purpose of go-jsonnet, which is to allow go programs to use jsonnet without cgo. Could be wrong though. :-)

@sparkprime
Member

Yeah I think unless we can find a library that has native Go and C++ support (for exactly the same regex syntax) we'll have to leave regexes as something that people add with native extensions.

@dcoles
Contributor

dcoles commented May 25, 2019

Coming back full circle, would RE2 along with Go's built-in regexp package not be a good fit? They're Unicode-aware and syntax-compatible.

From Go's regexp package documentation:

"The syntax of the regular expressions accepted is the same general syntax used by Perl, Python, and other languages. More precisely, it is the syntax accepted by RE2 and described at https://golang.org/s/re2syntax, except for \C."

(The re2syntax link goes to the actual RE2 documentation)
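
A small sketch of the Unicode point, using only Go's standard regexp package: `\p{L}` is the RE2 class for Unicode letters, and `.` consumes a whole code point rather than a single byte. The variable names are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// letters matches strings made up only of Unicode letters.
var letters = regexp.MustCompile(`^\p{L}+$`)

// oneChar matches exactly one code point, however many bytes it occupies.
var oneChar = regexp.MustCompile(`^.$`)

func main() {
	fmt.Println(letters.MatchString("日本語"))    // true: all three runes are letters
	fmt.Println(letters.MatchString("hello1")) // false: '1' is not a letter
	fmt.Println(oneChar.MatchString("語"))      // true: one rune, three bytes in UTF-8
}
```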

@sparkprime
Member

In that case I guess RE2 is the way forward after all :)

@dcoles
Contributor

dcoles commented May 29, 2019

I'm currently prototyping RE2 regexp support in my master...dcoles:re2 branch.

Boolean matches can be implemented pretty trivially, but positional and named captures are going to require a bit more thought. The current plan is to have a match return an object upon successful match or null otherwise. For example:

$ jsonnet -e 'std.regexFullMatch("hello", "h(?P<mid>.*)o")'
{
   "captures": [
      "ell"
   ],
   "namedCaptures": {
      "mid": "ell"
   },
   "string": "hello"
}

This way you can still do things like assert std.regexFullMatch(self.foo, "pattern") != null for validation or use the object fields for accessing captured values.
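
The match object described above could be produced on the Go side roughly as follows. This is only a sketch under assumptions, not the code from the prototype branch: the Match type, the Go function name regexFullMatch, and the (string, pattern) argument order are taken from the example call in this comment, and full-match semantics are approximated by wrapping the pattern in `\A(?:...)\z`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Match mirrors the object shape from the comment above (hypothetical type).
type Match struct {
	Captures      []string          `json:"captures"`
	NamedCaptures map[string]string `json:"namedCaptures"`
	String        string            `json:"string"`
}

// regexFullMatch returns a *Match when pattern matches all of s,
// nil when it does not, and an error when pattern is not valid RE2 syntax.
func regexFullMatch(s, pattern string) (*Match, error) {
	// Anchor at both ends; the non-capturing group keeps group numbering intact.
	re, err := regexp.Compile(`\A(?:` + pattern + `)\z`)
	if err != nil {
		return nil, err
	}
	groups := re.FindStringSubmatch(s)
	if groups == nil {
		return nil, nil // no match: the Jsonnet side would see null
	}
	m := &Match{
		Captures:      groups[1:], // positional captures, without the whole match
		NamedCaptures: map[string]string{},
		String:        groups[0],
	}
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" {
			m.NamedCaptures[name] = groups[i]
		}
	}
	return m, nil
}

func main() {
	m, err := regexFullMatch("hello", `h(?P<mid>.*)o`)
	if err != nil {
		panic(err)
	}
	out, _ := json.MarshalIndent(m, "", "   ")
	fmt.Println(string(out))
}
```

Returning nil for a non-match keeps the `!= null` validation idiom cheap, while capture values stay available as ordinary fields.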

@dcoles dcoles linked a pull request Jun 2, 2019 that will close this issue
@glenntrewitt

glenntrewitt commented Jul 3, 2021

I see the PR, which I eagerly anticipate, but just to summarize the points and questions about RE2:

  • RE2 has native implementations in both C++ and Go.
  • RE2 supports Unicode.
  • There are wrappers for most languages. See the bottom of the README.

@Duologic
Contributor

It's nice to see this move along; the discussion on the PR is promising.

I have a use case involving JSON schema: I'm building a validator in jsonnet, and it turns out JSON schema has a few features that use regular expressions. I don't know much about the different implementations of regex in the wild; the schema spec depends on the ECMA 262 implementation.

I think it would be safe to provide one native implementation in stdlib, and if users need a different one for their use case they can leverage the native functions feature (or, if they feel adventurous, they can implement one in jsonnet).

@Duologic
Contributor

Just had a quick look in other projects as I was curious:

Kubernetes uses regexp to validate the JSON schema pattern attribute (link).

ogen has an interface that falls back from regexp to dlclark/regexp2 when a pattern doesn't compile with regexp (link). This was introduced to work around features missing from re2 that PCRE/ECMA-262 supposedly support (re2 support table and re2 caveats). This fallback library might be interesting for go-jsonnet.
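
To make the gap concrete, the sketch below checks whether Go's standard regexp package accepts a pattern. Lookahead and backreferences are valid in ECMA 262 but are rejected by RE2-style engines. The helper name compilesAsRE2 is hypothetical, and the fallback engine itself is not shown; a caller could use a check like this to decide when a fallback is needed.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilesAsRE2 reports whether Go's regexp package accepts the pattern.
func compilesAsRE2(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`^[a-z]+$`,     // plain character class: accepted
		`^(?=.*\d).+$`, // lookahead: rejected
		`(a)\1`,        // backreference: rejected
	}
	for _, p := range patterns {
		fmt.Printf("%q accepted by regexp: %v\n", p, compilesAsRE2(p))
	}
}
```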
