[Feature Request] Unicode support #24

Open
choeger opened this issue Oct 15, 2014 · 7 comments

@choeger

choeger commented Oct 15, 2014

At a glance, this whole library seems like a very well-thought-out piece of software (limited scope, well-defined solution). Unfortunately, it does not support Unicode right now, and Unicode should be the standard in this millennium. So here is my proposal: instead of using chars and strings exclusively, abstract the library over the concrete code-point and input representations. Then someone (me) could simply extend the library with suitable Unicode support. I understand that this kind of abstraction might cause some performance regressions, but it would open up a whole batch of new use cases.

@c-cube
Contributor

c-cube commented Dec 2, 2014

Could D. Bunzli's Uutf be used to iterate over Unicode characters? That might also help to parametrize over the input stream (string, bigarray, stream of strings, etc.) for #20 ...

@vouillon
Member

vouillon commented Dec 2, 2014

The main obstacle to making the implementation generic is that it is table-based. This works well when there are only 256 possible characters, but does not scale to the more than one million Unicode code points...

One thing that should work is to translate regular expressions defined in terms of Unicode code points into regular expressions defined in terms of bytes, and then match UTF-8 strings byte by byte.
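
A minimal sketch of that translation, to make the idea concrete. Everything here is an assumption for illustration: the `cp_re` and `byte_re` types, `utf8_encode`, `to_bytes`, and the small backtracking `matches` function are hypothetical and are not part of this library's API. It only handles single code points; translating code-point ranges or character classes would additionally require splitting each range into byte-sequence ranges, which is not shown.

```ocaml
(* Hypothetical regex AST over Unicode code points. *)
type cp_re =
  | Cp of int                 (* a single code point *)
  | Cp_seq of cp_re list
  | Cp_alt of cp_re list
  | Cp_star of cp_re

(* Hypothetical regex AST over bytes: only 256 possible symbols. *)
type byte_re =
  | Byte of char
  | Seq of byte_re list
  | Alt of byte_re list
  | Star of byte_re

(* Encode one Unicode scalar value as its UTF-8 byte sequence. *)
let utf8_encode (cp : int) : char list =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_encode: not a Unicode scalar value"
  else if cp < 0x80 then [ Char.chr cp ]
  else if cp < 0x800 then
    [ Char.chr (0xC0 lor (cp lsr 6));
      Char.chr (0x80 lor (cp land 0x3F)) ]
  else if cp < 0x10000 then
    [ Char.chr (0xE0 lor (cp lsr 12));
      Char.chr (0x80 lor ((cp lsr 6) land 0x3F));
      Char.chr (0x80 lor (cp land 0x3F)) ]
  else
    [ Char.chr (0xF0 lor (cp lsr 18));
      Char.chr (0x80 lor ((cp lsr 12) land 0x3F));
      Char.chr (0x80 lor ((cp lsr 6) land 0x3F));
      Char.chr (0x80 lor (cp land 0x3F)) ]

(* Rewrite a code-point regex as a byte regex: each code point becomes
   the sequence of its UTF-8 bytes; the other operators map one to one. *)
let rec to_bytes : cp_re -> byte_re = function
  | Cp c -> Seq (List.map (fun b -> Byte b) (utf8_encode c))
  | Cp_seq rs -> Seq (List.map to_bytes rs)
  | Cp_alt rs -> Alt (List.map to_bytes rs)
  | Cp_star r -> Star (to_bytes r)

(* Naive backtracking matcher, only here to exercise the translation.
   [k] is the continuation to run on the position reached so far. *)
let rec m re s i k =
  match re with
  | Byte c -> i < String.length s && s.[i] = c && k (i + 1)
  | Seq [] -> k i
  | Seq (r :: rs) -> m r s i (fun j -> m (Seq rs) s j k)
  | Alt rs -> List.exists (fun r -> m r s i k) rs
  | Star r -> k i || m r s i (fun j -> j > i && m (Star r) s j k)

(* Whole-string match of a byte regex against a UTF-8 encoded string. *)
let matches re s = m re s 0 (fun j -> j = String.length s)
```

With this, `to_bytes (Cp_alt [Cp 0xE9; Cp 0x20AC])` yields an alternation between the two-byte sequence for U+00E9 and the three-byte sequence for U+20AC, so the matcher itself never needs to know about anything beyond bytes.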

@zoggy

zoggy commented Jan 12, 2016

Any hope of having Unicode supported soon?

@Drup
Collaborator

Drup commented Jan 12, 2016

I don't think @nojb or anyone else is working on it right now, but it could change if someone was motivated. ;)

@XVilka

XVilka commented Feb 5, 2018

Surprising that it still hasn't been implemented.

@c-cube
Contributor

c-cube commented Feb 5, 2018

Someone needs to do it, and it's hard™ 🙂

@nojb
Contributor

nojb commented Feb 5, 2018

As far as I understand from the discussion in #48, the implementation there is viable and could be used as a basis for further work. I can rebase that PR against the current master, but unfortunately I am rather overloaded at the moment so cannot commit to doing the "further work" that may be necessary to get it integrated.
