Directional confusability: _א1 and _1א not detected (unicode 15.1's new TR55, and updated TR39) #12929
Comments
Hi @mrluc, thank you for a great initial description. My initial notes:
Feel free to add/remove/disagree. Thank you! |
Very cool. I'd love to contribute when I can circle back to this, maybe via sketching out a reference implementation PR. As an aside -- the project/product I'm doing right now is an absolute joy, thanks in large part to the many person-years of ❤️ that the Elixir team has put into things. LiveView, Nx, Vix, eVision, and Image are the gifts I'm most thankful for this morning. Re: notes above:
^^^ super 👍 atm
Wonderful, yeah that would likely make things much simpler.
^^^ yes, they only say it in that context I believe -- in the context of identifiers using 'chunked mixed script detection'. Planning to follow up in Oct-Nov if nobody dives in before then |
Hi @mrluc, a quick ping in case you still plan to tackle this. :) |
Yargh -- my original availability for this got blown up 🤦♂️; I’d still love to help but I must confess that I haven’t had the time to think through an impl yet. The next likely availability for me is mid-to-late April as far as I know now; I might get a chance to look in March but that’s less sure. If someone wants to start, of course that’d be sweet and I can try to support.
…On Thu, Feb 15, 2024 at 12:17 PM José Valim ***@***.***> wrote:
Hi @mrluc <https://github.com/mrluc>, a quick ping in case you still plan
to tackle this. :)
|
I'm willing to take a stab at this if there are no other takers. |
Perhaps @mrluc is referring to section 5.1.6 in TR55?
|
Example from the updated/now-way-more-complex UTS 39: "LTR, and RTL, and FS confusability should be used when it is inappropriate to enforce that strings be single-script, or at least single-directionality; this is the case in programming language identifiers. See Section 5.1, Confusability Mitigation Diagnostics, in Unicode Technical Standard #55, Unicode Source Code Handling [UTS55]." -- from uts39/tr39 |
Fixing this issue, as far as I remember, would mostly consist of implementing the new confusability skeleton algorithm, "bidiSkeleton", from the updated UTS 39. It doesn't look as trivial any more, haha, and they call out that it's costlier to run, but there are some points that make me hopeful for an implementation:
That would catch the example mentioned above. But then, separately, there's the question of changing what script mixing Elixir allows in identifiers, if we want to follow the new suggestions that imply loosening the Restriction Level, which would allow eg.
Currently, we only allow 'Highly Restrictive' script mixing in identifiers, which basically allows only combinations of Latin script + a few specific Asian scripts. That's why, in the example for this issue, I could only use non-Hebrew characters that are shared by all scriptsets (
The new versions of UTS 39 and 55 express the strong opinion that it is inappropriate to limit programming language identifiers to a single script or a single direction, which equates to a strong claim that 'no programming language should be Highly Restrictive'; an example from UTS 39:
They're careful to say, in UTS 55 when defining chunk confusability in identifiers, that the guidance is for "if you're looser than Highly Restrictive, and also X and Y are true ..." -- but on the other hand, as you see above, they word things strongly. The message comes across as "you're restricting people from doing things that aren't actually dangerous, just because they're from Russia, India, or any other region whose script mixing isn't blessed by Highly Restrictive!" 😆 The horror.
They're right about that of course; the question is how to do it. I think we landed on Highly Restrictive because it was a clearly-defined 'safe' restriction level to start our compliance from, which at the same time didn't impose any additional effort on maybe 4-5B people (Latin script users + specified Asian script users) -- others can still write identifiers in their languages, but can't mix them with Latin letters. As they point out, it's better to come up with a better definition of "what's confusing?" than to use the hammer of disallowing script mixing for some scriptsets and allowing it for others.
So, what's better than what we've got now? This new stuff:
It's cool: do the same check, but for "chunks" within an identifier instead of for the whole identifier. They introduce the concept of 'identifier chunks', and mention that there are 'chunk' separators within identifiers that we humans can parse and that a machine can recognize. So there are identifiers that as a whole string are 'Mixed Script' (the accumulated intersection of scriptsets is empty if accumulated over every character in the string), but that are not actually dangerous from a script mixing perspective, because each separate chunk does not mix scripts within itself! To modify one of their examples,
Wrap-up
As I understand it, when we fix the issue that originated this thread, probably by supporting bidiSkeleton, we would be compliant with the new version of UTS 39 again. We'd still be using Highly Restrictive! Which is 😱 now, maybe. So then, if we wanted to move from Highly Restrictive to relying on chunked identifier confusability, that'd mean implementing:
And that may be doable, literally, by just resetting the scriptset every time we encounter an underscore when parsing an identifier! 😆 That'd be amazing, but I doubt it will be that simple. As a side note -- I vastly prefer josevalim's recommendation of underscores as separators for this, for a reason I didn't see mentioned as a risk in the UTS:
cc @josevalim @kipcole9 -- also kipcole ty so much for your work on so many useful libraries 🥲 |
I am still of the opinion that we should be conservative, because it is clear the best practices are still evolving and we don’t want to end up in a place where we have to revert recommendations. To recap:
and perhaps use the same for bidi, but that can come later. Anyone against? |
Elixir and Erlang/OTP versions
otp 25, elixir 14
Operating system
osx
Current behavior
An issue to kick off convo -- I found a few implications for us upon reading Unicode 15.1's new technical report on source code, TR55, and the updated TR39.
As the title says -- in Visual Studio Code and Emacs, I see the same thing I see in the issue title: what look like identical tokens. I can't paste this code into a selectable code sample here, because this editor does wacky things to it! But I'm pasting an image below.
Compiles and runs fine despite the tokens looking confusable.
Expected behavior
Should fail or warn as the variable names are visually confusable -- only when selecting text/moving cursor do you see the order change. I'll be curious to see if this survives in the issue title or does anything weird there, actually!
We would need to implement the brand new definition of Confusability -- LTR confusability in this case -- to catch it.
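To make the problem concrete, here is a minimal Python sketch of why the two identifiers from the title display identically. It is not the bidiSkeleton algorithm from UTS 39 and not a full implementation of the Unicode Bidirectional Algorithm (UAX #9). It assumes a left-to-right paragraph and only handles the bidi classes L, R, EN, ON and WS, applying simplified versions of rules W7, N1/N2, I1 and L2. The function name `visual_order` is hypothetical.

```python
import unicodedata

SUPPORTED = {"L", "R", "EN", "ON", "WS"}


def visual_order(text):
    """Return the characters of `text` in left-to-right display order.

    ASSUMPTION: LTR paragraph, no explicit embeddings, and only the bidi
    classes in SUPPORTED; anything else raises ValueError.
    """
    types = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls not in SUPPORTED:
            raise ValueError(f"unsupported bidi class {cls!r} for {ch!r}")
        types.append("N" if cls in ("ON", "WS") else cls)

    # W7: a European number whose preceding strong type is L becomes L.
    last_strong = "L"  # start-of-text direction in an LTR paragraph
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # N1/N2: a run of neutrals takes the direction of its neighbours if
    # they agree (EN counts as R here), else the paragraph direction (L).
    n = len(types)
    i = 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = "L" if i == 0 or types[i - 1] == "L" else "R"
        after = "L" if j == n or types[j] == "L" else "R"
        resolved = before if before == after else "L"
        types[i:j] = [resolved] * (j - i)
        i = j

    # I1: implicit embedding levels at base level 0.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]

    # L2: from the highest level down to 1, reverse each maximal run of
    # characters whose level is at least that level.
    cells = list(zip(levels, text))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(cells):
            if cells[i][0] < level:
                i += 1
                continue
            j = i
            while j < len(cells) and cells[j][0] >= level:
                j += 1
            cells[i:j] = cells[i:j][::-1]
            i = j
    return "".join(ch for _, ch in cells)


ALEF = "\u05d0"  # HEBREW LETTER ALEF, escaped to keep this source unambiguous
a = "_" + ALEF + "1"
b = "_1" + ALEF
print(a == b)                              # False: different code point order
print(visual_order(a) == visual_order(b))  # True: same order on screen
```

As I read UTS 39, bidiSkeleton goes further by also applying the confusables skeleton mapping to the reordered text, so comparing display orders as above only covers the reordering half of the check.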
Found from reading a TR55 example -- they classify levels of confusability detection (we'd be Type 1, I guess), but they mention this case about Rust's detection:
There's a lot in there that's relevant, actually, in addition to big changes to TR39. This is the group of people/technical report that's actually thinking closest to 1:1 about the whole experience for PL creators <> users writing in other languages.
There's some stuff we may disagree on
I'd have bandwidth to look into it after EOM probably; I'm doing a decently painful project that ships around then.