Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement grapheme clusters #16916

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Implement grapheme clusters #16916

wants to merge 5 commits into from

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Mar 21, 2024

Now I know this PR is somewhat large, but it's honestly mostly tables.

First, this adds CodepointWidthDetector_gen.go which

  • parses ucd.nounihan.grouped.xml
  • computes the cluster break property for each codepoint
  • computes the East Asian Width property for each codepoint
  • compresses everything into a 4-stage trie
  • computes a LUT of cluster break rules between 2 codepoints
  • and serializes everything to C++ tables and helper functions

Next, this adds CodepointWidthDetectorTests_gen.go which

  • parses GraphemeBreakTest.txt
  • splits each test into graphemes and break opportunities
  • and serializes everything to a C++ table

Finally, CodepointWidthDetector.cpp is rewritten from scratch to

  • parse codepoints out of a UTF-16 string at a given offset
    (Previously it was given pre-computed codepoints as string-views.)
  • accumulate codepoints until a break opportunity arises
  • accumulate the total width of a grapheme

The old per-codepoint code was also rewritten to match this new
grapheme-based interface. However, its LUT was kept entirely.

This is part of #1472

Validation Steps Performed

@lhecker lhecker added Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Product-Conhost For issues in the Console codebase Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Product-Terminal The new Windows Terminal. labels Mar 21, 2024
upss
UPSS

Check warning

Code scanning / check-spelling

Ignored Expect Variant Warning

UPSS is ignored by check spelling because another more general variant is also in expect. (ignored-expect-variant)
Comment on lines 18 to 19
fmt.Println(err)
os.Exit(1)
Copy link
Member

@DHowett DHowett Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
fmt.Println(err)
os.Exit(1)
log.Fatalln(err)

src/types/CodepointWidthDetector.cpp Outdated Show resolved Hide resolved
{
char32_t codepoint = 0;
const auto l = lead & 15;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: personally i prefer masking constants to be hex, but if you prefer decimal that is OK too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I'm using 15 because that way I find it easier to see that it makes indexing into a [16] element large array safe.

src/types/CodepointWidthDetector_gen.go Outdated Show resolved Hide resolved
@lhecker
Copy link
Member Author

lhecker commented Mar 21, 2024

@j4james If you ever get a chance to test this branch, I'd love to hear your thoughts on it! 🙂

In particular, Dustin and me have been talking about whether it's necessary for us to allow users to disable grapheme support. It would definitely be beneficial if it were (restores compatibility with conhost v2) but also disadvantageous (limits our freedom in designing the text storage).

@j4james
Copy link
Collaborator

j4james commented Mar 22, 2024

First off, I have to say this is very cool. However, I do think it needs to be disabled by default, because the majority of Linux apps will expect terminals to be compatible with wcwidth, and this is not.

As a simple example, trying pasting something like this into a WSL bash shell and then arrow back to the start of the line:

🏳‍🌈🏳‍🌈🏳‍🌈

The first thing you'll notice is an extra flag getting rendered at the end of the line, and then when you arrow back, the cursor ends up part way through the prompt.

image

Those flags consist of a flag emoji, a zero width joiner, and a rainbow emoji. The flag has a width of 1, ZWJ a width of 0, and the rainbow a width of 2, so most apps will expect the combination to be 3. But since we're treating it as a width of 2, we're out of sync with what bash is expecting, and that's how things end up breaking.

The good news, though, is that there's a mode we can support (?2027), that lets applications decide whether they want the old wcwidth behavior, or they're smart enough to handle all the fancy Unicode stuff that wcwidth doesn't cover. This is something that was originally developed by Contour, but a number of other terminals have adopted it, and I'm sure I came across at least one application that was already using it.

I don't know the details well enough to know whether your implementation is compatible with what is required for mode ?2027, but if it is, I would encourage you to consider supporting it. It's still marked as being in active development, so if there's something you disagree with, I'm sure Christian will be happy to discuss it. The spec is tracked in the repo here: https://github.com/contour-terminal/terminal-unicode-core

@j4james
Copy link
Collaborator

j4james commented Mar 22, 2024

One other thing to bear in mind is how this will work with third-party terminals. Assumedly for conpty to work correctly, the conhost width calculations and the terminal width calculations need to be in sync. So there needs to be a way for the terminal to tell conhost what algorithm it is using, or possibly what algorithms it can support (if, like Contour, the mode is switchable).

Even ignoring the third party terminals, this would assumedly be necessary if you want to make it a user-controllable option in the Windows Terminal settings. I guess this is another use case for the conpty ioctl thing.

@lhecker
Copy link
Member Author

lhecker commented Mar 22, 2024

First off, I have to say this is very cool.

I'm super happy to hear you feel this way! 😌

I apologize for the very long text below. I tried to delete as much as I could, but... oh well:


However, I do think it needs to be disabled by default, because the majority of Linux apps will expect terminals to be compatible with wcwidth, and this is not.

I don't wish to do this, but I am open to changing my mind. In short, I believe that the default mode of operation should be "forward looking" / progressive, with the optimistic mindset that we'll have more decades of terminal usage in front of us than behind us.

I'm otherwise afraid that it puts less pressure on the general terminal landscape to be progressive. As far as I know the other modern terminals that support graphemes have the same stance.
I've recently triaged a couple PSReadLine issues for instance which made me realize it doesn't even support surrogate pairs to begin with, so the wcwidth situation there is even worse. Previously we've supported this implicitly because we were lenient on our own Unicode support as well, but if we now make this feature disabled by default, we turn it into an explicit support for leniency. I think it'd be better if everyone was nudged into supporting modern Unicode instead, ourselves included.


As a simple example, trying pasting something like this into a WSL bash shell and then arrow back to the start of the line:

🏳‍🌈🏳‍🌈🏳‍🌈

(FYI your flag emoji is missing U+FE0F which is why it doesn't properly render. This works: 🏳️‍🌈🏳️‍🌈🏳️‍🌈.)
((FFYYII This is a neat tool I found: https://unicode-explorer.com/tools/decode/))

I've tested this in Contour, WezTerm, kitty, foot, and Konsole and in all of them this failed to work in bash out of the box. I'm not sure whether there's something I need to configure to make it work in any of those terminals, but I believe many of them don't make this configurable. However, I believe this also somewhat validates my position on the choice of the default behavior above.

In my personal opinion this is fine though, because most text is in the BMP and in its NFC form (= no combining marks). With such inputs, bash and other wcwidth based applications will continue to work as before.


The good news, though, is that there's a mode we can support (?2027), that lets applications decide whether they want the old wcwidth behavior, or they're smart enough to handle all the fancy Unicode stuff that wcwidth doesn't cover. This is something that was originally developed by Contour, but a number of other terminals have adopted it, and I'm sure I came across at least one application that was already using it.

Yes, I'm definitely intending to implement ?2027.

However, I don't believe Contour allows grapheme support to be turned off via this sequence. Contour calls it DECMode::Unicode and the setter is not implemented. WezTerm calls it DecPrivateModeCode::GraphemeClustering and they leave it permanently enabled as well. It is possible though that I've missed something for the others.

Edit: I found a support table! https://mitchellh.com/writing/grapheme-clusters-in-terminals#terminal-comparison


One other thing to bear in mind is how this will work with third-party terminals. Assumedly for conpty to work correctly, the conhost width calculations and the terminal width calculations need to be in sync. So there needs to be a way for the terminal to tell conhost what algorithm it is using, or possibly what algorithms it can support (if, like Contour, the mode is switchable).

FWIW this is why I strongly believe that a conhost.lib is the right path forward. Even with an ioctl thingy, we'd not be able to ship every variant that people would need. fish-shell's widecharwidth for instance assigns unassigned and noncharacter codepoints a width of 0, but many other implementations, including the wcwidth implementations I'm aware of, as well as Windows Terminal so far, assign them a width corresponding to the EastAsian attribute. So even if we had a toggle to disable grapheme support, it would still have broken edge cases. With a in-process .lib it would at least work consistently correct according to the hosting terminal, as long as one doesn't use SSH (which requires something like the current ConPTY architecture).


All of the above aside, I'm not against adding such a toggle per se. I'm just worried it'll either:

  • not provide a significant reduction in edge cases despite increase in complexity
  • block us from making architectural improvements because it locks us into 1 specific way of representing text

If you have any arguments in favor of adding a toggle, or know about any applications that this change would significantly regress, I'd be 110% open to add it. In fact the mere circumstance that you argued in favor it already made me heavily consider it. 😅 I hope the above gives a few good arguments though, as to why we may want to consider not having such a toggle.

@lhecker
Copy link
Member Author

lhecker commented Mar 22, 2024

Out of curiosity I've tried to implement a wcwidth mode that is faithful to how my local WSL bash interprets it, and in a way that works with "🏳️‍🌈". I can't figure it out.
It definitely has zero-width character support because the old Windows Terminal approach clearly doesn't work. I then tried to split up graphemes whenever we encounter a non-zero-width character after we've already encountered one before (= split 🏳️‍🌈 into 🏳 and 🌈) and that almost works, but if I paste a lot of 🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈... in the prompt, backspacing breaks it again.

If you have any tips on how to implement a wcwidth-like mode, I'll try to do so.

@j4james
Copy link
Collaborator

j4james commented Mar 23, 2024

I'm otherwise afraid that it puts less pressure on the general terminal landscape to be progressive.

Are you hoping that making bash not work properly in Windows Terminal will force them to change, and instead break in terminals like XTerm and Gnome Terminal? Because that seems very unlikely. And I can understand the "little guys" like Contour or WezTerm choosing to do their own thing - nobody is going to give a damn - but if Microsoft tries to use its "monopoly" position to force through breaking changes in Linux, that's probably not going to go down well.

Now if XTerm and VTE were in on this too, I think it would be fine. But I can't imagine XTerm breaking decades of backwards compatibility, and recent discussions on the VTE issue tracker gave me the impression that they weren't keen on breaking things either (which surprised me actually). It's possible their views might change at some point, but until there is broader consensus amongst the main terminals, my recommendation would be to avoid deliberately breaking things.

And just to be clear, I personally don't care about the defaults either way. I just don't want Windows Terminal to be seen as another one of Microsoft's EEE attempts.

Yes, I'm definitely intending to implement ?2027. However, I don't believe Contour allows grapheme support to be turned off via this sequence.

I haven't been following this development very closely, but I suspect that is just because they haven't fully implemented the spec yet. The commit message "Starts adapting Terminal Unicode Core's DEC mode 2027" suggests it's still a work in progress. Certainly the spec is very clear about backwards compatibility: "Everything is disabled by default, so legacy apps don’t break more than they used to break already."

That was the whole point of introducing a mode for this in the first place. If you don't care about backwards compatibility, there's really no need for this mode.

If you have any arguments in favor of adding a toggle

Well if you decide not be backwards compatible by default, then when you get a bug report about something like bash breaking, you can at least tell people there's an option they can toggle to fix it, rather than saying you're deliberately breaking apps to force them to change.

If you have any tips on how to implement a wcwidth-like mode, I'll try to do so.

I thought it would just be a matter of dropping any zero width characters, and treating the other characters as either single or double width, as specified in EastAsianWidth. So in the rainbow flag case, any ZWJ and VS16 characters would be ignored, the flag would occupy one cell, and the rainbow would occupy two.

Looking a the script output in bash, when it redraws after you've pasted six flags, you can see it output 18 BS characters to try and position the cursor at the end of the prompt, and then output the flags again without reverse video. This is obviously assuming a flag is three cells wide (hence the 6*3 BS), so for us it's redrawn in the wrong place.

Then as you backspace, you can see it send BS BS EL for the rainbow, and BS EL for the flag, which is again exactly what you'd expect assuming a two cell width for the rainbow, and a one cell width for the flag.

Btw if you want to test with a wcwidth-compatible terminal on Windows, this works as expected in mintty (assuming you're using the wsltty version).

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

This is going to seem like a stupid edge case, but I think it might be an indication that something is not quite right in the architecture. Try this in a WSL shell:

printf "\U01f3f3\ufe0f\e7 \U01f308 DON'T MOVE"; sleep 1; printf "\e8\u200d\n"

Initially it displays a flag and a rainbow with a space between them, but after a second they join together. This shrinks the combination down to two cells, but the "DON'T MOVE" text remains where it started.

Now if you try the same thing without the sleep, the "DON'T MOVE" text appears three columns to the left. I'm assuming that's because the join happens before the conpty renderer has had a chance to refresh.

That seems wrong to me. It should not be possible for text to end up in a different position just based on timing. And I think the underlying issue is that it shouldn't be possible to "join" two characters which have been output separately, just by inserting a ZWJ between them.

@lhecker
Copy link
Member Author

lhecker commented Mar 24, 2024

I haven't fully debugged it yet, but I believe this shows two highly related issues:

  • The text renderer joins all cells to a single string and then passes that to DirectWrite. This causes it to draw 🏳️‍🌈 even though the text buffer contains 🏳️🌈 in separate cells. This can be easily observed in OpenConsole with AtlasEngine, since the ConPTY issue doesn't occur there. I'm not 100% sure how urgent fixing this issue is though.
  • ConPTY does the same thing, but it causes Windows Terminal to never see the cursor movement that happened in between the 2 writes in the first place. That's a lot more problematic.

We can solve both issues in two ways:

  • Assemble rows to strings. Then iterate over graphemes and check whether 1 grapheme consists of >1 cell. If we encounter this, then we know that we need to insert a control character, escape sequence, or similar, to break up the grapheme.
  • After we encounter a cursor movement or similar, and if the first write starts with a combining character of any kind, we insert a single 0x20 space first.

Personally speaking, I highly prefer the latter as it makes the "what you see" when we render on the screen consistent with "what you get" when you copy the text to your clipboard. It also simplifies our implementation, because the former option requires us to implement the workaround both, in ConPTY and in our various text renderers, whereas the latter is a simple if condition in Replace() / Insert() with a couple lines of code.
Dustin has previously mentioned that this is not a good idea, because it means that after writing such text to the console buffer you can't read the exact same text back. Also, it sort of lies to the user what the terminal application actually wrote. I agree in both accounts and so I don't particularly like my preference for the latter option either. I do still prefer it because I think the architectural simplifications are worth it.

Thoughts?

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

The text renderer joins all cells to a single string and then passes that to DirectWrite. This causes it to draw 🏳️‍🌈 even though the text buffer contains 🏳️🌈 in separate cells. I'm not 100% sure how urgent fixing this issue is though.

That makes sense. And I don't think that's urgent to fix at this stage. The important thing is getting the width calculations correct first. The rendering issues can be dealt with later.

After we encounter a cursor movement or similar, and if the first write starts with a combining character of any kind, we insert a single 0x20 space first.

I think this would be my preference. My expectation was that these combining characters would be interpreted at the time the stream of text was received, rather than when rendering the buffer, and this seems like a neat way to accomplish that.

Dustin has previously mentioned that this is not a good idea, because it means that after writing such text to the console buffer you can't read the exact same text back.

I don't think that's a big deal though. Because the minute you've moved the cursor, you're effectively in an edit mode - you're now changing what's in the buffer. You can't expect to read back text that've you overwritten. It's like writing a\x08b and expecting the buffer to somehow contain both a and b.

@lhecker

This comment was marked as off-topic.

@lhecker lhecker marked this pull request as draft March 24, 2024 16:41
@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

Right now your test with 🏳️‍🌈 doesn't work either and it hasn't forced them to change to match our behavior.

The difference is that the current behavior is viewed as a bug. When someone complains about the widths being off somewhere, we point them to the ZWJ issue and say we're still working on it.

I believe my comment about progressiveness shouldn't have conveyed that I'm trying to "force" others unless read with an "us vs them" angle. That's certainly not what I had in mind when I wrote it. It's quite saddening to imagine that it may be read that way. Does this mean I'm not part of "the team"?

I apologize if I've misinterpreted your wording, and if there are teams, I am definitely on your side. I love what you're doing here, but I want to avoid giving ammunition to the groups that are looking for an excuse to attack MS.

To be clear, and as I've said in a previous comment, I don't mind making ?2027l a switch that puts the terminal into a wcwidth-compatible mode and users could just put that into their .bashrc.

If we are going to have this functionality enabled by default, then I think a settings toggle would be preferable to mode ?2027, just because it's more user-friendly. And I'm not sure it's even worth bothering with the mode if everyone is just going to default it on - that seems pointless to me

That said, if we don't yet have an easy way to implement a toggle, then a mode is definitely better than nothing. I just want to make sure we have something we can tell people if anyone does complain.

And again I must apologise if my earlier message has caused you a lot of stress. That's precisely what I was trying to avoid, and I just did a very poor job of it.

@j4james
Copy link
Collaborator

j4james commented Mar 24, 2024

And you know what, if the wcwidth-compatible algorithm is not a simple fix, maybe we should just leave it for a followup issue. If nobody complains, we can ignore it indefinitely. And if anyone does complain, we can at least say it's a known issue, and gather some real-world test cases before trying to address it.

I'd hate for you to give up on this PR just because I'm being paranoid about hypothetical complaints.

@o-sdn-o
Copy link

o-sdn-o commented Mar 25, 2024

I was happy when I saw that you began implementing text segmentation into grapheme clusters. I think this is the right step in improving the console subsystem, despite the fact that this may go against the existing code base and de facto “standards” in the terminal world, up to changing the terminal architecture (internal text buffer) and rewriting a large amount of functionality literally from scratch.

A console application may have its own interpretation of the Unicode standard, burdened by national peculiarities that the terminal cannot predict in any way, and it is necessary to somehow allow the console application to set the boundaries of grapheme clusters itself. If the application does not specify the boundaries of grapheme clusters and does not even make hints about this, then the terminal has the right to set the boundaries of the clusters as it pleases, either codepoint-wise or grapheme-wise.

The need for this becomes obvious in any attempt to create a more or less competitive console text editor or a panel file manager like TotalCommander, in which there is no point in rendering text in codepoint, since there it is necessary to display grapheme clusters in the same way as GUI editors/file managers display them . The point here is not only in the plain text of the edited document, which can still be somehow perceived in a codepoint manner, but in the file names that the user sees in GUI mode exclusively in grapheme representation. This applies to both the Unix world and the Windows world.

The fact that in the Unix world in terminals, text is displayed in codepoint, in my opinion this is a historical flaw associated with technical limitations, and sooner or later this flaw will be fixed. And it would be an even bigger mistake to rely on this codepoint behavior when developing a terminal emulator or console application.

Edit:
Due to the fact that GUI terminals do not produce grapheme-based output (even without support for grapheme halves), I have to write my own GUI frontend for my console applications to get around this. Perhaps this is my personal wish, and no one should rely on it when making decisions about the architecture of global systems that affect many people. Perhaps this is generally talking about some kind of dichotomous subsystem, which is neither a GUI nor a TUI, which should go its own way with grapheme clusters and support for keyboards with national layouts.

@DHowett
Copy link
Member

DHowett commented Mar 25, 2024

apparebit/demicode works fantastic ✅

/cc @apparebit 😄

@lhecker
Copy link
Member Author

lhecker commented Mar 25, 2024

image

There are 3 errors:

  • 2 times we fail to make a zero width character zero width. Unfortunately our architecture doesn't support storing such characters independently and a width of 1 as a minimum is enforced.
  • When encountering a VS16 we make any character wide, because this greatly simplifies the width measurement. I don't expect this to be problematic in practice.

@mominshaikhdevs
Copy link

I wish I could contribute. I hope this PR gets merged.

@tusharsnx
Copy link
Contributor

tusharsnx commented Apr 1, 2024

I don't know if it's feasible or not, but could we split the cursor movement bits to a different PR? I guess it has its own value regardless of whether we're displaying graphemes with extra empty cells.

Being able to move between the codepoints of a grapheme cluster, and delete a few codepoints from the cluster is funny and odd.

cursor-movement-with-large-grapheme-cluster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Rendering Text rendering, emoji, complex glyph & font-fallback issues Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants