Use direct unicode.org data to generate CharClasses.cpp #2138

unxed · 2024-04-09T16:37:22Z

Since almost every char width detection lib has its own problems and limitations, I suggest generating CharClasses.cpp from the most recent data files downloaded from unicode.org.

An example of how it can be done:
https://github.com/ridiculousfish/widecharwidth/blob/master/generate.py
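To make the idea concrete, here is a minimal sketch of such a generator, not the widecharwidth script itself. It assumes the input is the text of EastAsianWidth.txt from unicode.org (lines of the form `first..last;class # comment`), and the names `parse_east_asian_width`, `emit_cpp_table` and the C++ struct `CodepointRange` are hypothetical; the real layout of CharClasses.cpp may differ.

```python
def parse_east_asian_width(text):
    """Return sorted, merged (first, last) ranges whose class is W or F."""
    ranges = []
    for line in text.splitlines():
        # Drop the trailing comment and skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        cp_field, ea_class = (part.strip() for part in data.split(";", 1))
        if ea_class not in ("W", "F"):
            continue
        # A field is either a single code point or "first..last".
        first, _, last = cp_field.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))

    ranges.sort()
    merged = []
    for first, last in ranges:
        # Join ranges that touch or overlap to keep the table small.
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def emit_cpp_table(name, ranges):
    """Render the ranges as a C++ array (CodepointRange is a hypothetical struct)."""
    rows = ",\n".join(f"    {{0x{a:04X}, 0x{b:04X}}}" for a, b in ranges)
    return f"static const CodepointRange {name}[] = {{\n{rows}\n}};\n"
```

Running this at build time against a freshly downloaded file would keep the table in step with the current Unicode version, at the cost of a network dependency in the build.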

Dazzar56 · 2024-04-09T21:29:04Z

In general, as I've noticed, many popular console applications prefer this approach.
For example, NeoVim removed all of its wcwidth() calls a year ago. A quote from the maintainer:

I wouldn't trust libc wchar functions over our own stuff in ten thousand years. There are good c unicode libraries out there we should consider (like libutf8proc and libgrapheme), but glibc or anything contaminated by the POSIX standard is not.

Instead, on every build they download fresh tables from https://unicode.org/Public/UNIDATA/ and generate constant tables with Lua.
In the code it looks like this:
https://github.com/neovim/neovim/blob/4946489e2e3eeca5c831faf9fe86cbf1229701e2/src/nvim/mbyte.c#L471-L507

NeoVim has a huge community that continuously develops the editor. They hit all the possible Unicode-support pitfalls long ago, and rendering artifacts have not shown up there for a very long time. Their code base deserves a closer look. This particular file collects all of the functions for working with Unicode.
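As a rough illustration of how a generated constant table is typically consulted, here is a binary search over sorted, non-overlapping intervals. This is a sketch, not NeoVim's code: the three ranges in `WIDE` are only a small illustrative excerpt, and `in_table` and `cell_width` are hypothetical names.

```python
from bisect import bisect_right

# Illustrative excerpt only; a real table is generated from the Unicode data files.
WIDE = [(0x1100, 0x115F), (0xAC00, 0xD7A3), (0xFF01, 0xFF60)]
WIDE_STARTS = [first for first, _ in WIDE]


def in_table(cp):
    """Binary-search the sorted interval table for code point cp."""
    # Index of the last interval whose start is <= cp, or -1 if there is none.
    i = bisect_right(WIDE_STARTS, cp) - 1
    return i >= 0 and cp <= WIDE[i][1]


def cell_width(cp):
    """Return 2 for code points in the wide table, otherwise 1."""
    return 2 if in_table(cp) else 1
```

A full implementation would also need tables for zero-width and combining characters; this sketch only distinguishes wide from everything else.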

TrNullFree · 2024-04-09T21:37:00Z

FYI
I looked at far3. Its char_width.cpp takes a ready-made table from the Windows Terminal sources (src\types\CodepointWidthDetector.cpp), and WT in turn takes that table from a file generated by the CodepointWidthsFromUCD.ps1 script.
The CodepointWidthsFromUCD.ps1 script generates code from the Unicode UCD XML document "ucd.nounihan.flat.xml".
The script applies 3 overrides, each of which forces the width to 1:

      <override first-cp="2500" last-cp="259F" ea="H" comment="box-drawing and block elements require 1-cell alignment" />
      <override first-cp="4DC0" last-cp="4DFF" ea="H" comment="hexagrams are historically narrow" />
      <override first-cp="FE20" last-cp="FE2F" ea="H" comment="narrow combining ligatures (split into left/right halves, which take 2 columns together)" />
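
One way to model these overrides in a generator is to subtract the overridden ranges from the table of double-width ranges before emitting it. The sketch below does that under this assumption; it is not how CodepointWidthsFromUCD.ps1 is written, and `subtract_ranges` is a hypothetical helper. Only the three code point ranges come from the overrides listed above.

```python
# The three override ranges quoted above; each is forced to width 1.
OVERRIDES = [(0x2500, 0x259F), (0x4DC0, 0x4DFF), (0xFE20, 0xFE2F)]


def subtract_ranges(ranges, overrides):
    """Remove every overridden code point from the inclusive (first, last) ranges."""
    result = []
    for first, last in ranges:
        pieces = [(first, last)]
        for o_first, o_last in overrides:
            next_pieces = []
            for a, b in pieces:
                if o_last < a or o_first > b:
                    # No overlap: keep the piece unchanged.
                    next_pieces.append((a, b))
                    continue
                if a < o_first:
                    # Keep the part to the left of the override.
                    next_pieces.append((a, o_first - 1))
                if o_last < b:
                    # Keep the part to the right of the override.
                    next_pieces.append((o_last + 1, b))
            pieces = next_pieces
        result.extend(pieces)
    return sorted(result)
```

For example, a wide range spanning the hexagram block would be split into the parts before U+4DC0 and after U+4DFF, leaving the hexagrams at width 1.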

unxed mentioned this issue Apr 9, 2024

Char width detection problem #2132

Closed

unxed mentioned this issue Apr 14, 2024

Unicode issues left — metabug #2157

Open
