Improve `int.toUnicode()` documentation #80

Marcono1234 · 2021-09-04T21:30:07Z

The documentation for the newly added int.toUnicode() predicate says:

Returns the unicode character for the receiver seen as a unicode code point

This is slightly misleading because CodeQL strings consist of UTF-16 code units, not code points. Supplementary code points (> U+FFFF) therefore result in two CodeQL string characters (demonstrated by this query). It might also be good to describe the behavior for invalid code point values. For surrogate code points it does not seem to have a result either, e.g. 55296.toUnicode().
Also, "Unicode" should be capitalized.

I would recommend the following description (or similar):

Returns the Unicode character for the receiver seen as a Unicode code point. Because CodeQL strings consist of UTF-16 code units, supplementary code points (that is, code points above U+FFFF) result in a CodeQL string of length 2. This predicate has no result if the int receiver does not represent a valid Unicode code point, or represents the code point of a surrogate character.
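
To make the proposed wording concrete, here is a rough Python model of the behavior described above. It is only a sketch of my understanding, not how CodeQL implements the predicate, and the helper name `utf16_length` is made up for illustration. It returns the number of UTF-16 code units for a code point, or `None` in the cases where the predicate appears to have no result.

```python
from typing import Optional


def utf16_length(code_point: int) -> Optional[int]:
    """Model of the described behavior (hypothetical helper, not a CodeQL API).

    Returns how many UTF-16 code units encode `code_point`, or None for
    values where the predicate is described as having no result.
    """
    # Assumption: anything outside U+0000..U+10FFFF is not a valid code point.
    if not 0 <= code_point <= 0x10FFFF:
        return None
    # Assumption: surrogate code points (U+D800..U+DFFF) have no result.
    if 0xD800 <= code_point <= 0xDFFF:
        return None
    # Each UTF-16 code unit is 2 bytes, so halve the encoded byte length.
    return len(chr(code_point).encode("utf-16-le")) // 2


print(utf16_length(65))      # BMP code point: one code unit
print(utf16_length(128512))  # supplementary code point: surrogate pair
print(utf16_length(55296))   # surrogate code point: no result
```

Under these assumptions, 128512 (U+1F600) needs two code units, which matches the string length of 2 mentioned above, while 55296 (U+D800) has no result.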

This requires changes to the built-in documentation (which is why I created the issue here) as well as the language specification.

RasmusWL · 2021-09-20T08:34:28Z

Thanks for the insightful comments 👍

erik-krogh · 2021-09-20T11:10:57Z

Yes, something like 128512.toUnicode() will result in a string where the length() is 2.
And yes, invalid/surrogate characters have no result.

So you're right, the documentation might be a bit misleading.

String lengths are hard and they are not a very useful measure, but they are probably the best we have for describing what happens for code points like 55296.
I tried to see if I could rewrite your suggestion into something that's more explicit about the length of the string (and what kind of length it is), but that didn't turn out well.

So I think I might go with your suggestion. I'll let you know.

github-actions bot added the CLI label Sep 4, 2021

Marcono1234 changed the title ~~Clarify int.toUnicode() behavior for supplementary code points~~ Improve int.toUnicode() documentation Sep 7, 2021
