Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve int.toUnicode() documentation #80

Open
Marcono1234 opened this issue Sep 4, 2021 · 2 comments
Open

Improve int.toUnicode() documentation #80

Marcono1234 opened this issue Sep 4, 2021 · 2 comments
Labels

Comments

@Marcono1234
Copy link
Contributor

Marcono1234 commented Sep 4, 2021

The documentation for the newly added int.toUnicode() predicate says:

Returns the unicode character for the receiver seen as a unicode code point

This is slightly misleading because CodeQL strings consist of UTF-16 code points. Therefore supplementary code points (> U+FFFF) will result in two CodeQL string characters (demonstrated by this query). It might also be good to describe its behavior for invalid code point values. For surrogate code point it does not seem to have a result either, e.g. 55296.toUnicode().
Also it should uppercase "Unicode".

I would recommend the following description (or similar):

Returns the Unicode character for the receiver seen as a Unicode code point. Because CodeQL strings consist of UTF-16 code units, supplementary code points (that is > U+FFFF) result in a CodeQL string of length 2. This predicate has no result if the int receiver does not represent a valid Unicode code point, or represents the code point of a surrogate character.

This requires changes to the built-in documentation (which is why I created the issue here) as well as the language specification.

@github-actions github-actions bot added the CLI label Sep 4, 2021
@Marcono1234 Marcono1234 changed the title Clarify int.toUnicode() behavior for supplementary code points Improve int.toUnicode() documentation Sep 7, 2021
@RasmusWL
Copy link
Member

Thanks for the insightful comments 👍

@erik-krogh
Copy link

Yes, something like 128512.toUnicode() will result in a string where the length() is 2.
And yes, invalid/surrogate characters have no result.

So you're right the documentation might be a bit misleading.

String lengths are hard and they are not a very useful measure, but they are probably the best we got for describing what happens for code points like 55296.
I tried to see if I could rewrite your suggestion into something that's more explicit the length of the string (and what kind of length it is), but that didn't turn out good.

So I think I might go with your suggestion. I'll let you know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants