
Unicode support: ml-ulex runtime, JSONParser #271

Open
Skyb0rg007 opened this issue Jun 10, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@Skyb0rg007

Description

Improvements

The current ml-lpt runtime and JSONParser do not properly handle UTF-8.
The proposed improvements would solve this issue.

Unicode Input

The ml-lpt runtime library (specifically the ULexBuffer module) does not properly handle UTF-8 input, raising errors on some valid byte sequences and accepting some invalid ones.
Comparing UTF8.getu to ULexBuffer.getu reveals the first issue: ULexBuffer never parses 4-byte UTF-8 sequences.
An easy fix is to bring these implementations in line with one another.

- val str = "\"" ^ UTF8.encode 0wx1F469 ^ "\"";
val str = "\"\240\159\145\169\"" : string
- print (str ^ "\n");
"👩"
- JSONParser.parse (JSONParser.openString str);
uncaught exception Incomplete
(* Expected result: val it = STRING "\240\159\145\169" : JSON.value *)
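
For reference, here is a minimal sketch of the missing 4-byte case, written in the same accumulate-6-bits-per-continuation-byte style as UTF8.getu. The function and exception names are stand-ins for ULexBuffer's internals, not the actual fix:

(* Decode a 4-byte UTF-8 sequence (U+10000-U+10FFFF) whose leading byte
 * b0 matches 11110xxx.  getByte : 'strm -> (word * 'strm) option is a
 * hypothetical accessor standing in for ULexBuffer's stream operations. *)
exception Incomplete    (* input ended in the middle of a sequence *)
exception Invalid       (* malformed continuation byte *)

fun get4 getByte (b0 : word, strm) = let
      fun cont (w, strm) = (case getByte strm
             of SOME (b, strm') =>
                  if Word.andb (b, 0wxC0) = 0wx80   (* continuation byte 10xxxxxx? *)
                    then (Word.orb (Word.<< (w, 0w6), Word.andb (b, 0wx3F)), strm')
                    else raise Invalid
              | NONE => raise Incomplete
            (* end case *))
      in
        cont (cont (cont (Word.andb (b0, 0wx07), strm)))
      end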

Additionally, the ULexBuffer module accepts invalid UTF-8, and as a result can produce output strings that are themselves invalid UTF-8.
UTF-8-encoded strings are not allowed to contain code points in the surrogate-pair range (0xD800-0xDFFF).

From Wikipedia:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16
(U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF)
are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

The current implementation of ULexBuffer.getu does not account for this.

- val str = "\"" ^ UTF8.encode 0wxD800 ^ "\"";
val str = "\"\237\160\128\"" : string
- JSONParser.parse (JSONParser.openString str);
val it = STRING "\237\160\128" : JSON.value
(* Expected result: uncaught exception Incomplete (or some other exception) *)

Note: I do not propose changing the implementation of UTF8.encode, as it is sometimes useful to encode strings into WTF-8.

Unicode Escapes

The current implementation of the JSON lexer does not properly handle Unicode escape sequences.
From the JSON Spec (pages 12-13):

Any code point may be represented as a hexadecimal escape sequence.
[..]
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a
twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for
example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

However, the current version of the JSONParser library does not implement this required part of the spec:

- JSONParser.parse (JSONParser.openString "\"\\uD834\\uDD1E\"");
val it = STRING "\237\160\180\237\180\158" : JSON.value
(* Expected behavior: val it = STRING "\240\157\132\158" : JSON.value *)
(* (the UTF-8 encoding of U+1D11E) *)

Instead, the lexer simply encodes each escape into WTF-8.
The best fix is to match surrogate pairs explicitly in the lexer spec:

(* Old *)
<S>"\\u"{xdigit}{4}     => ( addUChar yytext; continue() )

(* New *)
%defs(
  (* yytext has the form "\uXXXX\uYYYY"; decode the UTF-16 surrogate
   * pair and add the UTF-8 encoding of the resulting code point *)
  fun addSurrogate yytext = let
        fun hex i = valOf (Word.fromString (String.substring (yytext, i, 4)))
        val u1 = hex 2   (* high surrogate *)
        val u2 = hex 8   (* low surrogate *)
        in
          addStr (UTF8.encode
            (0wx10000
             + Word.<< (Word.andb (u1, 0wx3FF), 0w10)
             + Word.andb (u2, 0wx3FF)))
        end
)

%let nonSurrogateHex = [0-9a-ce-fA-CE-F]{xdigit}{3}|[dD][0-7]{xdigit}{2};
%let highSurrogateHex = [dD][89abAB]{xdigit}{2};
%let lowSurrogateHex = [dD][c-fC-F]{xdigit}{2};

<S>"\\u"{nonSurrogateHex} => ( addUChar yytext; continue() )
<S>"\\u"{highSurrogateHex}"\\u"{lowSurrogateHex} => ( addSurrogate yytext; continue() )
<S>"\\u"{xdigit}{4} => ( raise Fail "Unmatched surrogate pair" )
Skyb0rg007 added the enhancement label on Jun 10, 2023
@JohnReppy (Contributor)

For the first part of this (the ULexBuffer module), I recently pushed a fix to the legacy branch, but I want to do some more testing before porting it to the development branch.

For the second part, do we need validity checking for surrogate pairs, or do we just treat any two hex escapes in series as a pair?

@Skyb0rg007 (Author)

We need to explicitly match escapes with specific values, since the only escapes that need to be handled differently are those in the surrogate pair range. For example "\u0000\uD800\uDFFF\u0000" should be treated as 3 codepoints (U+0000, U+103FF, U+0000), and "\u0000\uDFFF\uD800\u0000" is invalid.

Looking at how other JSON parsers do it: the jq utility matches sequences of escapes, then handles surrogates in the action code. The OCaml yojson and Rust serde libraries seem to use a form of state transition to handle surrogates. However, I think that explicitly matching consecutive hex codes in the surrogate range is the easiest and most readable solution for ml-ulex.

I did realize that my example can be simplified somewhat:

%let nonSurrogateHex = [0-9a-ce-fA-CE-F]{xdigit}{3}|[dD][0-7]{xdigit}{2};
%let surrogateHex = [dD][89a-fA-F]{xdigit}{2};

<S>"\\u"{nonSurrogateHex} => ( .. )
<S>"\\u"{surrogateHex}"\\u"{surrogateHex} => ( (* Assert high and low surrogate values here *) )
<S>"\\u"{surrogateHex} => ( raise Fail "Lone surrogate" )

@Skyb0rg007 (Author)

I took a look at the patch submitted to the legacy branch; however, it doesn't seem to catch invalid surrogate code points in the input stream:

- CM.make "./ml-lpt-lib.cm";
- val stream = TextIO.openString (UTF8.encode 0wxD800);
- val ustream = ULexBuffer.mkStream (0, fn () => TextIO.input stream);
- ULexBuffer.getu ustream;
SOME (0wxD800, S (...))

Because the input is not valid UTF-8, the library should raise an exception rather than pass the invalid code point on to the caller.
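
Concretely, getu could vet each decoded value with something along these lines (a sketch; the exception name is illustrative):

exception Invalid   (* illustrative; use whatever ULexBuffer raises *)

(* Surrogate halves (U+D800-U+DFFF) and values above U+10FFFF are not
 * valid Unicode scalar values, so reject them instead of returning them. *)
fun checkCodePoint (w : word) : word =
      if (0wxD800 <= w andalso w <= 0wxDFFF) orelse w > 0wx10FFFF
        then raise Invalid
        else w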

@JohnReppy (Contributor)

That patch just deals with the problem that 4-byte UTF-8 sequences resulted in the Incomplete exception being raised. Surrogate pairs are a separate issue, and there is no vetting of the validity of UTF-8 code points.

@JohnReppy (Contributor)

I've pushed another patch to the legacy branch so that the ULexBuffer.getu function will raise the Invalid exception on surrogate halves and on code points that are too large (greater than U+10FFFF). ML-ULex ought to be modified to resync the input stream when it encounters incomplete or invalid UTF-8 sequences, but that will take some more work.
The issue with the JSONParser not handling Unicode escape sequences still needs to be addressed.
