
Unicode support: ml-ulex runtime, JSONParser #271

Open
Skyb0rg007 opened this issue Jun 10, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@Skyb0rg007

Description

Improvements

The current ml-lpt runtime and JSONParser do not properly handle UTF-8.
The proposed improvements would solve this issue.

Unicode Input

The ml-lpt runtime library (specifically the ULexBuffer module) does not properly handle UTF-8 input, raising errors on some valid byte sequences and accepting some invalid ones.
Comparing UTF8.getu to ULexBuffer.getu reveals the first issue: ULexBuffer never parses 4-byte UTF-8 sequences.
An easy fix is to bring these implementations in line with one another.

- val str = "\"" ^ UTF8.encode 0wx1F469 ^ "\"";
val str = "\"\240\159\145\169\"" : string
- print (str ^ "\n");
"👩"
- JSONParser.parse (JSONParser.openString str);
uncaught exception Incomplete
(* Expected result: val it = STRING "\240\159\145\169" : JSON.value *)
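
For reference, here is a minimal sketch of the missing 4-byte case, written in the same accumulate-6-bits-per-continuation-byte style as UTF8.getu. The function and exception names are stand-ins for ULexBuffer's internals, not the actual fix:

(* Decode a 4-byte UTF-8 sequence (U+10000-U+10FFFF) whose leading byte
 * b0 matches 11110xxx.  getByte : 'strm -> (word * 'strm) option is a
 * hypothetical accessor standing in for ULexBuffer's stream operations. *)
exception Incomplete    (* input ended in the middle of a sequence *)
exception Invalid       (* malformed continuation byte *)

fun get4 getByte (b0 : word, strm) = let
      fun cont (w, strm) = (case getByte strm
             of SOME (b, strm') =>
                  if Word.andb (b, 0wxC0) = 0wx80   (* continuation byte 10xxxxxx? *)
                    then (Word.orb (Word.<< (w, 0w6), Word.andb (b, 0wx3F)), strm')
                    else raise Invalid
              | NONE => raise Incomplete
            (* end case *))
      in
        cont (cont (cont (Word.andb (b0, 0wx07), strm)))
      end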

Additionally, the ULexBuffer module accepts invalid UTF-8, and as a result can produce output strings that are themselves invalid UTF-8.
UTF-8-encoded strings are not allowed to contain code points in the surrogate-pair range (0xD800-0xDFFF).

From Wikipedia:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16
(U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF)
are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

The current implementation of ULexBuffer.getu does not account for this.

- val str = "\"" ^ UTF8.encode 0wxD800 ^ "\"";
val str = "\"\237\160\128\"" : string
- JSONParser.parse (JSONParser.openString str);
val it = STRING "\237\160\128" : JSON.value
(* Expected result: uncaught exception Incomplete (or some other exception) *)

Note: I do not propose changing the implementation of UTF8.encode, as it is sometimes useful to encode strings into WTF-8.

Unicode Escapes

The current implementation of the JSON lexer does not properly handle Unicode escape sequences.
From the JSON Spec (pages 12-13):

Any code point may be represented as a hexadecimal escape sequence.
[..]
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a
twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for
example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

However, the current version of the JSONParser library does not implement this required part of the spec:

- JSONParser.parse (JSONParser.openString "\"\\uD834\\uDD1E\"");
val it = STRING "\237\160\180\237\180\158" : JSON.value
(* Expected behavior: val it = STRING "\240\157\132\158" : JSON.value *)
(* (the UTF-8 encoding of U+1D11E) *)

Instead, the lexer simply encodes each escape into WTF-8.
The best fix is to match surrogate pairs explicitly in the lexer spec:

(* Old *)
<S>"\\u"{xdigit}{4}     => ( addUChar yytext; continue() )

(* New *)
%defs(
  (* yytext has the form "\uXXXX\uYYYY"; decode the UTF-16 surrogate
   * pair and add the UTF-8 encoding of the resulting code point *)
  fun addSurrogate yytext = let
        fun hex i = valOf (Word.fromString (String.substring (yytext, i, 4)))
        val u1 = hex 2   (* high surrogate *)
        val u2 = hex 8   (* low surrogate *)
        in
          addStr (UTF8.encode
            (0wx10000
             + Word.<< (Word.andb (u1, 0wx3FF), 0w10)
             + Word.andb (u2, 0wx3FF)))
        end
)

%let nonSurrogateHex = [0-9a-ce-fA-CE-F]{xdigit}{3}|[dD][0-7]{xdigit}{2};
%let highSurrogateHex = [dD][89abAB]{xdigit}{2};
%let lowSurrogateHex = [dD][c-fC-F]{xdigit}{2};

<S>"\\u"{nonSurrogateHex} => ( addUChar yytext; continue() )
<S>"\\u"{highSurrogateHex}"\\u"{lowSurrogateHex} => ( addSurrogate yytext; continue() )
<S>"\\u"{xdigit}{4} => ( raise Fail "Unmatched surrogate pair" )
Skyb0rg007 added the enhancement label on Jun 10, 2023
@JohnReppy (Contributor)

For the first part of this (the ULexBuffer module), I recently pushed a fix to the legacy branch, but I want to do some more testing before porting it to the development branch.

For the second part, do we need validity checking for surrogate pairs, or do we just treat any two hex escapes in series as a pair?

@Skyb0rg007 (Author)

We need to explicitly match escapes with specific values, since the only escapes that need to be handled differently are those in the surrogate pair range. For example "\u0000\uD800\uDFFF\u0000" should be treated as 3 codepoints (U+0000, U+103FF, U+0000), and "\u0000\uDFFF\uD800\u0000" is invalid.

Looking at how other JSON parsers do it: the jq utility matches sequences of escapes, then handles surrogates in the action code. The OCaml yojson and Rust serde libraries seem to use a form of state transition to handle surrogates. However, I think that explicitly matching consecutive hex codes in the surrogate range is the easiest and most readable solution for ml-ulex.

I did realize that my example can be simplified somewhat:

%let nonSurrogateHex = [0-9a-ce-fA-CE-F]{xdigit}{3}|[dD][0-7]{xdigit}{2};
%let surrogateHex = [dD][89a-fA-F]{xdigit}{2};

<S>"\\u"{nonSurrogateHex} => ( .. )
<S>"\\u"{surrogateHex}"\\u"{surrogateHex} => ( (* Assert high and low surrogate values here *) )
<S>"\\u"{surrogateHex} => ( raise Fail "Lone surrogate" )

@Skyb0rg007 (Author)

I took a look at the patch submitted to the legacy branch; however, it doesn't seem to catch invalid surrogate code points in the input stream:

- CM.make "./ml-lpt-lib.cm";
- val stream = TextIO.openString (UTF8.encode 0wxD800);
- val ustream = ULexBuffer.mkStream (0, fn () => TextIO.input stream);
- ULexBuffer.getu ustream;
SOME (0wxD800, S (...))

Because the input is not valid UTF-8, the library should raise an exception rather than pass the invalid code point on to the caller.
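
Concretely, getu could vet each decoded value with something along these lines (a sketch; the exception name is illustrative):

exception Invalid   (* illustrative; use whatever ULexBuffer raises *)

(* Surrogate halves (U+D800-U+DFFF) and values above U+10FFFF are not
 * valid Unicode scalar values, so reject them instead of returning them. *)
fun checkCodePoint (w : word) : word =
      if (0wxD800 <= w andalso w <= 0wxDFFF) orelse w > 0wx10FFFF
        then raise Invalid
        else w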

@JohnReppy (Contributor)

That patch just deals with the problem that 4-byte UTF-8 sequences resulted in the Incomplete exception being raised. Surrogate pairs are a separate issue, and there is no vetting of the validity of UTF-8 code points.

@JohnReppy (Contributor)

I've pushed another patch to the legacy branch so that the ULexBuffer.getu function will raise the Invalid exception on surrogate halves and on code points that are too large (greater than U+10FFFF). ML-ULex ought to be modified to resync the input stream when it encounters incomplete or invalid UTF-8 sequences, but that will take some more work.
The issue with the JSONParser not handling Unicode escape sequences still needs to be addressed.
