Unicode support: ml-ulex runtime, JSONParser #271
Comments
For the first part of this (the […]). For the second part, do we need some sort of validity checking for valid surrogate pairs, or do we just treat any two hex escapes in series as a pair?
We need to explicitly match escapes with specific values, since the only escapes that need to be handled differently are those in the surrogate-pair range. For example, "\u0000\uD800\uDFFF\u0000" should be treated as 3 codepoints ([…]). Looking at how other JSON parsers do it, the […]. I did realize that my example can be somewhat simplified to: […]
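The pairing behavior described in this comment can be sketched as follows (the function name is hypothetical, not part of the library; escape values are modeled as a list of words):

```sml
(* Hypothetical sketch: turn a list of \uXXXX escape values into
 * codepoints, pairing surrogates and rejecting lone ones. *)
fun pairEscapes [] = []
  | pairEscapes (hi :: rest) =
      if 0wxD800 <= hi andalso hi <= 0wxDBFF
        then (case rest
           of lo :: rest' =>
                if 0wxDC00 <= lo andalso lo <= 0wxDFFF
                  then 0wx10000 + Word.<< (hi - 0wxD800, 0w10)
                         + (lo - 0wxDC00) :: pairEscapes rest'
                  else raise Fail "lone high surrogate"
            | [] => raise Fail "lone high surrogate")
      else if 0wxDC00 <= hi andalso hi <= 0wxDFFF
        then raise Fail "lone low surrogate"
        else hi :: pairEscapes rest
```

On the example above, `pairEscapes [0wx0000, 0wxD800, 0wxDFFF, 0wx0000]` yields the three codepoints `[0wx0, 0wx103FF, 0wx0]`, while a lone surrogate raises an exception.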
I took a look at the patch submitted to the legacy branch; however, it doesn't seem like it catches invalid surrogate pairs in the input stream:
Because the input is not valid UTF-8, the library should raise an exception rather than pass the invalid codepoint on to the caller.
That patch just deals with the problem that 4-byte UTF-8 sequences were resulting in the […]
I've pushed another patch to the legacy branch so that the […]
Description
Improvements
The current `ml-lpt` runtime and `JSONParser` library do not properly handle UTF-8.
The proposed improvements would solve this issue.
Unicode Input
The `ml-lpt` runtime library (specifically the `ULexBuffer` module) does not properly handle UTF-8 inputs, raising errors on valid bytes and ignoring others. Comparing `UTF8.getu` to `ULexBuffer.getu` reveals the issue: `ULexBuffer` never parses 4-byte Unicode sequences. An easy fix is to update these implementations to be in line with one another.
Additionally, the `ULexBuffer` module accepts invalid UTF-8 and, as a result, can produce output strings that are themselves invalid UTF-8. UTF-8-encoded strings are not allowed to use codepoints in the surrogate-pair range (0xD800-0xDFFF). From Wikipedia:
The current implementation of `ULexBuffer.getu` does not account for this.

Note: I do not propose changing the implementation of `UTF8.encode`, as it is sometimes useful to encode strings into WTF-8.

Unicode Escapes
The current implementation of the JSON lexer does not properly handle unicode inputs and escapes.
From the JSON Spec (pages 12-13):
However, the current version of the `JSONParser` library does not implement this required part of the spec: instead, the lexer simply encodes the escapes into WTF-8. The best fix is to specifically match for surrogate pairs in the lexer spec:
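Whatever form the lexer rules take, the value-level computation for a matched pair is fixed by the spec. A sketch, assuming `hi` and `lo` are the numeric values of two consecutive `\uXXXX` escapes (the function name is hypothetical):

```sml
(* Hypothetical helper: combine a high/low surrogate pair, as produced
 * by two consecutive \uXXXX escapes, into a single codepoint. *)
fun combineSurrogates (hi, lo) =
      if 0wxD800 <= hi andalso hi <= 0wxDBFF
         andalso 0wxDC00 <= lo andalso lo <= 0wxDFFF
        then 0wx10000 + Word.<< (hi - 0wxD800, 0w10) + (lo - 0wxDC00)
        else raise Fail "invalid surrogate pair"

(* e.g. combineSurrogates (0wxD834, 0wxDD1E) = 0wx1D11E, i.e. U+1D11E *)
```

Matching the pair as a unit in the lexer spec means a lone surrogate escape simply fails to match, which gives the required error behavior for free.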