Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

xyzhangfred · 2024-03-18T22:58:25Z

I've identified an issue in the split_str function where splitting a string at an offset that divides a surrogate pair results in incorrect behavior. Specifically, when split_str is called with an offset that bisects a surrogate pair, the entire character is retained in the left part of the split, leaving the right part empty. This can lead to inaccuracies in length calculations ( there are places in the code where the "offset" is taken to be the correct length of the left splice, after the split) and potential empty strings where they're not expected.

Example:

Input: split_str("🌉", 1, OffsetKind::Utf16)
Current Output: left = "🌉", right = ""
Expected Behavior: Ideally, the function should either somehow handle surrogate pairs gracefully (although I am not sure if that's possible with rust),or handle it in a way that does not lead to incorrect string lengths/ block splice lengths.

Temporary Workaround:
I've implemented a temporary workaround that replaces problematic splits with empty strings. This approach prevents crashes and highlights the issue, though it's far from an ideal solution.

This workaround is intended as a stopgap measure. I welcome suggestions for a more elegant and robust solution to this problem. Please feel free to discuss this further or propose alternatives.

…tf 16 surrogate pair

xyzhangfred · 2024-03-19T01:58:43Z

for reference: #386

Horusiath · 2024-03-19T04:28:21Z

I don't think that's the root issue, but nonetheless the fact that you've found a repeatable case is very helpful. I'll take a look at it during Thu-Fri. The fix will arrive as part of v0.18.1 release by the end of this week.

Horusiath · 2024-03-22T08:13:13Z

@xyzhangfred wouldn't you mind attaching the test case to recreate the issue? What I did was using test case from original issue:

let doc = Doc::with_options(Options {
    // use UTF-16, since this is the offset kind that Yjs uses (because of UTF16 string encoding in JavaScript)
    offset_kind: OffsetKind::Utf16, 
    ..Default::default()
});
let txt = doc.get_or_insert_text("quill");
let mut txn = doc.transact_mut();
let str = "🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩";
println!("length: {}", str.len());
txt.insert(&mut txn, 0, str);
txt.remove_range(&mut txn, 0, 5);
// txt: 🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩

However this test doesn't bring any errors. I've also checked this in Yjs:

const doc = new Y.Doc()
const txt = doc.getText('quill')
const str  ="🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"
txt.insert(0, str)
txt.delete(0, 5)
console.log(txt.toString()) // �🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩

Since the split point was defined in a middle of multi-code character it was replaced by surrogate.

xyzhangfred · 2024-03-23T00:09:42Z

Maybe it only happens in the C FFI? I used the following code to reproduce a crash (replacing main.cpp in y-crdt/tests-ffi, compile and run).

`#include <stdio.h>
#include <string.h>

extern "C" {
#include "include/libyrs.h"
};

int main(){

YDoc* doc = ydoc_new();
Branch* txt = ytext(doc, "quill");
YTransaction* txn = ydoc_write_transaction(doc, 0, NULL);
ytext_insert(txt, txn, 0, "🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩", NULL);
ytext_remove_range(txt, txn, 0, 5);

}`

Horusiath · 2024-03-27T09:33:32Z

After checking out this code more thoroughly, I see that your example couldn't work in the first place when using using the default OffsetKind (which targets UTF8 byte lenghts per character). Deleting byte-range of 0..5, when affected characters are 4 bytes long, means splitting a character in half which is not allowed operation in Rust.

Changing OffsetKind to Utf16 (so the same lengths as Yjs/JavaScript uses) fixes the issue. Example:

TEST_CASE("Unicode support") {
    YOptions o = yoptions();
    o.encoding = Y_OFFSET_UTF16;
    YDoc* doc = ydoc_new_with_options(o);
    Branch* txt = ytext(doc, "quill");
    YTransaction* txn = ydoc_write_transaction(doc, 0, NULL);

    ytext_insert(txt, txn, 0, u8"🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩", NULL);
    ytext_remove_range(txt, txn, 0, 5);

    char* actual = ytext_string(txt, txn);
    REQUIRE(!strcmp(actual, u8"🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"));

    ystring_destroy(actual);
    ytransaction_commit(txn);
    ydoc_destroy(doc);
}

handle the case where a split is about to happen in the middle of a u…

3d5f8fe

…tf 16 surrogate pair

xyzhangfred mentioned this pull request Mar 18, 2024

Delete set on compound emojis can lead to block with size 0, leading to crash due to underflow #386

Open

Horusiath mentioned this pull request Mar 26, 2024

added unicode tests to yffi #406

Closed

Horusiath mentioned this pull request Mar 27, 2024

Remove unnecessary Branch pointer wrapping in yffi event target #408

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

xyzhangfred commented Mar 18, 2024

xyzhangfred commented Mar 19, 2024

Horusiath commented Mar 19, 2024

Horusiath commented Mar 22, 2024 •

edited

xyzhangfred commented Mar 23, 2024

Horusiath commented Mar 27, 2024 •

edited

Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

Are you sure you want to change the base?

Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

Conversation

xyzhangfred commented Mar 18, 2024

xyzhangfred commented Mar 19, 2024

Horusiath commented Mar 19, 2024

Horusiath commented Mar 22, 2024 • edited

xyzhangfred commented Mar 23, 2024

Horusiath commented Mar 27, 2024 • edited

Horusiath commented Mar 22, 2024 •

edited

Horusiath commented Mar 27, 2024 •

edited