Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

xyzhangfred
Copy link

I've identified an issue in the split_str function where splitting a string at an offset that divides a surrogate pair results in incorrect behavior. Specifically, when split_str is called with an offset that bisects a surrogate pair, the entire character is retained in the left part of the split, leaving the right part empty. This can lead to inaccuracies in length calculations ( there are places in the code where the "offset" is taken to be the correct length of the left splice, after the split) and potential empty strings where they're not expected.

Example:

Input: split_str("🌉", 1, OffsetKind::Utf16)
Current Output: left = "🌉", right = ""
Expected Behavior: Ideally, the function should either somehow handle surrogate pairs gracefully (although I am not sure if that's possible with rust),or handle it in a way that does not lead to incorrect string lengths/ block splice lengths.

Temporary Workaround:
I've implemented a temporary workaround that replaces problematic splits with empty strings. This approach prevents crashes and highlights the issue, though it's far from an ideal solution.

This workaround is intended as a stopgap measure. I welcome suggestions for a more elegant and robust solution to this problem. Please feel free to discuss this further or propose alternatives.

@xyzhangfred
Copy link
Author

for reference: #386

@Horusiath
Copy link
Collaborator

I don't think that's the root issue, but nonetheless the fact that you've found a repeatable case is very helpful. I'll take a look at it during Thu-Fri. The fix will arrive as part of v0.18.1 release by the end of this week.

@Horusiath
Copy link
Collaborator

Horusiath commented Mar 22, 2024

@xyzhangfred wouldn't you mind attaching the test case to recreate the issue? What I did was using test case from original issue:

let doc = Doc::with_options(Options {
    // use UTF-16, since this is the offset kind that Yjs uses (because of UTF16 string encoding in JavaScript)
    offset_kind: OffsetKind::Utf16, 
    ..Default::default()
});
let txt = doc.get_or_insert_text("quill");
let mut txn = doc.transact_mut();
let str = "🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩";
println!("length: {}", str.len());
txt.insert(&mut txn, 0, str);
txt.remove_range(&mut txn, 0, 5);
// txt: 🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩

However this test doesn't bring any errors. I've also checked this in Yjs:

const doc = new Y.Doc()
const txt = doc.getText('quill')
const str  ="🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"
txt.insert(0, str)
txt.delete(0, 5)
console.log(txt.toString()) // �🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩

Since the split point was defined in a middle of multi-code character it was replaced by surrogate.

@xyzhangfred
Copy link
Author

Maybe it only happens in the C FFI? I used the following code to reproduce a crash (replacing main.cpp in y-crdt/tests-ffi, compile and run).

`#include <stdio.h>
#include <string.h>

extern "C" {
#include "include/libyrs.h"
};

int main(){

YDoc* doc = ydoc_new();
Branch* txt = ytext(doc, "quill");
YTransaction* txn = ydoc_write_transaction(doc, 0, NULL);
ytext_insert(txt, txn, 0, "🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩", NULL);
ytext_remove_range(txt, txn, 0, 5); 

}`

@Horusiath
Copy link
Collaborator

Horusiath commented Mar 27, 2024

After checking out this code more thoroughly, I see that your example couldn't work in the first place when using using the default OffsetKind (which targets UTF8 byte lenghts per character). Deleting byte-range of 0..5, when affected characters are 4 bytes long, means splitting a character in half which is not allowed operation in Rust.

Changing OffsetKind to Utf16 (so the same lengths as Yjs/JavaScript uses) fixes the issue. Example:

TEST_CASE("Unicode support") {
    YOptions o = yoptions();
    o.encoding = Y_OFFSET_UTF16;
    YDoc* doc = ydoc_new_with_options(o);
    Branch* txt = ytext(doc, "quill");
    YTransaction* txn = ydoc_write_transaction(doc, 0, NULL);

    ytext_insert(txt, txn, 0, u8"🇿🇿🇿🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩", NULL);
    ytext_remove_range(txt, txn, 0, 5);

    char* actual = ytext_string(txt, txn);
    REQUIRE(!strcmp(actual, u8"🇿🇩🇩🇩🇿🇩🇩🇩🇩🇩🇿🇩🇩🇿🇩🇿🇩"));

    ystring_destroy(actual);
    ytransaction_commit(txn);
    ydoc_destroy(doc);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants