Handle the case where a split is about to happen in the middle of a UTF-16 surrogate pair #399
I've identified an issue in the split_str function: splitting a string at an offset that falls inside a UTF-16 surrogate pair produces incorrect results. When split_str is called with such an offset, the entire character is kept in the left part of the split and the right part is left empty. This leads to inaccurate length calculations (there are places in the code where the offset is assumed to be the length of the left part after the split) and to empty strings where they are not expected.
Example:
Input: split_str("🌉", 1, OffsetKind::Utf16)
Current Output: left = "🌉", right = ""
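To make the numbers in this example concrete, here is a minimal standalone sketch (not code from this PR) showing why offset 1 is a problem for "🌉". The helper names `utf16_len` and `splits_surrogate_pair` are hypothetical and exist only for illustration; they use nothing beyond the Rust standard library.

```rust
/// Number of UTF-16 code units needed to encode `s`.
/// (Hypothetical helper, for illustration only.)
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count()
}

/// Returns true if `offset`, measured in UTF-16 code units, falls between
/// the two halves of a surrogate pair in `s`.
/// (Hypothetical helper, for illustration only.)
fn splits_surrogate_pair(s: &str, offset: usize) -> bool {
    let mut units = 0;
    for c in s.chars() {
        if units == offset {
            // The offset sits on a character boundary.
            return false;
        }
        let n = c.len_utf16();
        if n == 2 && units + 1 == offset {
            // The offset points at the low surrogate of this character.
            return true;
        }
        units += n;
    }
    false
}
```

With these helpers, `utf16_len("🌉")` is 2 and `splits_surrogate_pair("🌉", 1)` is true. If the whole character ends up in the left part, the left part has a UTF-16 length of 2, while callers that trust the offset will assume a length of 1.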
Expected Behavior: Ideally, the function should either handle surrogate pairs gracefully (although I am not sure that's possible in Rust), or handle the split in a way that does not produce incorrect string lengths or block splice lengths.
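One possible direction, sketched here purely as an assumption and not as what this PR implements, is to have the split report a mid-pair offset explicitly so that callers can decide how to adjust lengths. The names `split_at_utf16` and `SplitError` are hypothetical and are not part of the crate.

```rust
/// Why a UTF-16 offset could not be used as a split point.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum SplitError {
    /// The offset falls inside a surrogate pair; `char_start_utf16` is the
    /// UTF-16 offset at which the affected character begins.
    MidSurrogatePair { char_start_utf16: usize },
    /// The offset is past the end of the string.
    OutOfBounds,
}

/// Splits `s` at `offset`, measured in UTF-16 code units, and refuses to
/// split when the offset would bisect a surrogate pair.
/// (Hypothetical helper, for illustration only.)
fn split_at_utf16(s: &str, offset: usize) -> Result<(&str, &str), SplitError> {
    let mut units = 0;
    for (byte_idx, c) in s.char_indices() {
        if units == offset {
            // Character boundary: a byte-level split is safe here.
            return Ok(s.split_at(byte_idx));
        }
        let n = c.len_utf16();
        if units + n > offset {
            // Only reachable when n == 2 and offset == units + 1.
            return Err(SplitError::MidSurrogatePair { char_start_utf16: units });
        }
        units += n;
    }
    if units == offset {
        Ok((s, ""))
    } else {
        Err(SplitError::OutOfBounds)
    }
}
```

A caller could then choose to round the offset down to `char_start_utf16`, round it up past the pair, or propagate the error, instead of silently receiving a left part whose length does not match the offset.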
Temporary Workaround:
I've implemented a temporary workaround that replaces problematic splits with empty strings. This prevents crashes and makes the issue visible, though it is far from an ideal solution.
The workaround is intended as a stopgap. I welcome suggestions for a more elegant and robust fix, so please feel free to discuss this further or propose alternatives.