Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols #321
laithshadeed
changed the title
Replacing lodash string functions with native one requires special care for Unicode strings
Replacing lodash string functions with native one requires special care for Unicode strings with non-BMP symbols
Aug 19, 2021
Comparison of some methods:
const str = '🐅-👨👩👧-நி-깍-葛󠄀';

naive, split
str.split('');
// (20) ["\ud83d", '\udc05', '-', '\ud83d', '\udc68', '', '\ud83d', '\udc69', '', '\ud83d', '\udc67', '-', 'ந', 'ி', '-', '깍', '-', '葛', '\udb40', '\udd00']

slightly better, spread operator
[...str]
// (15) ["🐅", '-', '👨', '', '👩', '', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

In supported browsers, Intl.Segmenter
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨👩👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

graphemer 1.4.0
import Graphemer from 'graphemer';
const splitter = new Graphemer();
splitter.splitGraphemes(str);
// (9) ["🐅", '-', '👨👩👧', '-', 'நி', '-', '깍', '-', '葛󠄀']

lodash 4.17.10
import _ from 'lodash';
_.split(str, '');
// (11) ["🐅", '-', '👨👩👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

fabric.js v6.0.0-beta10 graphemeSplit (internal function)
import { graphemeSplit } from './fabric_graphemeSplit';
graphemeSplit(str);
// (15) ["🐅", '-', '👨', '', '👩', '', '👧', '-', 'ந', 'ி', '-', '깍', '-', '葛', '󠄀']

@formatjs Intl.Segmenter 11.4.2 polyfill
await import('@formatjs/intl-segmenter/polyfill-force');
[...new Intl.Segmenter().segment(str)].map((g) => g.segment);
// (9) ["🐅", '-', '👨👩👧', '-', 'நி', '-', '깍', '-', '葛󠄀']
For example, the native split function
'😀-hi-🐅'.split('')
will break your string, compared to lodash's
_.split('😀-hi-🐅', '')
because it fails to recognize an emoji as a single symbol and instead splits its surrogate pair into two pieces. This is the same reason why calling length on an emoji returns two instead of one:
'😀'.length
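To make the surrogate-pair point concrete, here is a minimal sketch that inspects the emoji at the UTF-16 level. The variable names are only illustrative, and the emoji is written as an escape so its code point (U+1F600) is explicit.

```javascript
// '😀' is U+1F600, a non-BMP symbol, written here with a code point escape.
const grin = '\u{1F600}';

// length counts UTF-16 code units, so a non-BMP symbol counts as two.
console.log(grin.length); // 2

// split('') cuts between code units, exposing the high and low surrogates.
const units = grin.split('').map((u) => u.charCodeAt(0).toString(16));
console.log(units); // ['d83d', 'de00']

// codePointAt reads the whole code point that starts at the given index.
console.log(grin.codePointAt(0).toString(16)); // '1f600'
```

Each half on its own is a lone surrogate, which is why the pieces returned by split('') are not valid symbols by themselves.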
Lodash takes special care if your string has non-BMP symbols, for example emojis. To correctly split '😀-hi-🐅', you can use the spread operator:
[...'😀-hi-🐅']
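As a rough illustration of why this works, string iteration walks code points rather than UTF-16 code units. The sketch below wraps that in two helpers; `toCodePoints` and `codePointLength` are hypothetical names, not part of lodash or any standard API.

```javascript
// Hypothetical helpers (not from lodash): split and measure a string by code points.
function toCodePoints(str) {
  // String iteration yields whole code points, so surrogate pairs stay together.
  return Array.from(str);
}

function codePointLength(str) {
  return toCodePoints(str).length;
}

const sample = '\u{1F600}-hi-\u{1F405}'; // '😀-hi-🐅'
console.log(sample.split('').length); // 8 UTF-16 code units
console.log(toCodePoints(sample));    // ['😀', '-', 'h', 'i', '-', '🐅']
console.log(codePointLength(sample)); // 6
```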
But even the spread operator does not handle grapheme clusters. For that, you need the Unicode Text Segmentation algorithm. Chrome has implemented this algorithm as Intl.Segmenter since version 87. You can use it like this:
[...(new Intl.Segmenter).segment('😀-hi-🐅')].map(x => x.segment)
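Since Intl.Segmenter is not available in every runtime, one possible approach is to feature-detect it and fall back to code points. This is only a sketch: `toGraphemes` is a hypothetical helper name, and the fallback branch still splits grapheme clusters apart, so it degrades rather than fails.

```javascript
// Hypothetical helper: split into grapheme clusters when Intl.Segmenter exists,
// otherwise fall back to code points (which still breaks clusters apart).
function toGraphemes(str) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(str), (s) => s.segment);
  }
  return Array.from(str);
}

// Man + ZWJ + woman + ZWJ + girl: five code points, one visible symbol.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log([...family].length);         // 5 code points
console.log(toGraphemes(family).length); // 1 where Intl.Segmenter is available
```

Where the fallback is not good enough, a polyfill or a library such as graphemer can fill the gap.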
More about Unicode issues in JavaScript: https://mathiasbynens.be/notes/javascript-unicode
Happy passing emojis around 😀