lib/fs: Add ASCII fastpath for normalization #9365

bt90 · 2024-01-23T13:15:21Z

Purpose

Avoids unicode.ToLower(unicode.ToUpper(r)) and norm.NFC.String(s) for simple ASCII filenames.

lib/fs/folding.go

bt90 · 2024-01-23T14:16:31Z

lib/fs/folding.go

calmh · 2024-01-23T16:58:53Z

Test case and benchmark... how much faster is this than the old stuff, for the various cases, since it's a performance optimisation?

AudriusButkevicius · 2024-01-23T19:04:43Z

Yes, benchmarks please.

bt90 · 2024-01-24T10:33:16Z

Added a benchmark. Comparison with the previous implementation:

goos: linux
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: AMD EPYC 7763 64-Core Processor                
                                     │ /tmp/old.txt  │            /tmp/new.txt             │
                                     │    sec/op     │   sec/op     vs base                │
UnicodeLowercase/lowercase_ASCII-2      48.34n ±  7%   17.25n ± 8%  -64.32% (p=0.000 n=10)
UnicodeLowercase/mixedcase_ASCII-2     191.45n ± 14%   75.91n ± 6%  -60.35% (p=0.000 n=10)
UnicodeLowercase/lowercase_unicode-2    127.2n ±  4%   235.0n ± 7%  +84.75% (p=0.000 n=10)
UnicodeLowercase/mixedcase_unicode-2    347.2n ±  7%   315.6n ± 6%   -9.13% (p=0.001 n=10)
UnicodeLowercase/multibyte_unicode-2    537.2n ±  6%   586.1n ± 3%   +9.10% (p=0.000 n=10)
geomean                                 185.5n         141.6n       -23.67%

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/lowercase_ASCII-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_ASCII-2     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/lowercase_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_unicode-2   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/multibyte_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/lowercase_ASCII-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_ASCII-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/lowercase_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/mixedcase_unicode-2   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/multibyte_unicode-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

The ASCII fastpath seems to be working as intended but there's still a severe regression for the lowercase unicode testcase.

rasa · 2024-01-24T16:39:12Z

lib/fs/folding.go

- for _, r := range s[i:] {
- rs.WriteRune(unicode.ToLower(unicode.ToUpper(r)))
+func toLower(r rune) rune {
+ if r <= unicode.MaxASCII {


This could be optimized another ~50% via this bikeshed: https://stackoverflow.com/a/77689444/1432614

We're dealing with 4 cases:

ASCII lower -> no-op

ASCII mixed -> replace uppercase ones

Unicode lower -> apply NFC

Unicode mixed -> replace uppercase ones and apply NFC

The linked logic only offers fast ASCII detection but lacks the upper/lower distinction.

bt90 · 2024-01-25T11:15:32Z

Doing things in single pass:

goos: linux
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: AMD EPYC 7763 64-Core Processor                
                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │    sec/op    │   sec/op     vs base                │
UnicodeLowercase/ASCII_lowercase-2      47.80n ± 5%   17.77n ± 9%  -62.83% (p=0.000 n=10)
UnicodeLowercase/ASCII_mixedcase-2     193.65n ± 9%   53.66n ± 3%  -72.29% (p=0.000 n=10)
UnicodeLowercase/Unicode_lowercase-2    124.3n ± 2%   161.6n ± 7%  +29.96% (p=0.000 n=10)
UnicodeLowercase/Unicode_mixedcase-2    345.9n ± 9%   292.4n ± 7%  -15.48% (p=0.000 n=10)
UnicodeLowercase/Unicode_multibyte-2    538.5n ± 2%   564.7n ± 5%   +4.86% (p=0.000 n=10)
geomean                                 184.6n        120.5n       -34.71%

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/ASCII_lowercase-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-2     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-2   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │ /tmp/old.txt │            /tmp/new.txt             │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/ASCII_lowercase-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-2   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

imsodin

Orthogonal idea from looking at the pprof in https://forum.syncthing.net/t/380k-files-24gb-in-directory-results-in-very-slow-sync-8-days/21488/26:
Quite a bit of time is spent on memory management - re-using bytes slices with a pool might help here.

lib/fs/folding_test.go

imsodin · 2024-01-28T11:29:01Z

lib/fs/folding.go

+ for i := 0; i < len(s); i++ {
+ c := s[i]
+ if c > unicode.MaxASCII {
+ return false, isLower


Given unicode_lowercase is the slow bench result with the new implementatin, maybe it would be worth testing keeping that behaviour from the old code. I.e. when encountering non-ascii, keep checking if there's any upper-case chars (and then exit). So you can then take the lower-case-unicode shortcut from before. This will make the unicode-mixed-case slower, but I wouldn't be surprised if not by a lot compared to the gains for unicode-lower.

That's the point I don't really understand. golangs strings.Map() has that shortcut built into it:

https://cs.opensource.google/go/go/+/refs/tags/go1.21.6:src/strings/strings.go;l=489-517

My previous (flawed?) mixedcase unicode benchmark case should have stopped at the very first byte in the ASCII detection. The rest of the code looks pretty much the same to me? Map the runes and if a case change is detected, assign a string builder with the unchanged part and write mapped runes after that.

I completely glossed over the usage of strings.Map xD

That's still one or two more loops. If you did the unicode lower detection in our initial loop that does ascii detection, and then not use strings.Map, we get rid of part of a loop in both the unicode cases.
Now that also makes me wonder if we could wrap isASCII and toLowerAscii into a mapping function for strings.Map for less code branches/complexity, and equal, maybe even slightly better performance :)

Also we are now mixing two classes of optimisations: Improving things about the unicode path, and adding an ascii shortcut. Maybe worth doing that separately for optimal results. Just an idle, optional suggestion though.

The ASCII path iterates over single bytes which is fast but doesn't work as soon as any unicode characters are involved. Iterating over runes is a tad slower but necessary in that case.

So you either end up with two loops or one rune based loop which is slower for ASCII inputs.

I also tried to get rid of lower(upper(r)) as far as possible as there are only 23(!) characters in unicode that satisfy lower(r) != lower(upper(r))

Found it, but apparently the type isn't exported by the unicode package which renders the SpecialCase unusable for us.

Naive code generation attempt:

import ( "fmt" "unicode" ) func main() { fmt.Println("unicode.SpecialCase{") for r := rune(0); r <= unicode.MaxRune; r++ { lower := unicode.ToLower(r) lowerUpper := unicode.ToLower(unicode.ToUpper(r)) if lower == lowerUpper { continue } hr := fmt.Sprintf("0x%04x", r) lr := fmt.Sprintf("0x%04x", lowerUpper) fmt.Printf("\tunicode.CaseRange{%s, %s, d{0, %s - %s, 0}}, // %s\n", hr, hr, lr, hr, string(r)) } fmt.Println("}") }

unicode.SpecialCase{ unicode.CaseRange{0x00b5, 0x00b5, d{0, 0x03bc - 0x00b5, 0}}, // µ unicode.CaseRange{0x0131, 0x0131, d{0, 0x0069 - 0x0131, 0}}, // ı unicode.CaseRange{0x017f, 0x017f, d{0, 0x0073 - 0x017f, 0}}, // ſ unicode.CaseRange{0x0345, 0x0345, d{0, 0x03b9 - 0x0345, 0}}, // ͅ unicode.CaseRange{0x03c2, 0x03c2, d{0, 0x03c3 - 0x03c2, 0}}, // ς unicode.CaseRange{0x03d0, 0x03d0, d{0, 0x03b2 - 0x03d0, 0}}, // ϐ unicode.CaseRange{0x03d1, 0x03d1, d{0, 0x03b8 - 0x03d1, 0}}, // ϑ unicode.CaseRange{0x03d5, 0x03d5, d{0, 0x03c6 - 0x03d5, 0}}, // ϕ unicode.CaseRange{0x03d6, 0x03d6, d{0, 0x03c0 - 0x03d6, 0}}, // ϖ unicode.CaseRange{0x03f0, 0x03f0, d{0, 0x03ba - 0x03f0, 0}}, // ϰ unicode.CaseRange{0x03f1, 0x03f1, d{0, 0x03c1 - 0x03f1, 0}}, // ϱ unicode.CaseRange{0x03f5, 0x03f5, d{0, 0x03b5 - 0x03f5, 0}}, // ϵ unicode.CaseRange{0x1c80, 0x1c80, d{0, 0x0432 - 0x1c80, 0}}, // ᲀ unicode.CaseRange{0x1c81, 0x1c81, d{0, 0x0434 - 0x1c81, 0}}, // ᲁ unicode.CaseRange{0x1c82, 0x1c82, d{0, 0x043e - 0x1c82, 0}}, // ᲂ unicode.CaseRange{0x1c83, 0x1c83, d{0, 0x0441 - 0x1c83, 0}}, // ᲃ unicode.CaseRange{0x1c84, 0x1c84, d{0, 0x0442 - 0x1c84, 0}}, // ᲄ unicode.CaseRange{0x1c85, 0x1c85, d{0, 0x0442 - 0x1c85, 0}}, // ᲅ unicode.CaseRange{0x1c86, 0x1c86, d{0, 0x044a - 0x1c86, 0}}, // ᲆ unicode.CaseRange{0x1c87, 0x1c87, d{0, 0x0463 - 0x1c87, 0}}, // ᲇ unicode.CaseRange{0x1c88, 0x1c88, d{0, 0xa64b - 0x1c88, 0}}, // ᲈ unicode.CaseRange{0x1e9b, 0x1e9b, d{0, 0x1e61 - 0x1e9b, 0}}, // ẛ unicode.CaseRange{0x1fbe, 0x1fbe, d{0, 0x03b9 - 0x1fbe, 0}}, // ι }

If the exceptions are that few it certainly seems valid to pre-generate the table and then use that plus regular ToLower in order to save on the more complex operations.

Oh, the unexported d is a bummer. Wonder what the thinking there is.

I played a bit with unicode.RangeTable but couldn't improve the performance:

var complexFolds = &unicode.RangeTable{ R16: []unicode.Range16{ {181, 305, 124}, {383, 837, 454}, {962, 976, 14}, {977, 981, 4}, {982, 1008, 26}, {1009, 1013, 4}, {7296, 7304, 1}, {7835, 8126, 291}, }, R32: []unicode.Range32{}, LatinOffset: 0, } func toLower(r rune) rune { if r <= unicode.MaxASCII { if r < 'A' || 'Z' < r { return r } return r + 'a' - 'A' } if unicode.Is(complexFolds, r) { return unicode.To(unicode.LowerCase, unicode.To(unicode.UpperCase, r)) } return unicode.To(unicode.LowerCase, r) }

bt90 · 2024-01-28T11:46:37Z

Latest numbers:

goos: windows
goarch: amd64
pkg: github.com/syncthing/syncthing/lib/fs
cpu: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
                                     │   old.txt    │               new.txt               │
                                     │    sec/op    │   sec/op     vs base                │
UnicodeLowercase/ASCII_lowercase-8      36.62n ± 1%   12.91n ± 1%  -64.75% (p=0.000 n=10)
UnicodeLowercase/ASCII_mixedcase-8     181.95n ± 1%   63.24n ± 1%  -65.25% (p=0.000 n=10)
UnicodeLowercase/Unicode_lowercase-8    111.2n ± 1%   125.1n ± 1%  +12.51% (p=0.000 n=10)
UnicodeLowercase/Unicode_mixedcase-8    290.2n ± 0%   188.7n ± 1%  -34.96% (p=0.000 n=10)
UnicodeLowercase/Unicode_multibyte-8    479.7n ± 0%   488.5n ± 0%   +1.85% (p=0.000 n=10)
geomean                                 159.5n        98.80n       -38.04%

                                     │   old.txt    │               new.txt               │
                                     │     B/op     │    B/op     vs base                 │
UnicodeLowercase/ASCII_lowercase-8     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-8     24.00 ± 0%     24.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-8   32.00 ± 0%     32.00 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                     │   old.txt    │               new.txt               │
                                     │  allocs/op   │ allocs/op   vs base                 │
UnicodeLowercase/ASCII_lowercase-8     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/ASCII_mixedcase-8     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_lowercase-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_mixedcase-8   1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=10) ¹
UnicodeLowercase/Unicode_multibyte-8   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                           ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

rdslw · 2024-03-24T13:48:48Z

@bt90 Would you like me to check some test build on this very machine, let me know. Eager to see this implemented/released.

bt90 · 2024-03-24T14:13:58Z

Unfortunately my time is rather limited these days. We moved last week and I still have a lot of work to do around the house.

But feel free to test and post your results.

bt90 added 2 commits January 23, 2024 13:04

Add ascii fastpath

8f744d8

Use WriteByte for single byte runes

18ca1f6

calmh reviewed Jan 23, 2024

View reviewed changes

lib/fs/folding.go Outdated Show resolved Hide resolved

bt90 added 2 commits January 23, 2024 15:50

Use strings.Map

d358a61

lower ascii fastpath

79a74db

bt90 commented Jan 23, 2024

View reviewed changes

lib/fs/folding.go Outdated Show resolved Hide resolved

bt90 requested a review from calmh January 23, 2024 16:50

bt90 added 3 commits January 24, 2024 09:11

Test ascii detection

97a209c

Extract test cases

3008c6f

Add ascii vs unicode benchmark

28002a7

bt90 added 3 commits January 24, 2024 11:28

lower ascii noop

b2a1dd2

Rename testcases

3b8f86b

upper ascii fastpath

8a9806e

rasa reviewed Jan 24, 2024

View reviewed changes

WIP singlepass

90e76b1

bt90 added 5 commits January 25, 2024 11:41

wip extract functions

de792df

early continue

946cdbb

skip ascii aware ToX

80a7236

Add Latin1 fastpath

ff0c88c

avoid utf8 import

ad5c4ae

imsodin reviewed Jan 28, 2024

View reviewed changes

bt90 added 2 commits January 28, 2024 13:00

Add more benchmark cases

dcdf29d

Rename testcases

37cb464

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lib/fs: Add ASCII fastpath for normalization #9365

lib/fs: Add ASCII fastpath for normalization #9365

bt90 commented Jan 23, 2024

bt90 commented Jan 23, 2024

calmh commented Jan 23, 2024

AudriusButkevicius commented Jan 23, 2024

bt90 commented Jan 24, 2024

rasa Jan 24, 2024

bt90 Jan 26, 2024

bt90 commented Jan 25, 2024

imsodin left a comment

imsodin Jan 28, 2024

bt90 Jan 28, 2024

imsodin Jan 28, 2024

imsodin Jan 28, 2024

bt90 Jan 28, 2024

bt90 Jan 29, 2024

bt90 Jan 29, 2024 •

edited

calmh Jan 31, 2024

calmh Jan 31, 2024 •

edited

bt90 Jan 31, 2024

bt90 commented Jan 28, 2024

rdslw commented Mar 24, 2024

bt90 commented Mar 24, 2024

lib/fs: Add ASCII fastpath for normalization #9365

Are you sure you want to change the base?

lib/fs: Add ASCII fastpath for normalization #9365

Conversation

bt90 commented Jan 23, 2024

Purpose

bt90 commented Jan 23, 2024

calmh commented Jan 23, 2024

AudriusButkevicius commented Jan 23, 2024

bt90 commented Jan 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bt90 commented Jan 25, 2024

imsodin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bt90 Jan 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

calmh Jan 31, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bt90 commented Jan 28, 2024

rdslw commented Mar 24, 2024

bt90 commented Mar 24, 2024

bt90 Jan 29, 2024 •

edited

calmh Jan 31, 2024 •

edited