
feat(lexer): add SIMD optimization to the lexer #2338

Draft · wants to merge 58 commits into base: main
Conversation

@dyxushuai commented Feb 7, 2024

Related: #2296

Summary

This implementation can match multiple delimiters (0..256) across multiple bytes at once. For example, on a CPU that supports the AVX2 instruction set, if we have 10 delimiters to match, we can check 32 bytes against all 10 delimiters in a single pass. (A rough sketch of this idea is given after the lists below.)

Supported features:

  • AVX2 (256-bit) on x86
  • SSE4.2 (128-bit) on x86
  • NEON (128-bit) on aarch64

Supported pattern

Tricky issues:

  • Use u8 instead of char with SIMD.
    • Refactor the Source API.
    • Refactor the character-searching methods.
  • Pad the input up to the SIMD instruction width when the remaining length is not enough.
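
To make the approach concrete, here is a minimal sketch (not the PR's actual code) of matching a delimiter set against one 32-byte chunk with AVX2; the function name and calling convention are hypothetical:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns the index of the first delimiter in `chunk`, if any.
/// Hypothetical illustration; the PR derives its delimiter sets from match tables.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_first_delimiter(chunk: &[u8; 32], delimiters: &[u8]) -> Option<usize> {
    // Load 32 input bytes (unaligned load is fine).
    let data = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    // Accumulate matches across all delimiters.
    let mut matches = _mm256_setzero_si256();
    for &d in delimiters {
        // Broadcast the delimiter to all 32 lanes and compare byte-wise.
        let needle = _mm256_set1_epi8(d as i8);
        matches = _mm256_or_si256(matches, _mm256_cmpeq_epi8(data, needle));
    }
    // One bit per byte: bit i is set if chunk[i] matched any delimiter.
    let mask = _mm256_movemask_epi8(matches) as u32;
    if mask == 0 { None } else { Some(mask.trailing_zeros() as usize) }
}

The broadcast / compare / OR / movemask structure is the core pattern; how the delimiter set is encoded and how results feed back into the lexer are where the real implementation differs.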

@github-actions github-actions bot added the A-parser Area - Parser label Feb 7, 2024
@dyxushuai dyxushuai marked this pull request as draft February 7, 2024 10:33
@dyxushuai (Author)

+r @Boshen @overlookmotel
Please feel free to give any suggestions.

@overlookmotel (Collaborator) commented Feb 9, 2024

Hi @dyxushuai.

I'm afraid I'm having difficulty reviewing this, as it's quite complex and you're covering all the bases in one go. Can I suggest a few things which would make it simpler to understand:

  1. For now, only implement one arch (I suggest x86_64 SSE/AVX2, as it'll then run on CI).
  2. Check for CPU support at compile time only. Only add the complications for checking at runtime once the main impl is solid.
  3. Use the new additions to Source / SourcePosition APIs which were added in last 24 hours.
  4. Only optimize the main loop. There's now a more efficient scalar version implemented with a batch-search / byte-by-byte fallback structure. Hopefully this will be easier to build on top of. And once some tweaks to it in perf(parser): optimize lexing strings #2366 are merged, it should not change for a while.
  5. Comment the code more fully. e.g. What is arf and where do the values in it come from? What does each __mm256 instruction do?

Ideally, CI would also pass. Though if that's difficult, feel free to ask for review again without that.

@dyxushuai (Author) commented Feb 11, 2024

I am on vacation and have been offline for a while.

  • For now, only implement one arch (I suggest x86_64 SSE/AVX2, as it'll then run on CI).

Yes, and I will add more test cases to my implementation.

  • Check for CPU support at compile time only. Only add the complications for checking at runtime once the main impl is solid.

Sure. I'll always try to minimize overhead as much as possible.
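
For context, a minimal sketch of the difference between the two approaches; match_chunk_avx2 and match_chunk_scalar are hypothetical stand-ins, not functions from this PR:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn match_chunk_avx2(chunk: &[u8; 32]) -> u32 {
    // Real AVX2 code would go here; the sketch just reuses the scalar model.
    match_chunk_scalar(chunk)
}

fn match_chunk_scalar(chunk: &[u8; 32]) -> u32 {
    // Bit i is set if chunk[i] is a delimiter (whitespace here, as a stand-in).
    chunk.iter().enumerate().fold(0u32, |mask, (i, &b)| {
        if b == b' ' || b == b'\t' { mask | (1 << i) } else { mask }
    })
}

// Compile-time selection: resolved by RUSTFLAGS='-C target-feature=+avx2'.
// No per-call overhead, but the resulting binary requires AVX2.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn match_chunk(chunk: &[u8; 32]) -> u32 {
    unsafe { match_chunk_avx2(chunk) }
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn match_chunk(chunk: &[u8; 32]) -> u32 {
    match_chunk_scalar(chunk)
}

// Runtime selection: one portable binary, but the detection (or a cached
// function pointer) is exactly the complication being deferred for now.
fn match_chunk_dynamic(chunk: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { match_chunk_avx2(chunk) };
        }
    }
    match_chunk_scalar(chunk)
}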

  • Use the new additions to Source / SourcePosition APIs which were added in last 24 hours.

Cool

  • Only optimize the main loop. There's now a more efficient scalar version implemented with a batch-search / byte-by-byte fallback structure. Hopefully this will be easier to build on top of. And once some tweaks to it in perf(parser): optimize lexing strings #2366 are merged, it should not change for a while.

Ok.

  • Comment the code more fully. e.g. What is arf and where do the values in it come from? What does each __mm256 instruction do?

I will explain the instructions as much as possible. More details here: seanmonstar/httparse#89 (comment)

@Brooooooklyn

Are you planning to implement WASM SIMD128 in this pull request?

@dyxushuai (Author)

All tests passed; next I will add a benchmark comparison.

@dyxushuai (Author) commented Feb 13, 2024

Currently, we only support SIMD matching for ASCII codes (< 128) because NEON supports only 6x16 vectors. If we aimed to handle all u8::MAX bytes at once, we would need at least 16x16 vector support. That said, in NEON we can split the lookups into two parts, table_lo and table_hi. As a result, some match tables cannot currently be accelerated with SIMD. For example:

static LINE_BREAK_TABLE: SafeByteMatchTable =
    safe_byte_match_table!(|b| matches!(b, b'\r' | b'\n' | LS_OR_PS_FIRST));
static MULTILINE_COMMENT_START_TABLE: SafeByteMatchTable =
    safe_byte_match_table!(|b| matches!(b, b'*' | b'\r' | b'\n' | LS_OR_PS_FIRST));

LS_OR_PS_FIRST is not an ASCII code (it is 226), so perhaps we can split this into two matches: b'*' | b'\r' | b'\n' on the hot path and LS_OR_PS_FIRST on a cold path.
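
For readers without the context: the usual SIMD byte-set match splits each byte into its low and high nibble and looks both up in 16-entry tables, ANDing the results. That is why an ASCII-only set (high nibble 0..=7) fits into one 8-bit mask per entry, while a byte like 226 (high nibble 14) does not. A scalar model of the trick, with hypothetical helper names:

/// Build the two 16-entry nibble tables for an ASCII-only byte set.
/// Returns None if the set contains a non-ASCII byte (it would need a bit >= 8).
fn build_nibble_tables(set: &[u8]) -> Option<([u8; 16], [u8; 16])> {
    let (mut table_lo, mut table_hi) = ([0u8; 16], [0u8; 16]);
    for &b in set {
        if b >= 128 {
            return None; // e.g. LS_OR_PS_FIRST (226) cannot be encoded this way
        }
        let bit = 1u8 << (b >> 4); // high nibble 0..=7 maps to one of 8 bits
        table_lo[(b & 0x0F) as usize] |= bit;
        table_hi[(b >> 4) as usize] |= bit;
    }
    Some((table_lo, table_hi))
}

/// A byte matches iff its lo-nibble entry and hi-nibble entry share a bit.
/// With SIMD this becomes two table lookups (pshufb / vqtbl1q_u8) plus an AND,
/// applied to 16 or 32 bytes at a time.
fn is_match(b: u8, table_lo: &[u8; 16], table_hi: &[u8; 16]) -> bool {
    (table_lo[(b & 0x0F) as usize] & table_hi[(b >> 4) as usize]) != 0
}

fn main() {
    let (lo, hi) = build_nibble_tables(&[b'*', b'\r', b'\n']).unwrap();
    assert!(is_match(b'\n', &lo, &hi));
    assert!(!is_match(b'x', &lo, &hi));
    assert!(build_nibble_tables(&[b'*', b'\r', b'\n', 226]).is_none());
}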

@dyxushuai (Author) commented Feb 13, 2024

@overlookmotel @Brooooooklyn @Boshen

Bench parser

For now, we only support parsing String, Whitespace, and Identifiers with SIMD, and the performance of parsing the small file RadixUIAdoptionSection.jsx shows some degradation.

Platforms

  • x86_64: Ryzen 7840HS
  • aarch64: Apple M1 Pro

Commands

  • baseline: cargo bench -p oxc_benchmark --bench parser -- --save-baseline 1
  • AVX2: RUSTFLAGS='-C target-feature=+avx2' cargo bench -p oxc_benchmark --bench parser -- --baseline 1
  • SSE4.2: RUSTFLAGS='-C target-feature=+sse4.2' cargo bench -p oxc_benchmark --bench parser -- --baseline 1

With AVX2

parser/checker.ts       time:   [13.561 ms 13.583 ms 13.607 ms]
                        change: [-7.8967% -7.7080% -7.5257%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) high mild
parser/cal.com.tsx      time:   [6.2570 ms 6.2633 ms 6.2704 ms]
                        change: [-4.9506% -4.8115% -4.6668%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
parser/RadixUIAdoptionSection.jsx
                        time:   [13.837 µs 13.862 µs 13.890 µs]
                        change: [+1.7289% +1.9880% +2.2377%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
parser/pdf.mjs          time:   [4.4353 ms 4.4394 ms 4.4440 ms]
                        change: [-5.4993% -5.3649% -5.2352%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
parser/antd.js          time:   [27.754 ms 27.782 ms 27.812 ms]
                        change: [-4.6780% -4.5492% -4.4233%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe

With SSE4.2

parser/checker.ts       time:   [14.199 ms 14.216 ms 14.234 ms]
                        change: [-5.4070% -5.2090% -5.0262%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
parser/cal.com.tsx      time:   [6.3727 ms 6.3808 ms 6.3898 ms]
                        change: [-3.4983% -3.3151% -3.1209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
parser/RadixUIAdoptionSection.jsx
                        time:   [13.954 µs 13.994 µs 14.037 µs]
                        change: [+1.6748% +1.9549% +2.2534%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
parser/pdf.mjs          time:   [4.5373 ms 4.5431 ms 4.5494 ms]
                        change: [-3.8748% -3.6909% -3.4907%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
parser/antd.js          time:   [28.388 ms 28.422 ms 28.458 ms]
                        change: [-4.3719% -4.1995% -4.0308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

With NEON

TODO

@overlookmotel (Collaborator) commented Feb 13, 2024

@dyxushuai This is looking really great. Like I said, I won't have time to get into it properly this week. But at a first glance it's shaping up really nicely. Can't wait to see the benchmarks.

A couple of points for OXC maintainers:

Is there any reason not to run CI and benchmarks on this now? I don't want to press the button myself, since I'm not sure why it's not automatic anyway.

Regardless of the results and the quality of the implementation, I don't think this is safe to merge yet. It introduces different implementations for different processors, and to make sure nothing breaks or regresses (either now or in the future), it'd be ideal to have tests, conformance, and benchmarks run on all of these targets (as discussed in #2285).

Either we'd need to get all that in place before merging, or possibly the parts which target the architectures which are already tested on CI could be merged sooner, and the rest later (as long as the scalar fallbacks which run on other archs for now are tested+benched too).

This isn't a negative on your fine work, dyxushuai; it's just that this is a major change, and I think we need to tread carefully.

@dyxushuai (Author)

Are you planning to implement WASM SIMD128 in this pull request?

I see that WASM has enough SIMD instructions to implement the same lookup-table algorithm as the other platforms: https://doc.rust-lang.org/beta/core/arch/wasm/fn.i8x16_shuffle.html

@Boshen (Member) commented Feb 13, 2024

@dyxushuai Try https://github.com/BurntSushi/critcmp


This whole PR is shaping up really nicely. I think the hardest part for us maintainers is how we continuously integrate SIMD into the codebase; let me take a few days to think about this.


I'd like to invite some SIMD experts to review this PR. Is it possible to document the entry functions so that anyone can read the code from top to bottom without knowing any context?

@Boshen (Member) commented Feb 13, 2024

@dyxushuai If this is fun for you, and if you are willing to maintain a separate crate for all the SIMD stuff inside the oxc project, I'll give you all the correct permissions so we can set up proper test infrastructure in another GitHub repository.

codspeed-hq bot commented Feb 13, 2024

CodSpeed Performance Report

Merging #2338 will degrade performances by 35.07%

Comparing dyxushuai:feat/simd_in_lexer (03ee45d) with main (83cb78f)

Summary

❌ 15 regressions
✅ 12 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main dyxushuai:feat/simd_in_lexer Change
lexer[RadixUIAdoptionSection.jsx] 118.1 µs 163.5 µs -27.73%
lexer[antd.js] 127.4 ms 185.7 ms -31.4%
lexer[cal.com.tsx] 30.9 ms 47.2 ms -34.57%
lexer[checker.ts] 69.8 ms 107.5 ms -35.07%
lexer[pdf.mjs] 19.4 ms 29.1 ms -33.47%
minifier[react.development.js] 10.3 ms 11.2 ms -7.86%
minifier[typescript.js] 1.5 s 1.7 s -7.78%
parser[RadixUIAdoptionSection.jsx] 386 µs 407.6 µs -5.29%
parser[antd.js] 664.9 ms 722.4 ms -7.96%
parser[cal.com.tsx] 149.7 ms 164.2 ms -8.83%
parser[checker.ts] 340.8 ms 378.4 ms -9.95%
parser[pdf.mjs] 107.3 ms 117.1 ms -8.33%
transformer[antd.js] 1.6 s 1.7 s -3.46%
transformer[checker.ts] 1.1 s 1.1 s -3.45%
transformer[pdf.mjs] 283.9 ms 293.7 ms -3.32%

@dyxushuai (Author)

@dyxushuai If this is fun for you, and if you are willing to maintain a separate crate for all the SIMD stuff inside the oxc project, I'll give you all the correct permissions so we can set up proper test infrastructure in another GitHub repository.

Cool, I want to try.

@Boshen (Member) commented Feb 13, 2024

@dyxushuai https://github.com/oxc-project/oxc-simd/invitations

@dyxushuai (Author) commented Feb 14, 2024

CodSpeed Performance Report

Merging #2338 will degrade performances by 35.07%

❌ 15 regressions ✅ 12 untouched benchmarks

I have identified the root cause: padding was applied to buffers that did not have enough length for ALIGNMENT, and then more padding was repeatedly added to the end whenever the remaining input had not been fully consumed.

(screenshot)

I will fall back to using the SafeByteMatchTable without SIMD.
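
As an illustration of the direction (an assumption about the intended structure, not this PR's actual code), the tail can be padded once into a stack buffer instead of being re-padded on every read:

const ALIGNMENT: usize = 32; // e.g. the AVX2 register width

/// Walk the input in ALIGNMENT-sized chunks; only the final partial chunk is
/// copied into a zero-padded buffer, and only once. Scalar stand-in with
/// hypothetical names; `process` models the SIMD chunk routine.
fn for_each_chunk(input: &[u8], mut process: impl FnMut(&[u8; ALIGNMENT])) {
    let mut chunks = input.chunks_exact(ALIGNMENT);
    for chunk in &mut chunks {
        process(chunk.try_into().unwrap());
    }
    let tail = chunks.remainder();
    if !tail.is_empty() {
        // The padding byte (0 here) must either never match the table, or the
        // caller must ignore match bits at positions >= tail.len().
        let mut padded = [0u8; ALIGNMENT];
        padded[..tail.len()].copy_from_slice(tail);
        process(&padded);
    }
}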

@overlookmotel (Collaborator)

@dyxushuai Sorry for the long silence. I'll be able to review this over the weekend.

Could you possibly squash the commits and rebase on main? It's quite hard to review at present due to the number of commits.

@dyxushuai (Author) commented Feb 23, 2024

@dyxushuai Sorry for the long silence. I'll be able to review this over the weekend.

Could you possibly squash the commits and rebase on main? It's quite hard to review at present due to the number of commits.

Of course. Sorry for the delay; I took a long vacation during Chinese New Year. Currently, I am analyzing the performance issue, and this is not the final version of my PR.

@overlookmotel (Collaborator) commented Feb 23, 2024

No worries at all. If you'd prefer to do more work before rebasing, by all means do that. It's just that I have some spare time this weekend, and it feels like a good chance to take a proper look at this, even if it's not final. I might have some ideas about the performance issue too, if that would be helpful.

@Boshen Boshen marked this pull request as draft February 27, 2024 11:07