
feat(lexer): add SIMD optimization to the lexer #2338

Draft · wants to merge 58 commits into base: main
Conversation

@dyxushuai commented Feb 7, 2024

Related: #2296

Summary

This implementation can match multiple delimiters (0..256) across multiple bytes at once. For example, on a CPU that supports the AVX2 instruction set, if we have 10 delimiters to match, we can check 32 bytes against all 10 delimiters in a single pass. (A rough sketch of this idea is given after the lists below.)

Supported features:

  • AVX2 (256-bit) on x86
  • SSE4.2 (128-bit) on x86
  • NEON (128-bit) on aarch64

Supported pattern

Tricky issues:

  • Use u8 instead of char with SIMD.
    • Refactor the Source API.
    • Refactor the character-searching methods.
  • Pad the input up to the SIMD instruction width when the remaining length is not enough.
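
To make the approach concrete, here is a minimal sketch (not the PR's actual code) of matching a delimiter set against one 32-byte chunk with AVX2; the function name and calling convention are hypothetical:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Returns the index of the first delimiter in `chunk`, if any.
/// Hypothetical illustration; the PR derives its delimiter sets from match tables.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_first_delimiter(chunk: &[u8; 32], delimiters: &[u8]) -> Option<usize> {
    // Load 32 input bytes (unaligned load is fine).
    let data = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
    // Accumulate matches across all delimiters.
    let mut matches = _mm256_setzero_si256();
    for &d in delimiters {
        // Broadcast the delimiter to all 32 lanes and compare byte-wise.
        let needle = _mm256_set1_epi8(d as i8);
        matches = _mm256_or_si256(matches, _mm256_cmpeq_epi8(data, needle));
    }
    // One bit per byte: bit i is set if chunk[i] matched any delimiter.
    let mask = _mm256_movemask_epi8(matches) as u32;
    if mask == 0 { None } else { Some(mask.trailing_zeros() as usize) }
}

The broadcast / compare / OR / movemask structure is the core pattern; how the delimiter set is encoded and how results feed back into the lexer are where the real implementation differs.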

@github-actions github-actions bot added the A-parser Area - Parser label Feb 7, 2024
@dyxushuai dyxushuai marked this pull request as draft February 7, 2024 10:33
@dyxushuai (Author)

+r @Boshen @overlookmotel
Please feel free to give any suggestions.

@overlookmotel (Collaborator) commented Feb 9, 2024

Hi @dyxushuai.

I'm afraid I'm having difficulty reviewing this, as it's quite complex and you're covering all the bases in one go. Can I suggest a few things which would make it simpler to understand:

  1. For now, only implement one arch (I suggest x86_64 SSE/AVX2, as it'll then run on CI).
  2. Check for CPU support at compile time only. Only add the complications for checking at runtime once the main impl is solid.
  3. Use the new additions to Source / SourcePosition APIs which were added in last 24 hours.
  4. Only optimize the main loop. There's now a more efficient scalar version implemented with a batch-search / byte-by-byte fallback structure. Hopefully this will be easier to build on top of. And once some tweaks to it in perf(parser): optimize lexing strings #2366 are merged, it should not change for a while.
  5. Comment the code more fully. e.g. What is arf and where do the values in it come from? What does each __mm256 instruction do?

Ideally, CI would also pass. Though if that's difficult, feel free to ask for review again without that.

@dyxushuai (Author) commented Feb 11, 2024

I am on vacation and have been offline for a while.

  • For now, only implement one arch (I suggest x86_64 SSE/AVX2, as it'll then run on CI).

Yes, and I will add more test cases to my implementation.

  • Check for CPU support at compile time only. Only add the complications for checking at runtime once the main impl is solid.

Sure. I'll always try to minimize overhead as much as possible.
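
For context, a minimal sketch of the difference between the two approaches; match_chunk_avx2 and match_chunk_scalar are hypothetical stand-ins, not functions from this PR:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn match_chunk_avx2(chunk: &[u8; 32]) -> u32 {
    // Real AVX2 code would go here; the sketch just reuses the scalar model.
    match_chunk_scalar(chunk)
}

fn match_chunk_scalar(chunk: &[u8; 32]) -> u32 {
    // Bit i is set if chunk[i] is a delimiter (whitespace here, as a stand-in).
    chunk.iter().enumerate().fold(0u32, |mask, (i, &b)| {
        if b == b' ' || b == b'\t' { mask | (1 << i) } else { mask }
    })
}

// Compile-time selection: resolved by RUSTFLAGS='-C target-feature=+avx2'.
// No per-call overhead, but the resulting binary requires AVX2.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
fn match_chunk(chunk: &[u8; 32]) -> u32 {
    unsafe { match_chunk_avx2(chunk) }
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
fn match_chunk(chunk: &[u8; 32]) -> u32 {
    match_chunk_scalar(chunk)
}

// Runtime selection: one portable binary, but the detection (or a cached
// function pointer) is exactly the complication being deferred for now.
fn match_chunk_dynamic(chunk: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { match_chunk_avx2(chunk) };
        }
    }
    match_chunk_scalar(chunk)
}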

  • Use the new additions to Source / SourcePosition APIs which were added in last 24 hours.

Cool

  • Only optimize the main loop. There's now a more efficient scalar version implemented with a batch-search / byte-by-byte fallback structure. Hopefully this will be easier to build on top of. And once some tweaks to it in perf(parser): optimize lexing strings #2366 are merged, it should not change for a while.

Ok.

  • Comment the code more fully. e.g. What is arf and where do the values in it come from? What does each __mm256 instruction do?

I will explain the instructions as much as possible. More details here: seanmonstar/httparse#89 (comment)

@Brooooooklyn

Are you planning to implement WASM SIMD128 in this pull request?

@dyxushuai (Author)

All tests passed; next I will add a benchmark comparison.

@dyxushuai (Author) commented Feb 13, 2024

Currently, we only support SIMD matching for ASCII codes (< 128) because NEON supports only 6x16 vectors. If we aimed to handle all u8::MAX bytes at once, we would need at least 16x16 vector support. That said, in NEON we can split the lookups into two parts, table_lo and table_hi. As a result, some match tables cannot currently be accelerated with SIMD. For example:

static LINE_BREAK_TABLE: SafeByteMatchTable =
    safe_byte_match_table!(|b| matches!(b, b'\r' | b'\n' | LS_OR_PS_FIRST));
static MULTILINE_COMMENT_START_TABLE: SafeByteMatchTable =
    safe_byte_match_table!(|b| matches!(b, b'*' | b'\r' | b'\n' | LS_OR_PS_FIRST));

LS_OR_PS_FIRST is not an ASCII code (it is 226), so perhaps we can split this into two matches: b'*' | b'\r' | b'\n' on the hot path and LS_OR_PS_FIRST on a cold path.
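
For readers without the context: the usual SIMD byte-set match splits each byte into its low and high nibble and looks both up in 16-entry tables, ANDing the results. That is why an ASCII-only set (high nibble 0..=7) fits into one 8-bit mask per entry, while a byte like 226 (high nibble 14) does not. A scalar model of the trick, with hypothetical helper names:

/// Build the two 16-entry nibble tables for an ASCII-only byte set.
/// Returns None if the set contains a non-ASCII byte (it would need a bit >= 8).
fn build_nibble_tables(set: &[u8]) -> Option<([u8; 16], [u8; 16])> {
    let (mut table_lo, mut table_hi) = ([0u8; 16], [0u8; 16]);
    for &b in set {
        if b >= 128 {
            return None; // e.g. LS_OR_PS_FIRST (226) cannot be encoded this way
        }
        let bit = 1u8 << (b >> 4); // high nibble 0..=7 maps to one of 8 bits
        table_lo[(b & 0x0F) as usize] |= bit;
        table_hi[(b >> 4) as usize] |= bit;
    }
    Some((table_lo, table_hi))
}

/// A byte matches iff its lo-nibble entry and hi-nibble entry share a bit.
/// With SIMD this becomes two table lookups (pshufb / vqtbl1q_u8) plus an AND,
/// applied to 16 or 32 bytes at a time.
fn is_match(b: u8, table_lo: &[u8; 16], table_hi: &[u8; 16]) -> bool {
    (table_lo[(b & 0x0F) as usize] & table_hi[(b >> 4) as usize]) != 0
}

fn main() {
    let (lo, hi) = build_nibble_tables(&[b'*', b'\r', b'\n']).unwrap();
    assert!(is_match(b'\n', &lo, &hi));
    assert!(!is_match(b'x', &lo, &hi));
    assert!(build_nibble_tables(&[b'*', b'\r', b'\n', 226]).is_none());
}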

@dyxushuai (Author) commented Feb 13, 2024

@overlookmotel @Brooooooklyn @Boshen

Bench parser

For now, we only support parsing String, Whitespace, and Identifiers with SIMD, and the performance of parsing the small file RadixUIAdoptionSection.jsx shows some degradation.

Platforms

  • x86_64: Ryzen 7840HS
  • aarch64: Apple M1 Pro

Commands

  • baseline: cargo bench -p oxc_benchmark --bench parser -- --save-baseline 1
  • AVX2: RUSTFLAGS='-C target-feature=+avx2' cargo bench -p oxc_benchmark --bench parser -- --baseline 1
  • SSE4.2: RUSTFLAGS='-C target-feature=+sse4.2' cargo bench -p oxc_benchmark --bench parser -- --baseline 1

With AVX2

parser/checker.ts       time:   [13.561 ms 13.583 ms 13.607 ms]
                        change: [-7.8967% -7.7080% -7.5257%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) high mild
parser/cal.com.tsx      time:   [6.2570 ms 6.2633 ms 6.2704 ms]
                        change: [-4.9506% -4.8115% -4.6668%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
parser/RadixUIAdoptionSection.jsx
                        time:   [13.837 µs 13.862 µs 13.890 µs]
                        change: [+1.7289% +1.9880% +2.2377%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
parser/pdf.mjs          time:   [4.4353 ms 4.4394 ms 4.4440 ms]
                        change: [-5.4993% -5.3649% -5.2352%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
parser/antd.js          time:   [27.754 ms 27.782 ms 27.812 ms]
                        change: [-4.6780% -4.5492% -4.4233%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe

With SSE4.2

parser/checker.ts       time:   [14.199 ms 14.216 ms 14.234 ms]
                        change: [-5.4070% -5.2090% -5.0262%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
parser/cal.com.tsx      time:   [6.3727 ms 6.3808 ms 6.3898 ms]
                        change: [-3.4983% -3.3151% -3.1209%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
parser/RadixUIAdoptionSection.jsx
                        time:   [13.954 µs 13.994 µs 14.037 µs]
                        change: [+1.6748% +1.9549% +2.2534%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
parser/pdf.mjs          time:   [4.5373 ms 4.5431 ms 4.5494 ms]
                        change: [-3.8748% -3.6909% -3.4907%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  8 (8.00%) high mild
  1 (1.00%) high severe
parser/antd.js          time:   [28.388 ms 28.422 ms 28.458 ms]
                        change: [-4.3719% -4.1995% -4.0308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

With NEON

TODO

@overlookmotel (Collaborator) commented Feb 13, 2024

@dyxushuai This is looking really great. Like I said, I won't have time to get into it properly this week. But at a first glance it's shaping up really nicely. Can't wait to see the benchmarks.

A couple of points for OXC maintainers:

Is there any reason not to run CI and benchmarks on this now? I don't want to press the button myself, since I'm not sure why it's not automatic anyway.

Regardless of the results and the quality of the implementation, I don't think this is safe to merge yet. It introduces different implementations for different processors, and to make sure nothing breaks or regresses (either now or in the future), it'd be ideal to have tests, conformance, and benchmarks run on all of these targets (as discussed in #2285).

Either we'd need to get all that in place before merging, or possibly the parts which target the architectures which are already tested on CI could be merged sooner, and the rest later (as long as the scalar fallbacks which run on other archs for now are tested+benched too).

This isn't a negative on your fine work, dyxushuai; it's just that this is a major change, and I think we need to tread carefully.

@dyxushuai (Author)

Are you planning to implement WASM SIMD128 in this pull request?

I see that WASM has enough SIMD instructions to implement the same lookup-table algorithm as the other platforms: https://doc.rust-lang.org/beta/core/arch/wasm/fn.i8x16_shuffle.html

@Boshen (Member) commented Feb 13, 2024

@dyxushuai Try https://github.com/BurntSushi/critcmp


This whole PR is shaping up really nicely. I think the hardest part for us maintainers is how we continuously integrate SIMD into the codebase; let me take a few days to think about this.


I'd like to invite some SIMD experts to review this PR. Is it possible to document the entry functions so that anyone can read the code from top to bottom without knowing any context?

@Boshen (Member) commented Feb 13, 2024

@dyxushuai If this is fun for you, and if you are willing to maintain a separate crate for all the SIMD stuff inside the oxc project, I'll give you all the correct permissions so we can set up proper test infrastructure in another GitHub repository.

codspeed-hq bot commented Feb 13, 2024

CodSpeed Performance Report

Merging #2338 will degrade performances by 35.07%

Comparing dyxushuai:feat/simd_in_lexer (03ee45d) with main (83cb78f)

Summary

❌ 15 regressions
✅ 12 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main dyxushuai:feat/simd_in_lexer Change
lexer[RadixUIAdoptionSection.jsx] 118.1 µs 163.5 µs -27.73%
lexer[antd.js] 127.4 ms 185.7 ms -31.4%
lexer[cal.com.tsx] 30.9 ms 47.2 ms -34.57%
lexer[checker.ts] 69.8 ms 107.5 ms -35.07%
lexer[pdf.mjs] 19.4 ms 29.1 ms -33.47%
minifier[react.development.js] 10.3 ms 11.2 ms -7.86%
minifier[typescript.js] 1.5 s 1.7 s -7.78%
parser[RadixUIAdoptionSection.jsx] 386 µs 407.6 µs -5.29%
parser[antd.js] 664.9 ms 722.4 ms -7.96%
parser[cal.com.tsx] 149.7 ms 164.2 ms -8.83%
parser[checker.ts] 340.8 ms 378.4 ms -9.95%
parser[pdf.mjs] 107.3 ms 117.1 ms -8.33%
transformer[antd.js] 1.6 s 1.7 s -3.46%
transformer[checker.ts] 1.1 s 1.1 s -3.45%
transformer[pdf.mjs] 283.9 ms 293.7 ms -3.32%

@dyxushuai (Author)

@dyxushuai If this is fun for you, and if you are willing to maintain a separate crate for all the SIMD stuff inside the oxc project, I'll give you all the correct permissions so we can set up proper test infrastructure in another GitHub repository.

Cool, I want to try.

@Boshen (Member) commented Feb 13, 2024

@dyxushuai https://github.com/oxc-project/oxc-simd/invitations

@dyxushuai (Author) commented Feb 14, 2024

CodSpeed Performance Report

Merging #2338 will degrade performances by 35.07%

❌ 15 regressions ✅ 12 untouched benchmarks

I have identified the root cause: padding was applied to buffers that did not have enough length for ALIGNMENT, and then more padding was repeatedly added to the end whenever the remaining input had not been fully consumed.

(screenshot)

I will fall back to using the SafeByteMatchTable without SIMD.
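
As an illustration of the direction (an assumption about the intended structure, not this PR's actual code), the tail can be padded once into a stack buffer instead of being re-padded on every read:

const ALIGNMENT: usize = 32; // e.g. the AVX2 register width

/// Walk the input in ALIGNMENT-sized chunks; only the final partial chunk is
/// copied into a zero-padded buffer, and only once. Scalar stand-in with
/// hypothetical names; `process` models the SIMD chunk routine.
fn for_each_chunk(input: &[u8], mut process: impl FnMut(&[u8; ALIGNMENT])) {
    let mut chunks = input.chunks_exact(ALIGNMENT);
    for chunk in &mut chunks {
        process(chunk.try_into().unwrap());
    }
    let tail = chunks.remainder();
    if !tail.is_empty() {
        // The padding byte (0 here) must either never match the table, or the
        // caller must ignore match bits at positions >= tail.len().
        let mut padded = [0u8; ALIGNMENT];
        padded[..tail.len()].copy_from_slice(tail);
        process(&padded);
    }
}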

@overlookmotel (Collaborator)

@dyxushuai Sorry for the long silence. I'll be able to review this over the weekend.

Could you possibly squash the commits and rebase on main? It's quite hard to review at present due to the number of commits.

@dyxushuai (Author) commented Feb 23, 2024

@dyxushuai Sorry for the long silence. I'll be able to review this over the weekend.

Could you possibly squash the commits and rebase on main? It's quite hard to review at present due to the number of commits.

Of course. Sorry for the delay; I took a long vacation during Chinese New Year. Currently, I am analyzing the performance issue, and this is not the final version of my PR.

@overlookmotel (Collaborator) commented Feb 23, 2024

No worries at all. If you'd prefer to do more work before rebasing, by all means do that. It's just that I have some spare time this weekend, and it feels like a good chance to take a proper look at this, even if it's not final. I might have some ideas about the performance issue too, if that would be helpful.

@Boshen Boshen marked this pull request as draft February 27, 2024 11:07