
SIMD in the lexer #2296

Open
Boshen opened this issue Feb 4, 2024 · 31 comments

@Boshen
Member

Boshen commented Feb 4, 2024

As discussed in #2285, we'd like to embark on the journey of SIMD.

The current state of SIMD in Rust is workable; successful projects include https://github.com/rusticstuff/simdutf8 and https://github.com/BurntSushi/memchr.

portable-simd is not an option due to its instability.


In this task, you will implement one of the tokenization methods for the following tokens in the lexer.

Specification of these tokens can be found at https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar

To get things working, you only need to implement it for the architecture of your development machine.

Some of these are easy; @overlookmotel drafted some algorithms:

Multi-line comments:

  • First, search for */, \r, \n, or 0xE2 (the 1st byte of either LS or PS).
  • Once a line break is found, search only for */ or 0xE2 (it no longer matters whether there are further line breaks).
  • Put handling of 0xE2 on a #[cold] branch, as irregular line breaks (and other Unicode characters starting with 0xE2) are rare.
  • Whether all other characters are ASCII or Unicode is irrelevant; no need to check.
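That multi-line comment scan can be sketched in scalar form (illustrative only; a real version would chew through bytes in SIMD-width chunks and put the 0xE2 handling on a #[cold] path, and `scan_multiline_comment` is a made-up name):

```rust
/// Scalar sketch of the multi-line comment scan described above.
/// Input is the bytes following the opening `/*`. Returns the offset just
/// past the closing `*/`, plus whether a line break was seen, or None if
/// the comment is unterminated.
fn scan_multiline_comment(bytes: &[u8]) -> Option<(usize, bool)> {
    let mut newline = false;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'*' if bytes.get(i + 1) == Some(&b'/') => return Some((i + 2, newline)),
            b'\r' | b'\n' => {
                newline = true;
                i += 1;
            }
            // 0xE2 is the first byte of LS (U+2028) and PS (U+2029).
            0xE2 => {
                if let Some(&[0x80, 0xA8 | 0xA9]) = bytes.get(i + 1..i + 3) {
                    newline = true;
                    i += 3;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    None // unterminated comment
}
```

(For simplicity this keeps checking for \r and \n after the first line break, where the algorithm above would stop looking for them.)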

Whitespace:

  • Search for the 1st byte which is ASCII but not a space or \t, or which is >= 128 (Unicode).
  • Handle Unicode in a #[cold] branch which checks whether the char is irregular whitespace or not.

i.e. in both, assume input will be ASCII and chew through it as fast as possible. De-opt to a slow path for Unicode cases, which do need to be handled but should be rare in practice.
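That ASCII-fast/Unicode-slow split might look like this (a sketch only; function names and the partial irregular-whitespace set are illustrative, not oxc's actual code):

```rust
/// Skip whitespace assuming ASCII input, de-opting to a #[cold] function
/// for bytes >= 128 so the compiler optimizes for the common path.
fn skip_whitespace(bytes: &[u8]) -> usize {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b' ' | b'\t' => i += 1,
            b if b >= 128 => match irregular_whitespace_len(&bytes[i..]) {
                Some(n) => i += n,
                None => break,
            },
            _ => break, // non-whitespace ASCII: token boundary
        }
    }
    i
}

#[cold]
#[inline(never)]
fn irregular_whitespace_len(bytes: &[u8]) -> Option<usize> {
    // Decode one char and check it against the irregular-whitespace set
    // (only a partial set shown; the full set is in the ECMAScript spec).
    let c = std::str::from_utf8(bytes).ok()?.chars().next()?;
    matches!(c, '\u{00A0}' | '\u{1680}' | '\u{FEFF}').then(|| c.len_utf8())
}
```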


Please leave a comment if you'd like to participate.

@dyxushuai

I want to give it a try

@Boshen
Member Author

Boshen commented Feb 4, 2024

I want to give it a try

Go ahead! Please report progress so other contributors don't end up doing the same thing, i.e. which routine and which arch are you targeting?

@overlookmotel
Collaborator

Thanks for opening this Boshen.

Could people please hold off on tackling identifiers for now? I have a major refactor of lexing identifiers underway which is designed to make it much easier to bring in SIMD later. Should have that done within a week.

A couple of thoughts for anyone who's interested in doing this:

Most of these tasks relate to searching through a block of bytes to find one of a set of "end characters". The "pshufb" instruction (or I think the ARM equivalent is "tbl") I believe will be most efficient for this. I found this article useful on the subject:

http://0x80.pl/articles/simd-byte-lookup.html
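The core trick in that article is a 16-entry nibble lookup: a table indexed by a byte's low nibble holds a bitmask of which high nibbles occur in the needle set. Here is a scalar model of the simplest variant (ASCII needles only); the SIMD version performs 16 or 32 such table lookups in a single pshufb/tbl instruction:

```rust
/// Scalar model of the SIMD byte-set lookup from the 0x80.pl article.
/// Each table entry, indexed by a byte's low nibble, is a bitmask of the
/// high nibbles present in the needle set.
fn build_table(needles: &[u8]) -> [u8; 16] {
    let mut table = [0u8; 16];
    for &b in needles {
        assert!(b < 0x80, "this simple variant only handles ASCII needles");
        table[(b & 0x0F) as usize] |= 1 << (b >> 4);
    }
    table
}

/// A byte is in the set iff the bit for its high nibble is set in the
/// table entry for its low nibble.
fn in_set(table: &[u8; 16], b: u8) -> bool {
    b < 0x80 && table[(b & 0x0F) as usize] & (1 << (b >> 4)) != 0
}
```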

Secondly, please see #2288. Hopefully that'll be merged soon, and it adds APIs to the lexer for consuming the source text as bytes instead of chars. That will probably be useful for these tasks.

@dyxushuai

dyxushuai commented Feb 5, 2024

I want to give it a try

Go ahead! Please report progress so other contributors don't end up doing the same thing, i.e. which routine and which arch are you targeting?

I'll begin with identifiers, for both ARM and x86 platforms.

@overlookmotel
Collaborator

@dyxushuai Please see my comment above. I suggest it'd be better to start with one of the others - not identifiers.

I have a major overhaul of lexing identifiers in the works, which I expect to turn into a PR within a week. Currently it's on my fork of OXC: overlookmotel#36

@dyxushuai

dyxushuai commented Feb 5, 2024

@dyxushuai Please see my comment above. I suggest it'd be better to start with one of the others - not identifiers.

I have a major overhaul of lexing identifiers in the works, which I expect to turn into a PR within a week. Currently it's on my fork of OXC: overlookmotel#36

@overlookmotel Sorry, I didn't pay close attention to the details of your PR. I'll start with the second one, strings, and keep an eye on your PR as well. CC @Boshen

FYI: There are already some good implementations for quickly searching characters, like 1, 2

Footnotes

  1. https://github.com/seanmonstar/httparse/tree/master/src/simd

  2. https://github.com/rust-lang/hashbrown/tree/master/src/raw

@overlookmotel
Collaborator

Thanks. And thanks for the links. Very interesting.

httparse looks like it's a practical implementation of the "pshufb" lookup table technique discussed in the article I mentioned earlier (certainly the SSE4.2 implementation, which is using _mm_shuffle_epi8). Really great to see an actual Rust implementation of that.

My PR for identifiers is a total mess at present, so not worth reading really. But point is that it completely churns up the code that's in main branch at present, so it'd be difficult for us both to work on that area simultaneously.

By the way, in case you didn't notice, #2288 was merged today. That provides some more efficient APIs for reading source as bytes, which may come in useful.

@overlookmotel
Collaborator

@sno2 also seems to be working in this area, though his PR #2327 uses memchr (which uses SIMD internally) rather than a hand-coded implementation.

@overlookmotel
Collaborator

overlookmotel commented Feb 5, 2024

PS: One thing I missed in the algorithms quoted above:

0x0B and 0x0C (irregular whitespace) also need to be handled to create trivia for them. But, again, #[cold] path.

@dyxushuai

dyxushuai commented Feb 6, 2024

httparse looks like it's a practical implementation of the "pshufd" lookup table technique discussed in article I mentioned earlier (certainly the SSE4.2 implementation, which is using _mm_shuffle_epi8). Really great to see an actual Rust implementation of that.

httparse also makes a very interesting small optimization, roughly (pseudocode):

while bytes.len() >= 32 {
    // use AVX2 to process 32 bytes in parallel
    bytes.advance(32);
}
// fall back to SSE4.2
while bytes.len() >= 16 {
    // use SSE4.2 to process 16 bytes in parallel
    bytes.advance(16);
}
// fall back to register-width parallel processing
while bytes.len() >= core::mem::size_of::<usize>() {
    // process one register-sized word at a time
    bytes.advance(core::mem::size_of::<usize>());
}
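The final register-width tier ("SWAR") needs no intrinsics at all and is fully portable. A sketch of the classic zero-byte trick, here with u64 words (illustrative, not httparse's actual code):

```rust
/// SWAR: find the first occurrence of `needle` by scanning one u64-sized
/// word at a time, using the classic "has zero byte" bit trick.
fn find_byte_swar(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pattern = u64::from_ne_bytes([needle; 8]); // needle in every byte

    let mut i = 0;
    while i + 8 <= haystack.len() {
        let word = u64::from_ne_bytes(haystack[i..i + 8].try_into().unwrap());
        // XOR zeroes the matching bytes; the expression below sets the high
        // bit of every byte that was zero (false positives can occur in
        // bytes above the first match, which the rescan below tolerates).
        let x = word ^ pattern;
        if x.wrapping_sub(LO) & !x & HI != 0 {
            // A match is somewhere in this word; pin it down scalar-wise.
            return haystack[i..i + 8].iter().position(|&b| b == needle).map(|p| i + p);
        }
        i += 8;
    }
    // Scalar tail for the final < 8 bytes.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```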

@overlookmotel
Collaborator

overlookmotel commented Feb 6, 2024

httparse also makes a very interesting small optimization, roughly (pseudocode)

That's really smart!

I have also been wondering whether we could do something like what ratel does. At the start of parsing, it copies the entire source text into a buffer that's 1 byte longer than the source, and adds a null byte on the end to be a sentinel for EOF.

Then the lexer never has to bounds-check "have I reached the end?" before reading. As long as the EOF sentinel has not been read yet, then it's not at the end.

We could extend this approach: Instead of adding a single null byte to the end, add 32 x null bytes. Then when iterating through the source, there's always guaranteed to be enough bytes remaining to do a 32-byte read without any bounds check. So then no need for specialised handling of "less than 32, but at least 16 bytes remaining". There's always 32.

Side note: I'm not sure why ratel uses 0 as the sentinel. 0xFF could be better, as it cannot appear in valid UTF-8 strings, so it's an unambiguous sentinel value.
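A sketch of the padded-buffer idea, using 0xFF as the sentinel per the side note (a hypothetical type, not ratel's or oxc's actual API):

```rust
const PADDING: usize = 32;
const SENTINEL: u8 = 0xFF; // can never appear in valid UTF-8

/// Source text copied into a buffer with 32 sentinel bytes appended, so a
/// 32-byte read from any in-bounds position never needs a bounds check.
struct PaddedSource {
    buf: Vec<u8>,
    len: usize, // length of the real source, excluding padding
}

impl PaddedSource {
    fn new(source: &str) -> Self {
        let len = source.len();
        let mut buf = Vec::with_capacity(len + PADDING);
        buf.extend_from_slice(source.as_bytes());
        buf.extend(std::iter::repeat(SENTINEL).take(PADDING));
        PaddedSource { buf, len }
    }

    /// A 32-byte chunk starting at `pos`; in bounds for any pos <= len,
    /// so the slice below can never panic.
    fn chunk_at(&self, pos: usize) -> &[u8; PADDING] {
        debug_assert!(pos <= self.len);
        self.buf[pos..pos + PADDING].try_into().unwrap()
    }
}
```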

My questions about this approach are:

  1. Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
  2. Could we add an entry point to the parser which takes a mutable String as source text? If that String already has 32 bytes of excess capacity, we could apply this approach without the cost of copying the source text.
  3. For &str input, could we use page table tricks like slice_deque does to create an extended buffer, while only copying the last chunk of the source?
  4. Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

For me at least, (4) is the biggest question mark.

Anyway... perhaps that's a side-track, but just wanted to mention it. The relevant parts of ratel, if you're interested:

https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs#L630-L653
https://github.com/ratel-rust/toolshed/blob/master/src/arena.rs#L266-L280

@dyxushuai

add 32 x null bytes

Do you mean adding padding bytes at the end of the source text, and making its length an exact multiple of 32?

@dyxushuai

I'm not sure why ratel uses 0 as the sentinel.

I think it's a habit that comes from C++.

@overlookmotel
Collaborator

Do you mean adding padding bytes at the end of the source text, and making its length an exact multiple of 32?

Yes, exactly that. Except the padded length doesn't need to be a multiple of 32; it just needs to be true that:

  1. Wherever you are in the source, there are always at least 32 bytes more (of either source, or padding).

If you want to do aligned reads, you could add a bit more padding, to also guarantee:

  1. Wherever you are in source, if you round up the current pointer to next 32-byte aligned position, there are still at least 32 bytes more.

@Boshen
Member Author

Boshen commented Feb 6, 2024

Previous attempt on padding the source:


http://0x80.pl/articles/simd-byte-lookup.html

I can help contact Wojciech Muła on twitter if we have questions to ask.

@overlookmotel
Collaborator

overlookmotel commented Feb 6, 2024

Thanks Boshen. Good to know this has been tried before. I'll read through those issues. May be a dead end, but maybe it could make more sense now, if it improves potential for SIMD.

@Boshen
Member Author

Boshen commented Feb 6, 2024

Yeah it makes more sense now, we can give it another try.

@dyxushuai

@Boshen @overlookmotel Do you have a benchmark setup or framework?

@Boshen
Member Author

Boshen commented Feb 6, 2024

@Boshen @overlookmotel Do you have a benchmark setup or framework?

The current setup is only observable with CodSpeed on GitHub Linux machines.

I think we need another github repository for a setup similar to https://github.com/BurntSushi/memchr/blob/master/.github/workflows/ci.yml

We also need to use criterion instead of CodSpeed, because CodSpeed only works on Linux. I used to do the benchmark diff like this:

- name: Run Bench on PR Branch
  run: cargo benchmark --save-baseline pr
- name: Checkout Main Branch
  uses: actions/checkout@v3
  with:
    clean: false
    ref: main
- name: Install Rust Toolchain
  uses: ./.github/actions/rustup
- name: Compile
  run: cargo build --release -p oxc_benchmark
- name: Run Bench on Main Branch
  run: cargo benchmark --save-baseline main

@sno2
Contributor

sno2 commented Feb 8, 2024

I figured we should move conversations from #2327 here. @overlookmotel:

I'm currently working on a rewrite of lexing identifiers. It doesn't use SIMD, but is designed to be easily convertible to SIMD later: https://github.com/overlookmotel/oxc/blob/lexer-string5/crates/oxc_parser/src/lexer/identifier.rs
(please excuse the messy code, still very much a work in progress)

You're right about having to bust out to scalar if Unicode chars are found, but the approach I've taken is to speed through ASCII as fast as possible and make Unicode a de-opt (on a #[cold] path, so the compiler assumes it won't happen in most cases). This approach seems to work: it produces about a 60% speed-up on the lexer benchmarks. Obviously using SIMD would be a further speed boost.

This seems pretty good. I have been playing around with implementing various SIMD and parsing optimizations for the skip_single_line_comment algorithm.

This uses some interesting code design, with help from orlp in the Rust Discord, to avoid panic checking and let Rust optimize the assembly for the loops instead of dancing with raw pointers. Furthermore, it implements the SIMD-based shuffling and naive compare methods for 16-byte and 32-byte widths.

Interestingly enough, the most performant algorithms on my machine have been (1x) simd_naive, (1.34x ± 0.12) memchr, and (1.73x ± 0.09) simd_shuffle. This was a benchmark that just went through the comment bodies scraped from the React development file's source. My machine also does not have avx2, so I am unable to run the wide variants and I would expect the shuffle variant to beat the naive versions once we have more bytes to search.

Hopefully these same approaches can be adapted into the various parsing algorithms in oxc.

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

Nice! I'm heading for bed now so will have to look at that tomorrow (ditto dyxushuai's draft PR #2338).

In meantime, just to share where I'm up to with identifiers as it overlaps with this effort:

I've now refactored byte searching into a generic "byte table search" macro. My intent is this same macro can be used for all other searches discussed here (except for multi-line comments which are different as you're looking for */, not just a single byte).

I've used a macro rather than a function, as I found it's faster if all the code is in one function together (even #[inline(always)] is a few % slower).

The macro already searches in 32-byte blocks, so I was wondering if, once it's merged, that macro could be SIMD-ized, by replacing the main "check a block of 32" loop with a SIMD implementation? Ditto ByteMatchTable - could it also construct the lookup arrays for SIMD shuffles at compile time?

If either of you have any suggestions of how to make the macro more amenable to that, please do let me know.

Macro: lexer/search.rs
Usage of macro for identifiers: lexer/identifier.rs#L49-L104

@dyxushuai

dyxushuai commented Feb 8, 2024

Nice! I'm heading for bed now so will have to look at that tomorrow (ditto dyxushuai's draft PR #2338).

Thanks, this draft is still in its early stages. There are some major refactorings that need to be done, such as using u8 (ascii) as the delimiter type instead of char for SIMD-based character searching and so on.

@overlookmotel
Collaborator

@sno2 I couldn't resist taking a look at your experiment (bedtime reading).

A few questions:

(1x) simd_naive, (1.34x ± 0.12) memchr, and (1.73x ± 0.09) simd_shuffle. I would expect the shuffle variant to beat the naive versions once we have more bytes to search.

Are you saying simd_naive was the fastest? Or slowest?

And how does SIMD compare to scalar? Does it move the needle significantly?

Out of interest, how do scalar and scalar_lookup compare?

This was a benchmark that just went through the comment bodies scraped from the React development file's source.

If you were testing the string of each comment individually, I'm not sure it'd be very representative. When parsing a large JS file, you will very rarely hit the "not enough bytes left, fallback to scalar" path. Whereas if you're feeding it each comment as an individual string, it'd hit that slow path every time for the end of the comment. (I may well have misunderstood here and that's not what you're doing)

Another thing... You might find there's a speed-up from marking the branches for 0xE2 as #[cold]. i.e. instead of:

if rest[offset_in_chunk] == XX {
  index += skip_single_line_comment_scalar(&rest[offset_in_chunk..]);
}

do:

if rest[offset_in_chunk] == XX {
  #[cold]
  fn handle_e2(rest: &[u8], offset_in_chunk: usize, index: usize) -> usize {
    index + skip_single_line_comment_scalar(&rest[offset_in_chunk..])
  }
  return handle_e2(rest, offset_in_chunk, index);
}

We have to handle irregular whitespace for the linter, but in practice don't actually expect to find any 99% of the time, and the above communicates that to the compiler. Ugly as hell, but I've had success with it elsewhere in OXC.

Lastly, in the scalar lookup version, I think this:

match SCALAR_LOOKUP[byte as usize] {
  1 => return idx,
  2 => match &string[idx + 1..] {
    &[0x80, 0xA8 | 0xA9, ..] => return idx,
    _ => {}
  },
  _ => {}
}

could be:

match SCALAR_LOOKUP[byte as usize] {
  0 => {},
  1 => return idx,
  // This is UTF-8 so guaranteed 2 more bytes after 0xE2
  2 => match unsafe { string.get_unchecked(idx + 1..idx + 3) } {
    &[0x80, 0xA8 | 0xA9] => return idx,
    _ => {}
  },
  // This may allow compiler to turn this match into a jump table
  // (which may or may not be faster)
  _ => unsafe { std::hint::unreachable_unchecked() }
}

or again, use #[cold] on the match arm for option 2.

@overlookmotel
Collaborator

One more thing (sorry):

let offset_in_chunk = mask.trailing_zeros() as usize;
assert!(offset_in_chunk < 16); // eliminate bound checks
index += offset_in_chunk;
if rest[offset_in_chunk] == XX {

How does the assertion eliminate the bounds check? Doesn't it just turn into a branch with a panic on it?

@sno2
Contributor

sno2 commented Feb 8, 2024

Are you saying simd_naive was the fastest? Or slowest?

And how does SIMD compare to scalar? Does it move the needle significantly?

Out of interest, how do scalar and scalar_lookup compare?

simd_naive was the fastest. SIMD was around 3-5x faster than scalar. scalar_lookup was consistently a bit faster than scalar. I will get the CI/CD benchmarking thing working with more test cases so that we can see fair results.

If you were testing the string of each comment individually, I'm not sure it'd be very representative. When parsing a large JS file, you will very rarely hit the "not enough bytes left, fallback to scalar" path. Whereas if you're feeding it each comment as an individual string, it'd hit that slow path every time for the end of the comment. (I may well have misunderstood here and that's not what you're doing)

It goes through the source as a whole (like an iterator), so no.

Another thing... You might find there's a speed-up from marking the branches for 0xE2 as #[cold]. i.e. instead of:

I was unable to reproduce a speedup on my machine, so I took that out. More accurate benchmarking will allow me to see better how it affects the code.

Lastly, in the scalar lookup version, I think this:

Nice, I was also able to get the lookup table optimization for the match by using an enum as the values. Sadly, I think we do need unsafe for the faster 2-byte index equality, though.

https://godbolt.org/z/d6f3xd3fK

How does the assertion eliminate the bounds check? Doesn't it just turn into a branch with a panic on it?

I don't know exactly how, but I think the optimizer propagates facts backwards whenever you assert something such as an integer range check. Then it notices that rest must contain at least 16 bytes from the rest.len() >= 16 check.
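A minimal illustration of the pattern (a hypothetical function; whether the bounds check is really eliminated is best confirmed by inspecting the assembly on godbolt):

```rust
/// The up-front assertions give the optimizer range facts (rest.len() >= 16
/// and offset < 16), so the slice index below can be proven in bounds and
/// LLVM can drop its per-access bounds-check branch.
fn byte_at_offset(rest: &[u8], offset: usize) -> u8 {
    assert!(rest.len() >= 16);
    assert!(offset < 16);
    // offset < 16 <= rest.len(), so this index needs no second panic branch.
    rest[offset]
}
```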

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

I don't exactly know how, but I think the optimizer does back-checking whenever you assert something such as integer range checks. Then, it notices that rest must contain at least 16 bytes from the rest.len() >= 16 check.

Interesting! I had no idea you could get it to do that.

Nice, I was also able to get the lookup table optimization for the match by using an enum as the values.

This is optimizing the wrong part, because 0xE2 never comes up anyway, but just for fun... https://godbolt.org/z/9Te1r3ve7

simd_naive was the fastest.

That's not what I expected at all. So maybe I was wrong about shuffle tables, at least for such a small number of "needles". But good to hear the gain from SIMD is significant. Sounds like this effort will be worthwhile. Please do post again when you have more numbers.

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

@Boshen I just thought: Does the irregular whitespace linter rule need to catch irregular whitespace in comments too? Or can they be ignored?

What about in strings?

@Boshen
Member Author

Boshen commented Feb 9, 2024

@Boshen I just thought: Does the irregular whitespace linter rule need to catch irregular whitespace in comments too? Or can they be ignored?

What about in strings?

I can trade for performance by removing irregular whitespace collection from the lexer.

I can implement the irregular whitespace rule differently.

@overlookmotel
Collaborator

The new byte-search macro is now merged into main (#2352) and used to lex identifiers, and also some whitespace. PR to apply it to strings too is pending (#2357).

If anyone wants to SIMD-ize identifiers, please go ahead now. I'd also be interested to hear if you think the byte_search! macro and ByteMatchTable types are amenable to being extended to use SIMD.

@dyxushuai

PR #2338 now supports identifiers, strings, and whitespace.

@Boshen
Member Author

Boshen commented Feb 20, 2024

Saw a relevant PR: https://github.com/pydantic/jiter/pull/65
