
SIMD in the lexer #2296

Open
Boshen opened this issue Feb 4, 2024 · 31 comments

@Boshen
Member

Boshen commented Feb 4, 2024

As discussed in #2285, we'd like to embark on the journey of SIMD.

The current state of SIMD in Rust is workable; successful projects include https://github.com/rusticstuff/simdutf8 and https://github.com/BurntSushi/memchr.

portable-simd is not an option due to its instability.


In this task, you will implement one of the tokenization methods for the following tokens in the lexer.

Specification of these tokens can be found at https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar

To get things working, you only need to implement it for the architecture of your development machine.

Some of these are easy; @overlookmotel drafted some algorithms:

Multi-line comments:

  • First, search for */, \r, \n, or 0xE2 (the 1st byte of either LS or PS).
  • Once a line break is found, search only for */ or 0xE2 (it no longer matters whether there are further line breaks).
  • Put handling of 0xE2 on a #[cold] branch, as irregular line breaks (and other Unicode characters starting with 0xE2) are rare.
  • Whether all other characters are ASCII or Unicode is irrelevant; no need to check.
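That multi-line comment scan can be sketched in scalar form (illustrative only; a real version would chew through bytes in SIMD-width chunks and put the 0xE2 handling on a #[cold] path, and `scan_multiline_comment` is a made-up name):

```rust
/// Scalar sketch of the multi-line comment scan described above.
/// Input is the bytes following the opening `/*`. Returns the offset just
/// past the closing `*/`, plus whether a line break was seen, or None if
/// the comment is unterminated.
fn scan_multiline_comment(bytes: &[u8]) -> Option<(usize, bool)> {
    let mut newline = false;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'*' if bytes.get(i + 1) == Some(&b'/') => return Some((i + 2, newline)),
            b'\r' | b'\n' => {
                newline = true;
                i += 1;
            }
            // 0xE2 is the first byte of LS (U+2028) and PS (U+2029).
            0xE2 => {
                if let Some(&[0x80, 0xA8 | 0xA9]) = bytes.get(i + 1..i + 3) {
                    newline = true;
                    i += 3;
                } else {
                    i += 1;
                }
            }
            _ => i += 1,
        }
    }
    None // unterminated comment
}
```

(For simplicity this keeps checking for \r and \n after the first line break, where the algorithm above would stop looking for them.)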

Whitespace:

  • Search for the 1st byte which is ASCII but not a space or \t, or which is >= 128 (Unicode).
  • Handle Unicode in a #[cold] branch which checks whether the char is irregular whitespace or not.

i.e. in both, assume input will be ASCII and chew through it as fast as possible. De-opt to a slow path for Unicode cases, which do need to be handled but should be rare in practice.
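That ASCII-fast/Unicode-slow split might look like this (a sketch only; function names and the partial irregular-whitespace set are illustrative, not oxc's actual code):

```rust
/// Skip whitespace assuming ASCII input, de-opting to a #[cold] function
/// for bytes >= 128 so the compiler optimizes for the common path.
fn skip_whitespace(bytes: &[u8]) -> usize {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b' ' | b'\t' => i += 1,
            b if b >= 128 => match irregular_whitespace_len(&bytes[i..]) {
                Some(n) => i += n,
                None => break,
            },
            _ => break, // non-whitespace ASCII: token boundary
        }
    }
    i
}

#[cold]
#[inline(never)]
fn irregular_whitespace_len(bytes: &[u8]) -> Option<usize> {
    // Decode one char and check it against the irregular-whitespace set
    // (only a partial set shown; the full set is in the ECMAScript spec).
    let c = std::str::from_utf8(bytes).ok()?.chars().next()?;
    matches!(c, '\u{00A0}' | '\u{1680}' | '\u{FEFF}').then(|| c.len_utf8())
}
```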


Please leave a comment if you'd like to participate.

@dyxushuai

I want to give it a try

@Boshen
Member Author

Boshen commented Feb 4, 2024

I want to give it a try

Go ahead! Please report progress so other contributors don't end up doing the same thing, i.e. which routine and which arch are you targeting?

@overlookmotel
Collaborator

Thanks for opening this Boshen.

Could people please hold off on tackling identifiers for now? I have a major refactor of lexing identifiers underway which is designed to make it much easier to bring in SIMD later. Should have that done within a week.

A couple of thoughts for anyone who's interested in doing this:

Most of these tasks relate to searching through a block of bytes to find one of a set of "end characters". The "pshufb" instruction (or I think the ARM equivalent is "tbl") I believe will be most efficient for this. I found this article useful on the subject:

http://0x80.pl/articles/simd-byte-lookup.html
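The core trick in that article is a 16-entry nibble lookup: a table indexed by a byte's low nibble holds a bitmask of which high nibbles occur in the needle set. Here is a scalar model of the simplest variant (ASCII needles only); the SIMD version performs 16 or 32 such table lookups in a single pshufb/tbl instruction:

```rust
/// Scalar model of the SIMD byte-set lookup from the 0x80.pl article.
/// Each table entry, indexed by a byte's low nibble, is a bitmask of the
/// high nibbles present in the needle set.
fn build_table(needles: &[u8]) -> [u8; 16] {
    let mut table = [0u8; 16];
    for &b in needles {
        assert!(b < 0x80, "this simple variant only handles ASCII needles");
        table[(b & 0x0F) as usize] |= 1 << (b >> 4);
    }
    table
}

/// A byte is in the set iff the bit for its high nibble is set in the
/// table entry for its low nibble.
fn in_set(table: &[u8; 16], b: u8) -> bool {
    b < 0x80 && table[(b & 0x0F) as usize] & (1 << (b >> 4)) != 0
}
```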

Secondly, please see #2288. Hopefully that'll be merged soon, and it adds APIs to the lexer for consuming the source text as bytes instead of chars. That will probably be useful for these tasks.

@dyxushuai

dyxushuai commented Feb 5, 2024

I want to give it a try

Go ahead! Please report progress so other contributors don't end up doing the same thing, i.e. which routine and which arch are you targeting?

I'll begin with identifiers, for both ARM and x86 platforms.

@overlookmotel
Collaborator

@dyxushuai Please see my comment above. I suggest it'd be better to start with one of the others - not identifiers.

I have a major overhaul of lexing identifiers in the works, which I expect to turn into a PR within a week. Currently it's on my fork of OXC: overlookmotel#36

@dyxushuai

dyxushuai commented Feb 5, 2024

@dyxushuai Please see my comment above. I suggest it'd be better to start with one of the others - not identifiers.

I have a major overhaul of lexing identifiers in the works, which I expect to turn into a PR within a week. Currently it's on my fork of OXC: overlookmotel#36

@overlookmotel Sorry, I didn't pay close attention to the details of your PR. I'll start with the second one, strings, and keep an eye on your PR as well. CC @Boshen

FYI: There are already some good implementations for quickly searching characters, like 1, 2

Footnotes

  1. https://github.com/seanmonstar/httparse/tree/master/src/simd

  2. https://github.com/rust-lang/hashbrown/tree/master/src/raw

@overlookmotel
Collaborator

Thanks. And thanks for the links. Very interesting.

httparse looks like it's a practical implementation of the "pshufb" lookup table technique discussed in the article I mentioned earlier (certainly the SSE4.2 implementation, which is using _mm_shuffle_epi8). Really great to see an actual Rust implementation of that.

My PR for identifiers is a total mess at present, so not worth reading really. But point is that it completely churns up the code that's in main branch at present, so it'd be difficult for us both to work on that area simultaneously.

By the way, in case you didn't notice, #2288 was merged today. That provides some more efficient APIs for reading source as bytes, which may come in useful.

@overlookmotel
Collaborator

@sno2 also seems to be working in this area, though his PR #2327 uses memchr (which uses SIMD internally) rather than a hand-coded implementation.

@overlookmotel
Collaborator

overlookmotel commented Feb 5, 2024

PS: One thing I missed in the algorithms quoted above:

0x0B and 0x0C (irregular whitespace) also need to be handled to create trivia for them. But, again, #[cold] path.

@dyxushuai

dyxushuai commented Feb 6, 2024

httparse looks like it's a practical implementation of the "pshufd" lookup table technique discussed in article I mentioned earlier (certainly the SSE4.2 implementation, which is using _mm_shuffle_epi8). Really great to see an actual Rust implementation of that.

httparse also makes a very interesting small optimization, roughly (pseudocode):

while bytes.len() >= 32 {
    // use AVX2 to process 32 bytes in parallel
    bytes.advance(32);
}
// fall back to SSE4.2
while bytes.len() >= 16 {
    // use SSE4.2 to process 16 bytes in parallel
    bytes.advance(16);
}
// fall back to register-width parallel processing
while bytes.len() >= core::mem::size_of::<usize>() {
    // process one register-sized word at a time
    bytes.advance(core::mem::size_of::<usize>());
}
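The final register-width tier ("SWAR") needs no intrinsics at all and is fully portable. A sketch of the classic zero-byte trick, here with u64 words (illustrative, not httparse's actual code):

```rust
/// SWAR: find the first occurrence of `needle` by scanning one u64-sized
/// word at a time, using the classic "has zero byte" bit trick.
fn find_byte_swar(haystack: &[u8], needle: u8) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pattern = u64::from_ne_bytes([needle; 8]); // needle in every byte

    let mut i = 0;
    while i + 8 <= haystack.len() {
        let word = u64::from_ne_bytes(haystack[i..i + 8].try_into().unwrap());
        // XOR zeroes the matching bytes; the expression below sets the high
        // bit of every byte that was zero (false positives can occur in
        // bytes above the first match, which the rescan below tolerates).
        let x = word ^ pattern;
        if x.wrapping_sub(LO) & !x & HI != 0 {
            // A match is somewhere in this word; pin it down scalar-wise.
            return haystack[i..i + 8].iter().position(|&b| b == needle).map(|p| i + p);
        }
        i += 8;
    }
    // Scalar tail for the final < 8 bytes.
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```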

@overlookmotel
Collaborator

overlookmotel commented Feb 6, 2024

httparse also makes a very interesting small optimization, roughly (pseudocode)

That's really smart!

I have also been wondering whether we could do something like what ratel does. At the start of parsing, it copies the entire source text into a buffer that's 1 byte longer than the source, and adds a null byte on the end to be a sentinel for EOF.

Then the lexer never has to bounds-check "have I reached the end?" before reading. As long as the EOF sentinel has not been read yet, then it's not at the end.

We could extend this approach: Instead of adding a single null byte to the end, add 32 x null bytes. Then when iterating through the source, there's always guaranteed to be enough bytes remaining to do a 32-byte read without any bounds check. So then no need for specialised handling of "less than 32, but at least 16 bytes remaining". There's always 32.

Side note: I'm not sure why ratel uses 0 as the sentinel. 0xFF could be better, as it cannot appear in valid UTF-8 strings, so it's an unambiguous sentinel value.
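A sketch of the padded-buffer idea, using 0xFF as the sentinel per the side note (a hypothetical type, not ratel's or oxc's actual API):

```rust
const PADDING: usize = 32;
const SENTINEL: u8 = 0xFF; // can never appear in valid UTF-8

/// Source text copied into a buffer with 32 sentinel bytes appended, so a
/// 32-byte read from any in-bounds position never needs a bounds check.
struct PaddedSource {
    buf: Vec<u8>,
    len: usize, // length of the real source, excluding padding
}

impl PaddedSource {
    fn new(source: &str) -> Self {
        let len = source.len();
        let mut buf = Vec::with_capacity(len + PADDING);
        buf.extend_from_slice(source.as_bytes());
        buf.extend(std::iter::repeat(SENTINEL).take(PADDING));
        PaddedSource { buf, len }
    }

    /// A 32-byte chunk starting at `pos`; in bounds for any pos <= len,
    /// so the slice below can never panic.
    fn chunk_at(&self, pos: usize) -> &[u8; PADDING] {
        debug_assert!(pos <= self.len);
        self.buf[pos..pos + PADDING].try_into().unwrap()
    }
}
```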

My questions about this approach are:

  1. Does the cost of copying the whole source text at the start outweigh the gain of removing bounds checks everywhere else?
  2. Could we add an entry point to the parser which takes a mutable String as source text? If that String already has 32 bytes of excess capacity, we could apply this approach without the cost of copying the source text.
  3. For &str input, could we use page table tricks like slice_deque does to create an extended buffer, while only copying the last chunk of the source?
  4. Is there a way to build a safe abstraction which statically forces every part of the lexer to check for and handle the EOF sentinel? (rather than making the entire lexer a pit of unsafe)

For me at least, (4) is the biggest question mark.

Anyway... perhaps that's a side-track, but just wanted to mention it. The relevant parts of ratel, if you're interested:

https://github.com/ratel-rust/ratel-core/blob/master/ratel/src/lexer/mod.rs#L630-L653
https://github.com/ratel-rust/toolshed/blob/master/src/arena.rs#L266-L280

@dyxushuai

add 32 x null bytes

Do you mean adding padding bytes at the end of the source text, and making its length an exact multiple of 32?

@dyxushuai

I'm not sure why ratel uses 0 as the sentinel.

I think it's a habit that comes from C++.

@overlookmotel
Collaborator

Do you mean adding padding bytes at the end of the source text, and making its length an exact multiple of 32?

Yes, exactly that. Except the padded length doesn't need to be a multiple of 32; it just needs to be true that:

  1. Wherever you are in the source, there are always at least 32 bytes more (of either source, or padding).

If you want to do aligned reads, you could add a bit more padding, to also guarantee:

  1. Wherever you are in source, if you round up the current pointer to next 32-byte aligned position, there are still at least 32 bytes more.

@Boshen
Member Author

Boshen commented Feb 6, 2024

Previous attempt on padding the source:


http://0x80.pl/articles/simd-byte-lookup.html

I can help contact Wojciech Muła on twitter if we have questions to ask.

@overlookmotel
Collaborator

overlookmotel commented Feb 6, 2024

Thanks Boshen. Good to know this has been tried before. I'll read through those issues. May be a dead end, but maybe it could make more sense now, if it improves potential for SIMD.

@Boshen
Member Author

Boshen commented Feb 6, 2024

Yeah it makes more sense now, we can give it another try.

@dyxushuai

@Boshen @overlookmotel Do you have a benchmark setup or framework?

@Boshen
Member Author

Boshen commented Feb 6, 2024

@Boshen @overlookmotel Do you have a benchmark setup or framework?

The current setup is only observable with CodSpeed on GitHub Linux machines.

I think we need another github repository for a setup similar to https://github.com/BurntSushi/memchr/blob/master/.github/workflows/ci.yml

We also need to use criterion instead of CodSpeed, because CodSpeed only works on Linux. I used to do the benchmark diff like this:

- name: Run Bench on PR Branch
  run: cargo benchmark --save-baseline pr
- name: Checkout Main Branch
  uses: actions/checkout@v3
  with:
    clean: false
    ref: main
- name: Install Rust Toolchain
  uses: ./.github/actions/rustup
- name: Compile
  run: cargo build --release -p oxc_benchmark
- name: Run Bench on Main Branch
  run: cargo benchmark --save-baseline main

@sno2
Contributor

sno2 commented Feb 8, 2024

I figured we should move conversations from #2327 here. @overlookmotel:

I'm currently working on a rewrite of lexing identifiers. It doesn't use SIMD, but is designed to be easily convertible to SIMD later: https://github.com/overlookmotel/oxc/blob/lexer-string5/crates/oxc_parser/src/lexer/identifier.rs
(please excuse the messy code, still very much a work in progress)

You're right about having to bust out to scalar if Unicode chars are found, but the approach I've taken is to speed through ASCII as fast as possible and make Unicode a de-opt (on a #[cold] path, so the compiler assumes it won't happen in most cases). This approach seems to work: it produces about a 60% speed-up on the lexer benchmarks. Obviously using SIMD would be a further speed boost.

This seems pretty good. I have been playing around with implementing various SIMD and parsing optimizations for the skip_single_line_comment algorithm.

This uses some interesting code design, with help from orlp in the Rust Discord, to avoid panic checking and let Rust optimize the assembly for the loops instead of dancing with raw pointers. Furthermore, it implements the SIMD-based shuffling and naive compare methods for 16-byte and 32-byte widths.

Interestingly enough, the most performant algorithms on my machine have been (1x) simd_naive, (1.34x ± 0.12) memchr, and (1.73x ± 0.09) simd_shuffle. This was a benchmark that just went through the comment bodies scraped from the React development file's source. My machine also does not have avx2, so I am unable to run the wide variants and I would expect the shuffle variant to beat the naive versions once we have more bytes to search.

Hopefully these same approaches can be adapted into the various parsing algorithms in oxc.

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

Nice! I'm heading for bed now so will have to look at that tomorrow (ditto dyxushuai's draft PR #2338).

In meantime, just to share where I'm up to with identifiers as it overlaps with this effort:

I've now refactored byte searching into a generic "byte table search" macro. My intent is this same macro can be used for all other searches discussed here (except for multi-line comments which are different as you're looking for */, not just a single byte).

I've used a macro rather than a function, as I found it's faster if all the code is in one function together (even #[inline(always)] is a few % slower).

The macro already searches in 32-byte blocks, so I was wondering if, once it's merged, that macro could be SIMD-ized, by replacing the main "check a block of 32" loop with a SIMD implementation? Ditto ByteMatchTable - could it also construct the lookup arrays for SIMD shuffles at compile time?

If either of you have any suggestions of how to make the macro more amenable to that, please do let me know.

Macro: lexer/search.rs
Usage of macro for identifiers: lexer/identifier.rs#L49-L104

@dyxushuai

dyxushuai commented Feb 8, 2024

Nice! I'm heading for bed now so will have to look at that tomorrow (ditto dyxushuai's draft PR #2338).

Thanks, this draft is still in its early stages. There are some major refactorings that need to be done, such as using u8 (ascii) as the delimiter type instead of char for SIMD-based character searching and so on.

@overlookmotel
Collaborator

@sno2 I couldn't resist taking a look at your experiment (bedtime reading).

A few questions:

(1x) simd_naive, (1.34x ± 0.12) memchr, and (1.73x ± 0.09) simd_shuffle. I would expect the shuffle variant to beat the naive versions once we have more bytes to search.

Are you saying simd_naive was the fastest? Or slowest?

And how does SIMD compare to scalar? Does it move the needle significantly?

Out of interest, how do scalar and scalar_lookup compare?

This was a benchmark that just went through the comment bodies scraped from the React development file's source.

If you were testing the string of each comment individually, I'm not sure it'd be very representative. When parsing a large JS file, you will very rarely hit the "not enough bytes left, fallback to scalar" path. Whereas if you're feeding it each comment as an individual string, it'd hit that slow path every time for the end of the comment. (I may well have misunderstood here and that's not what you're doing)

Another thing... You might find there's a speed-up from marking the branches for 0xE2 as #[cold]. i.e. instead of:

if rest[offset_in_chunk] == XX {
  index += skip_single_line_comment_scalar(&rest[offset_in_chunk..]);
}

do:

if rest[offset_in_chunk] == XX {
  #[cold]
  fn handle_e2(rest: &[u8], offset_in_chunk: usize, index: usize) -> usize {
    index + skip_single_line_comment_scalar(&rest[offset_in_chunk..])
  }
  return handle_e2(rest, offset_in_chunk, index);
}

We have to handle irregular whitespace for the linter, but in practice don't actually expect to find any 99% of the time, and the above communicates that to the compiler. Ugly as hell, but I've had success with it elsewhere in OXC.

Lastly, in the scalar lookup version, I think this:

match SCALAR_LOOKUP[byte as usize] {
  1 => return idx,
  2 => match &string[idx + 1..] {
    &[0x80, 0xA8 | 0xA9, ..] => return idx,
    _ => {}
  },
  _ => {}
}

could be:

match SCALAR_LOOKUP[byte as usize] {
  0 => {},
  1 => return idx,
  // This is UTF-8 so guaranteed 2 more bytes after 0xE2
  2 => match unsafe { string.get_unchecked(idx + 1..idx + 3) } {
    &[0x80, 0xA8 | 0xA9] => return idx,
    _ => {}
  },
  // This may allow compiler to turn this match into a jump table
  // (which may or may not be faster)
  _ => unsafe { std::hint::unreachable_unchecked() }
}

or again, use #[cold] on the match arm for option 2.

@overlookmotel
Collaborator

One more thing (sorry):

let offset_in_chunk = mask.trailing_zeros() as usize;
assert!(offset_in_chunk < 16); // eliminate bound checks
index += offset_in_chunk;
if rest[offset_in_chunk] == XX {

How does the assertion eliminate the bounds check? Doesn't it just turn into a branch with a panic on it?

@sno2
Contributor

sno2 commented Feb 8, 2024

Are you saying simd_naive was the fastest? Or slowest?

And how does SIMD compare to scalar? Does it move the needle significantly?

Out of interest, how do scalar and scalar_lookup compare?

simd_naive was the fastest. SIMD was around 3-5x faster than scalar. scalar_lookup was consistently a bit faster than scalar. I will get the CI/CD benchmarking thing working with more test cases so that we can see fair results.

If you were testing the string of each comment individually, I'm not sure it'd be very representative. When parsing a large JS file, you will very rarely hit the "not enough bytes left, fallback to scalar" path. Whereas if you're feeding it each comment as an individual string, it'd hit that slow path every time for the end of the comment. (I may well have misunderstood here and that's not what you're doing)

It goes through the source as a whole (like an iterator), so no.

Another thing... You might find there's a speed-up from marking the branches for 0xE2 as #[cold]. i.e. instead of:

I was unable to reproduce a speedup on my machine, so I took that out. More accurate benchmarking will allow me to see better how it affects the code.

Lastly, in the scalar lookup version, I think this:

Nice, I was also able to get the lookup table optimization for the match by using an enum as the values. Sadly, I think we do need unsafe for the faster 2-byte index equality, though.

https://godbolt.org/z/d6f3xd3fK

How does the assertion eliminate the bounds check? Doesn't it just turn into a branch with a panic on it?

I don't know exactly how, but I think the optimizer propagates facts backwards whenever you assert something such as an integer range check. Then it notices that rest must contain at least 16 bytes from the rest.len() >= 16 check.
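A minimal illustration of the pattern (a hypothetical function; whether the bounds check is really eliminated is best confirmed by inspecting the assembly on godbolt):

```rust
/// The up-front assertions give the optimizer range facts (rest.len() >= 16
/// and offset < 16), so the slice index below can be proven in bounds and
/// LLVM can drop its per-access bounds-check branch.
fn byte_at_offset(rest: &[u8], offset: usize) -> u8 {
    assert!(rest.len() >= 16);
    assert!(offset < 16);
    // offset < 16 <= rest.len(), so this index needs no second panic branch.
    rest[offset]
}
```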

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

I don't exactly know how, but I think the optimizer does back-checking whenever you assert something such as integer range checks. Then, it notices that rest must contain at least 16 bytes from the rest.len() >= 16 check.

Interesting! I had no idea you could get it to do that.

Nice, I was also able to get the lookup table optimization for the match by using an enum as the values.

This is optimizing the wrong part, because 0xE2 never comes up anyway, but just for fun... https://godbolt.org/z/9Te1r3ve7

simd_naive was the fastest.

That's not what I expected at all. So maybe I was wrong about shuffle tables, at least for such a small number of "needles". But good to hear the gain from SIMD is significant. Sounds like this effort will be worthwhile. Please do post again when you have more numbers.

@overlookmotel
Collaborator

overlookmotel commented Feb 8, 2024

@Boshen I just thought: Does the irregular whitespace linter rule need to catch irregular whitespace in comments too? Or can they be ignored?

What about in strings?

@Boshen
Member Author

Boshen commented Feb 9, 2024

@Boshen I just thought: Does the irregular whitespace linter rule need to catch irregular whitespace in comments too? Or can they be ignored?

What about in strings?

I can trade for performance by removing irregular whitespace collection from the lexer.

I can implement the irregular whitespace rule differently.

@overlookmotel
Collaborator

The new byte-search macro is now merged into main (#2352) and used to lex identifiers, and also some whitespace. PR to apply it to strings too is pending (#2357).

If anyone wants to SIMD-ize identifiers, please go ahead now. I'd also be interested to hear if you think the byte_search! macro and ByteMatchTable types are amenable to being extended to use SIMD.

@dyxushuai

PR #2338 now supports identifiers, strings, and whitespace.

@Boshen
Member Author

Boshen commented Feb 20, 2024

Saw a relevant PR: https://github.com/pydantic/jiter/pull/65
