Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential performance issue: using serde's generic sequence (de)serialization instead of serde's bytes support #204

Open
thomcc opened this issue Aug 14, 2020 · 0 comments

Comments

@thomcc
Copy link

thomcc commented Aug 14, 2020

I had been intending at one point to get y'all a PR for this as a way to dip my toes into rkv, which is a cool project. That said, recent events have made this unlikely, so I at least wanted to let you know of the issue.

Loading or persisting an instance of the Safe backend ultimately involves reading/writing instances of https://github.com/mozilla/rkv/blob/master/src/backend/impl_safe/snapshot.rs#L26-L36 and some other stuff to disk using bincode.

There's a performance issue here: In general, in serde serializing Vec<u8>, Box<[u8]> and &[u8] is surprisingly inefficient. This is because it can't specialize these for u8, and has to serialize the whole sequence generically. Serde actually has support for doing the efficient thing here, but you have to opt-in by accepting various visit_bytes methods on deserializers.

For example, by default serde will individually deserialize every byte like this:

fn visit_seq<V: SeqAccess<'de>>(self, mut visitor: V) -> Result<Buffer, V::Error> {
    let mut v = Vec::with_capacity(cautious::size_hint(visitor.size_hint());
    while let Some(b) = visitor.next_element()? {
        v.push(b);
    }
    Ok(Buffer(v.into()))
}

But they bytes visitors accept the whole chunk in one go, like this:

fn visit_bytes<E: de::Error>(self, v: &[u8]) -> Result<Buffer, E> {
    Ok(Buffer(v.into()))
}
fn visit_byte_buf<E: de::Error>(self, v: Vec<u8>) -> Result<Buffer, E> {
    Ok(Buffer(v.into()))
}

Which deserializes all bytes at once. Note that there's a helpful crate that provides wrappers that do this in part for you: https://github.com/serde-rs/bytes . It uses Vec<u8> for its owned type, and so if the distinction matters to you, you'll need to copy its impl and use Box<[u8]> instead. Its not much code..

It should be fully compatible with what you're doing now, only faster, but obviously you should test. My experience is that changing bincode code to use serde_bytes does not seem to impact the serialized format at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant