
wip: integrate binary data into payloads #283

Closed
wants to merge 6 commits

Conversation

@kelnos (Contributor) commented Mar 7, 2024

Previously, the non-binary part of a message and the binary payloads in a message were represented separately: the non-binary portion was represented by a serde_json::Value, and could be converted to an arbitrary data structure. That data structure would not include the binary data or any indication that there is any binary data at all. The binary data would be provided in a Vec<Vec<u8>>.

There were a few problems with this:

  1. The original code only supported cases where the payload was a flat array with some binary payloads in the root of the array, or a flat object where the root of the object was a binary payload. Objects with more complicated structure and binary data embedded in various places in the structure were not supported.
  2. Adding support for the above turned out to not be possible in a useful way, because the ordering of the Vec<Vec<u8>> matters, and it could never be clear where exactly in the possibly-complex structure each binary payload belonged.
  3. One of the useful features of the socket.io protocol is that it lets users efficiently transmit binary data in addition to textual/numeric data, and have that handled transparently by the protocol, with either end of the connection believing that they just sent or received a single mixed textual/numeric/binary payload. Separating the non-binary from the binary negates that benefit.

This introduces a new type, PayloadValue, that behaves similarly to serde_json::Value. The main difference is that it has a Binary variant, which holds a numeric index and a Vec<u8>. This allows us to include the binary data where the sender of that data intended it to be.
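As a sketch of that idea (the variant set and the helper below are assumptions modeled on serde_json::Value, not the PR's actual definitions), a PayloadValue-like tree with an indexed Binary variant lets binary payloads be recovered in placeholder order from anywhere in the structure:

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the PayloadValue idea: serde_json::Value-like,
// plus a Binary variant carrying the payload's placeholder index.
#[derive(Debug, Clone, PartialEq)]
pub enum PayloadValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Binary(usize, Vec<u8>), // (placeholder index, raw bytes)
    Array(Vec<PayloadValue>),
    Object(BTreeMap<String, PayloadValue>),
}

impl PayloadValue {
    /// Collect every binary payload, ordered by placeholder index,
    /// however deeply it is nested in the structure.
    pub fn extract_binary_payloads(&self) -> Vec<Vec<u8>> {
        let mut found: Vec<(usize, Vec<u8>)> = Vec::new();
        self.walk(&mut found);
        found.sort_by_key(|(idx, _)| *idx);
        found.into_iter().map(|(_, data)| data).collect()
    }

    fn walk(&self, out: &mut Vec<(usize, Vec<u8>)>) {
        match self {
            PayloadValue::Binary(idx, data) => out.push((*idx, data.clone())),
            PayloadValue::Array(items) => {
                for item in items {
                    item.walk(out);
                }
            }
            PayloadValue::Object(map) => {
                for value in map.values() {
                    value.walk(out);
                }
            }
            _ => {}
        }
    }
}

fn main() {
    // Binary data nested inside an object inside an array is still found,
    // and comes back in placeholder order.
    let mut obj = BTreeMap::new();
    obj.insert("file".to_string(), PayloadValue::Binary(1, vec![4, 5, 6]));
    let value = PayloadValue::Array(vec![
        PayloadValue::Binary(0, vec![1, 2, 3]),
        PayloadValue::Object(obj),
    ]);
    assert_eq!(
        value.extract_binary_payloads(),
        vec![vec![1, 2, 3], vec![4, 5, 6]]
    );
}
```

This is what makes point 1 above fixable: the index travels with the bytes, so nesting depth no longer matters.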

There is currently one wrinkle: serde_json does not appear to consistently handle binary data; when serializing a struct with a Vec<u8>, I believe it will serialize it as an array of numbers, rather than recognize that it's binary data. For now, I've included a Binary struct that wraps a Vec<u8>, which can be included as the type of a binary member, instead of using a Vec<u8> directly. Hopefully I'll be able to figure out a better way to do this.

Unfinished tasks:

  • Testing: I have no idea if this even works yet. All I've done is get it to compile.
  • Benchmarking: I've tried to ensure that I don't copy data any more than the existing library does, but it's possible I've introduced some performance regressions, so I'll have to look into that.
  • Documentation: the documentation still references the old way of doing things and needs to be updated.

Closes #276.

@kelnos (Contributor Author) commented Mar 7, 2024

Hi @Totodore, here's what I have so far. I hate PRs that have so many changed lines, but the changes to the library itself are actually not that large. The first commit is all added lines: the addition of PayloadValue and its methods, serializers, and deserializers. The second commit is much smaller (+382 -334 lines), and adapts the library to use PayloadValue.

@kelnos kelnos force-pushed the integrated-data-binary branch 3 times, most recently from 0164625 to be9b501 Compare March 7, 2024 10:25
@kelnos kelnos force-pushed the integrated-data-binary branch 3 times, most recently from 2243d6a to 72115a0 Compare March 7, 2024 11:25
    _: Option<i64>,
) -> Result<Self, Infallible> {
-    Ok(Bin(bin))
+    Ok(Bin(v.extract_binary_payloads()))
@kelnos (Contributor Author) commented:

I've left in the Bin extractor, and preserved its behavior. Is it worth keeping?

  • There is an example in a doc comment where the Bin extractor is used, and then the binary payloads are emitted along with three different messages, separate from the data in two of them. Maybe that is a use-case that is useful to preserve?
  • A downside of keeping it is that Data would be much more efficient if it implemented FromMessage and not FromMessageParts, since with the latter we have to clone the entire thing.
  • If we do keep it, maybe Bin should instead implement FromMessageParts and clone the binary payloads rather than extracting them from the PayloadValue. Then Data could implement FromMessage, if we assume that the most common use case is that people will use Data and not Bin; that would make the common case more efficient, removing the clone of the data.

I'm leaning toward keeping it and implementing that last point above.

@kelnos (Contributor Author) commented:

One wrinkle: even if I convert Data to FromMessage, I'm thinking TryData should maybe stay FromMessageParts. Someone might want to use TryData but, if it fails, still get the data in another form in order to do something else with it. Like:

socket.on("test", |TryData(data): TryData<MyData>, Data(value): Data<PayloadValue>| async {
    if let Ok(data) = data {
        println!("Got data as expected: {:?}", data);
    } else {
        println!("Data deser failed, payload received was: {:?}", value);
    }
})

Or even implement a handler that can have the data deserialize into multiple possible known types:

|TryData(my_data): TryData<MyData>, TryData(other_data): TryData<OtherData>, Data(last_chance): Data<LastChanceData>|

Thoughts?


[image: screenshot of the printed data]

fn on_handle(socket: SocketRef, ack: AckSender, Bin(bin): Bin, Data(data): Data<PayloadValue>) {
    info!("Received message: {:?}", data);
    ack.send(data.clone()).ok();
    socket.emit("message-back", data.clone()).ok();
}

The picture is the printed data, and the data received in on_handle is decoded from binary to string.

@kelnos (Contributor Author) commented:

Yep, I know things don't really work yet ;) I've already fixed this locally (seems String's default serde impl handles bytes, sigh), but am working on some other fixes as well before I push it.


Thank you, you've worked hard.

@Totodore (Owner) commented:

> I've left in the Bin extractor, and preserved its behavior. Is it worth keeping?
>
>   • There is an example in a doc comment where the Bin extractor is used, and then the binary payloads are emitted along with three different messages, separate from the data in two of them. Maybe that is a use-case that is useful to preserve?
>   • A downside of keeping it is that Data would be much more efficient if it implemented FromMessage and not FromMessageParts, since with the latter we have to clone the entire thing.
>   • If we do keep it, maybe Bin should instead implement FromMessageParts and clone the binary payloads rather than extracting them from the PayloadValue. Then Data could implement FromMessage, if we assume that the most common use case is that people will use Data and not Bin; that would make the common case more efficient, removing the clone of the data.
>
> I'm leaning toward keeping it and implementing that last point above.

I think the best option would be to remove the Bin extractor and the bin() operator, and have Data and TryData implement FromMessage, at least for Data. I still don't know what we should do for TryData; maybe we could start by only implementing FromMessage, and later add a custom FromMessageParts implementation that clones the data. But we still need to think about this.

@kelnos (Contributor Author) commented:

Ok, removed Bin, and made Data implement FromMessage. I left TryData as FromMessageParts for now, but we can revisit that.

@kelnos kelnos force-pushed the integrated-data-binary branch 4 times, most recently from 8d18205 to 47ada89 Compare March 8, 2024 07:11
@kelnos kelnos force-pushed the integrated-data-binary branch 2 times, most recently from 7c191b8 to ddf322b Compare March 9, 2024 09:41
@kelnos (Contributor Author) commented Mar 9, 2024

Did some benchmarking and compared runs with critcmp:

group                                              0.11.0-encode                          integrated-data-binary-encode
-----                                              -------------                          -----------------------------
Encode packet ack on /                             1.01    176.3±2.50ns        ? ?/sec    1.00    174.7±1.77ns        ? ?/sec
Encode packet ack on /custom_nsp                   1.00    183.0±2.07ns        ? ?/sec    1.00    183.2±3.36ns        ? ?/sec
Encode packet binary ack (b64) on /                2.12   483.7±15.09ns        ? ?/sec    1.00    228.1±2.51ns        ? ?/sec
Encode packet binary ack (b64) on /custom_nsp      2.03    488.8±6.23ns        ? ?/sec    1.00    240.6±7.92ns        ? ?/sec
Encode packet binary event (b64) on /              1.78    605.1±2.86ns        ? ?/sec    1.00    339.2±6.57ns        ? ?/sec
Encode packet binary event (b64) on /custom_nsp    1.77    612.0±6.06ns        ? ?/sec    1.00    345.5±3.81ns        ? ?/sec
Encode packet connect on /                         1.01     75.9±1.23ns        ? ?/sec    1.00     75.1±0.52ns        ? ?/sec
Encode packet connect on /custom_nsp               1.01     77.8±0.74ns        ? ?/sec    1.00     77.1±0.68ns        ? ?/sec
Encode packet event on /                           1.00    185.9±2.94ns        ? ?/sec    1.00    185.3±7.23ns        ? ?/sec
Encode packet event on /custom_nsp                 1.00    196.0±2.12ns        ? ?/sec    1.00    195.8±1.89ns        ? ?/sec
Encode packet event with ack on /                  1.01    186.6±1.56ns        ? ?/sec    1.00    183.9±2.07ns        ? ?/sec
Encode packet event with ack on /custom_nsp        1.00    194.1±2.64ns        ? ?/sec    1.00    193.6±2.08ns        ? ?/sec
group                                              0.11.0-decode                          integrated-data-binary-decode
-----                                              -------------                          -----------------------------
Decode packet ack on /                             1.00    382.2±2.05ns        ? ?/sec    1.15    439.9±1.95ns        ? ?/sec
Decode packet ack on /custom_nsp                   1.00    379.0±4.84ns        ? ?/sec    1.20    455.3±3.34ns        ? ?/sec
Decode packet binary ack (b64) on /                1.56    685.7±3.14ns        ? ?/sec    1.00    438.5±3.82ns        ? ?/sec
Decode packet binary ack (b64) on /custom_nsp      1.49    717.5±9.37ns        ? ?/sec    1.00   480.7±30.18ns        ? ?/sec
Decode packet binary event (b64) on /              1.44    827.9±7.43ns        ? ?/sec    1.00    573.4±5.78ns        ? ?/sec
Decode packet binary event (b64) on /custom_nsp    1.42   872.6±27.01ns        ? ?/sec    1.00    614.2±3.47ns        ? ?/sec
Decode packet connect on /                         1.00     82.1±1.50ns        ? ?/sec    1.11     91.0±3.12ns        ? ?/sec
Decode packet connect on /custom_nsp               1.00     86.6±1.34ns        ? ?/sec    1.18    102.3±2.12ns        ? ?/sec
Decode packet event on /                           1.00    478.6±5.45ns        ? ?/sec    1.11   530.1±20.90ns        ? ?/sec
Decode packet event on /custom_nsp                 1.00    506.7±7.16ns        ? ?/sec    1.11    562.6±3.95ns        ? ?/sec
Decode packet event with ack on /                  1.00    479.5±8.71ns        ? ?/sec    1.10    527.2±4.28ns        ? ?/sec
Decode packet event with ack on /custom_nsp        1.00    511.0±7.13ns        ? ?/sec    1.10    562.2±3.83ns        ? ?/sec

The encoding numbers look really good, improved in a few cases by 75%-110%. But decoding is a bit more mixed; a few large improvements of around 50%, but quite a few 10-20% regressions.

Since the decoding regressions are in the cases without binary payloads, my guess is that my PayloadValue implementation is less efficient than serde_json::Value. Will try to spend some time this weekend figuring out why.

@kelnos (Contributor Author) commented Mar 10, 2024

Did some profiling, looked at flamegraphs, made a few changes, but the performance regressions are still there.

This is very strange, though. Decoding of Connect packets has regressed about 7%, but that makes no sense: Connect packets don't use ser/deser at all, and the code path for those packets hasn't changed at all. So I'm wondering if some surrounding code has caused the compiler to optimize things differently. If that's the case, I'm not sure what to do here.

@kelnos (Contributor Author) commented Mar 10, 2024

I'm getting really confused by this, actually. Different runs of the benchmark compare very differently. For example, I just did a run where the Connect packet decoding showed as improved by 18% (whereas my prior run had it regressing by 7%). And then even sometimes during the same run, a decode of a particular packet type will have a wildly different change percent from the decode of that same packet type but with an ack. I wonder if maybe my laptop isn't "quiet" enough to get good data; perhaps I'll try shutting down X and giving it another run.

@Totodore (Owner) commented Mar 11, 2024

> I'm getting really confused by this, actually. Different runs of the benchmark compare very differently. For example, I just did a run where the Connect packet decoding showed as improved by 18% (whereas my prior run had it regressing by 7%). And then even sometimes during the same run, a decode of a particular packet type will have a wildly different change percent from the decode of that same packet type but with an ack. I wonder if maybe my laptop isn't "quiet" enough to get good data; perhaps I'll try shutting down X and giving it another run.

I have approximately the same weird results (but on a GitHub codespace container, so it is not stable). Either the benchmark code is not working properly (though as far as I can tell it is correct), or it is the machine.

Globally, the difference is about +10% for Connect packet decoding. But because the scale is really low and the variance for Connect packet decoding is quite high, I think it is still OK and we can move on.

@kelnos (Contributor Author) commented Mar 12, 2024

Ok, I quit X, ran a baseline against main, then ran it against this branch, comparing with main, 3 times. Numbers are the average change from main:

Encode packet connect on /                        +0.3244%	+1.0224%	+0.2637%
Encode packet connect on /custom_nsp              +7.9232%	-0.1176%	-0.7743%
Encode packet event on /                          +1.6022%	+1.8639%	+2.0130%
Encode packet event on /custom_nsp                +2.6121%	+2.9042%	+3.2108%
Encode packet event with ack on /                 +2.3220%	+1.9855%	+2.2505%
Encode packet event with ack on /custom_nsp       +2.3728%	+2.6042%	+2.4713%
Encode packet ack on /                            +0.9427%	+1.0778%	+1.2282%
Encode packet ack on /custom_nsp                  +2.4452%	+3.6130%	+2.1004%
Encode packet binary event (b64) on /             -40.989%	-43.376%	-43.861%
Encode packet binary event (b64) on /custom_nsp   -41.039%	-43.742%	-43.640%
Encode packet binary ack (b64) on /               -51.311%	-51.461%	-51.580%
Encode packet binary ack (b64) on /custom_nsp     -50.145%	-50.608%	-50.198%
Decode packet connect on /                        -18.877%	-18.676%	-19.289%
Decode packet connect on /custom_nsp              +4.8328%	+5.2217%	+4.9167%
Decode packet event on /                          +1.8359%	+1.3146%	+1.9475%
Decode packet event on /custom_nsp                +5.2678%	+2.1481%	+2.6367%
Decode packet event with ack on /                 +1.7432%	+1.3348%	+1.3040%
Decode packet event with ack on /custom_nsp       +5.2353%	+3.6659%	+2.4317%
Decode packet ack on /                            +5.5802%	+4.6327%	+5.3401%
Decode packet ack on /custom_nsp                  +8.4783%	+8.4328%	+8.2965%
Decode packet binary event (b64) on /             -34.712%	-34.472%	-34.677%
Decode packet binary event (b64) on /custom_nsp   -30.051%	-29.580%	-29.752%
Decode packet binary ack (b64) on /               -36.868%	-37.087%	-37.149%
Decode packet binary ack (b64) on /custom_nsp     -34.869%	-34.986%	-34.482%

With one exception (Encode packet connect on /custom_nsp), it's consistent across all 3 runs. But there are still some oddities: "Decode packet connect on /" is somehow ~19% faster, even though that code hasn't changed at all from what's on main. And even stranger, "Decode packet connect on /custom_nsp" is 5% slower, even though a) it does almost the exact same thing as without the custom nsp, and b) that code also hasn't changed from what's on main.

Unfortunately I'm at a loss as to what to do, though. cargo flamegraph didn't show me anything obvious to target, and the ser/deser for PayloadValue is done in almost the exact same way as for serde_json::Value. (The one notable difference is that if integer map keys are used, my version's deserializer has to do an extra copy of the string... but all of the tests use string map keys, so that's not what's causing this here.)

One thing that's notable, maybe: PayloadValue is a 56-byte type, whereas serde_json::Value is only 32 bytes. I'm not sure whether allocation overhead, or perhaps cache locality, might have something to do with it. The difference appears to be because serde_json::Map<String, Value> is only 24 bytes, while HashMap<String, PayloadValue> is 56. Unfortunately serde_json::Map's methods are only defined when the value type is serde_json::Value, so I couldn't reuse it.

@kelnos (Contributor Author) commented Mar 12, 2024

Oh, it looks like serde_json::Map is just a wrapper around either BTreeMap or IndexMap (depending on whether you compile with the preserve_order feature). If I swap out HashMap for BTreeMap, it shaves the size down, though not all the way: 48 bytes vs. serde_json::Value's size of 32.

We're not going to get all the way down to 32, but I'm surprised it's not down to 40 bytes: the Binary variant holds a usize and Bytes; those are 8 and 32 bytes, respectively, which should give us 40 bytes. I'm surprised it would require any padding for alignment, unless I'm forgetting something and it wants 16-byte alignment for things like usize.
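For what it's worth, the extra 8 bytes are plausibly the enum discriminant rather than padding inside the variant: with no niche available in a usize or a pointer, the tag needs its own 8-byte-aligned slot next to the 40-byte payload. A std-only sketch, where FakeBytes is a hypothetical stand-in matching Bytes' 4-word layout:

```rust
use std::collections::{BTreeMap, HashMap};
use std::mem::size_of;

// Hypothetical stand-in with the same layout as bytes::Bytes:
// four 8-byte words = 32 bytes, 8-byte aligned.
#[allow(dead_code)]
struct FakeBytes {
    ptr: *const u8,
    len: usize,
    data: usize,
    vtable: usize,
}

// Minimal stand-in for the Binary variant discussed above.
#[allow(dead_code)]
enum Sketch {
    Null,
    Binary(usize, FakeBytes), // 8 + 32 = 40 bytes of payload
}

fn main() {
    assert_eq!(size_of::<FakeBytes>(), 32);
    // The 40-byte payload plus a separately stored discriminant,
    // rounded up to 8-byte alignment, gives 48 bytes rather than 40.
    assert_eq!(size_of::<Sketch>(), 48);
    // And the map swap: BTreeMap's root pointer + length is far smaller
    // than HashMap's inline table state + RandomState.
    assert!(size_of::<BTreeMap<String, Sketch>>() < size_of::<HashMap<String, Sketch>>());
}
```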

At any rate, 48 bytes is still better than 56. I'll try to rerun the benchmarks later tonight and see if that changes anything. Feels like a bit of a long shot, though.

@kelnos (Contributor Author) commented Mar 12, 2024

Ok, here are the same numbers as before (against the baseline of the main branch), but now with three more runs using BTreeMap as the holder in the PayloadValue::Object() variant, just as serde_json::Value::Object() does.

                                                  HashMap	HashMap		HashMap		BTreeMap	BTreeMap	BTreeMap
Encode packet connect on /                        +0.3244%	+1.0224%	+0.2637%	+3.7118%	+0.4929%	-0.2489%
Encode packet connect on /custom_nsp              +7.9232%	-0.1176%	-0.7743%	+29.232%	-1.1302%	+14.978%
Encode packet event on /                          +1.6022%	+1.8639%	+2.0130%	+10.888%	+12.810%	+11.635%
Encode packet event on /custom_nsp                +2.6121%	+2.9042%	+3.2108%	+12.844%	+13.418%	+12.542%
Encode packet event with ack on /                 +2.3220%	+1.9855%	+2.2505%	+12.238%	+14.013%	+12.626%
Encode packet event with ack on /custom_nsp       +2.3728%	+2.6042%	+2.4713%	+12.288%	+13.965%	+13.041%
Encode packet ack on /                            +0.9427%	+1.0778%	+1.2282%	+12.757%	+15.188%	+12.443%
Encode packet ack on /custom_nsp                  +2.4452%	+3.6130%	+2.1004%	+13.417%	+11.715%	+11.726%
Encode packet binary event (b64) on /             -40.989%	-43.376%	-43.861%	-41.263%	-41.204%	-40.748%
Encode packet binary event (b64) on /custom_nsp   -41.039%	-43.742%	-43.640%	-41.187%	-41.331%	-40.808%
Encode packet binary ack (b64) on /               -51.311%	-51.461%	-51.580%	-50.663%	-50.812%	-49.995%
Encode packet binary ack (b64) on /custom_nsp     -50.145%	-50.608%	-50.198%	-49.477%	-49.555%	-49.184%
Decode packet connect on /                        -18.877%	-18.676%	-19.289%	-16.085%	-18.892%	-20.472%
Decode packet connect on /custom_nsp              +4.8328%	+5.2217%	+4.9167%	+4.4243%	+5.0582%	+3.3755%
Decode packet event on /                          +1.8359%	+1.3146%	+1.9475%	+2.2657%	+2.1780%	+2.7651%
Decode packet event on /custom_nsp                +5.2678%	+2.1481%	+2.6367%	+6.3334%	+3.2236%	+3.2511%
Decode packet event with ack on /                 +1.7432%	+1.3348%	+1.3040%	+2.2029%	+1.7505%	+4.1861%
Decode packet event with ack on /custom_nsp       +5.2353%	+3.6659%	+2.4317%	+3.6227%	+4.2471%	+8.0672%
Decode packet ack on /                            +5.5802%	+4.6327%	+5.3401%	+4.1947%	+4.3384%	+5.6753%
Decode packet ack on /custom_nsp                  +8.4783%	+8.4328%	+8.2965%	+2.8838%	+8.2279%	+6.6421%
Decode packet binary event (b64) on /             -34.712%	-34.472%	-34.677%	-34.424%	-34.413%	-32.392%
Decode packet binary event (b64) on /custom_nsp   -30.051%	-29.580%	-29.752%	-29.295%	-29.499%	-27.383%
Decode packet binary ack (b64) on /               -36.868%	-37.087%	-37.149%	-35.033%	-35.360%	-33.529%
Decode packet binary ack (b64) on /custom_nsp     -34.869%	-34.986%	-34.482%	-32.803%	-33.091%	-31.718%

I'm so confused. Somehow the performance when encoding non-binary events is much much worse, despite now using the same map implementation serde_json::Value uses. Decoding is... kinda the same?

@Totodore (Owner) commented:
> Ok, here are the same numbers as before (against the baseline of the main branch), but now with three more runs with BTreeMap as the holder in the PayloadValue::Object() variant, just like serde_json::Value::Object() uses.

I don't understand what you changed exactly for this benchmark. It would not surprise me that your custom implementation is slower than serde_json (it has probably been optimized to hell).
What we could do, though, is only delegate serde to PayloadValue for events with binary payloads, and otherwise use the official serde_json::Value for normal events.

Apart from that, I also discovered that serde_json::RawValue exists, and that it could be really useful to only partially deserialize the input payload ([event, ...]) and deserialize the RawValue to T later. But that could be the subject of another PR.

@Totodore (Owner) commented:

I wonder if it would be possible to use a custom with = "binary_deser" module for deserializing binary data rather than reimplementing an entire new PayloadValue, because 90% of the code is basically a copy of serde_json::Value. What do you think?

@kelnos (Contributor Author) commented Mar 12, 2024

> I don't understand what you changed exactly for this benchmark.

I changed PayloadValue::Object(HashMap<String, PayloadValue>) to PayloadValue::Object(BTreeMap<String, PayloadValue>). That mirrors serde_json::Value::Object(), which also uses BTreeMap. I was expecting better numbers after this change, but instead got worse numbers.

> What we could do though is only delegate serde to PayloadValue with binary payloads and otherwise use official serde_json::Value for normal events.

Possibly... the downside there is that the public API would then be a little inconsistent. E.g. Packet::event() would continue to accept a serde_json::Value while Packet::bin_event() would take a PayloadValue. This would also complicate the internals quite a bit.

> Apart from that, I also discovered that serde_json::RawValue exists and that it could be really useful to only partially deserialize the input payload ([event, ...]) and deserialize the RawValue to T later. But it could be the subject of another PR.

Oh neat, I didn't know about that either. I'm not sure that would work, though, since you'd need some kind of heterogeneous array/list implementation that is serde-compatible; that is, the example in the RawValue docs works because there's a wrapper object with keys and values. Not sure how you'd represent something like:

[ String, RawValue(0), RawValue(1), ..., RawValue(n) ]

> I wonder if it would be possible to use a custom with = "binary_deser" module for deserializing binary data rather than reimplementing an entire new PayloadValue.

I assume you're talking about something that socketioxide would provide, but that users would add to their structs with a serde attribute? I don't think that would work, since the deserializer would also need access to the list of binary payloads, and there's no way to get extra data to a deserializer specified that way.

Another thing I'm thinking about is sticking with serde_json::Value, and writing a sort of "wrapper deserializer". Most of it would just delegate to Value's own deserializer implementation, but if we encounter a map with a placeholder in it, it will instead emit a Bytes. I'm not sure if that's actually possible to do, though.

@Totodore (Owner) commented:

> Another thing I'm thinking about is sticking with serde_json::Value, and writing a sort of "wrapper deserializer". Most of it would just delegate to Value's own deserializer implementation, but if we encounter a map with a placeholder in it, it will instead emit a Bytes. I'm not sure if that's actually possible to do, though.

That could be a good thing to explore. For the moment I'm not fond of entirely replacing serde_json, because it is a lot of code and will probably introduce a lot of complexity to the lib. That's why I won't merge all of this immediately. What we could do is either improve this PR by working on it, or, if you find completely different possibilities, don't hesitate to create a new PR :).

Thanks a lot for all of your work and your time :)!

@kelnos (Contributor Author) commented Mar 14, 2024

> For the moment I'm not fond of the entire replacement of serde_json because it is a lot of code and it will probably introduce a lot of complexity to the lib.

Yes, I totally agree. This got pretty large and more complicated than I expected or hoped.

> What we could do is either improve this PR by working on it or if you find completely different possibilities don't hesitate to create a new PR

For now I'll probably continue to work here on this PR. Even if I do something completely different I may just squash it all down.

I've mostly written ser/deser wrappers for serde_json::Value's serializer and deserializer that transparently handle placeholder objects. The serializer was very easy, but the deserializer is a bit more tricky. I'm having some issues with lifetimes (which I've temporarily gotten around by cloning, just so I can move forward to try to prove the concept; but I may need some help there at some point), but it's mostly compiling. Unfortunately, though, it's still more new code than I'd hoped, as some useful things (like serde_json::value::de::MapKeyDeserializer) aren't public, and writing a Deserializer impl isn't trivial.

There is one major downside to this approach: since Value is still the user-facing representation, we can't have a unified text+binary representation in the case where a user maybe doesn't know the full payload format and needs to use Value instead of some struct that implements Deserialize. So we'll still need the separate Data and Bin extractors. And some subtle things will change, like the Value passed will need to still include the placeholder objects (otherwise there's no way for the user to know where the payloads go), while the current code removes them.

But overall I think this is a better way to go than PayloadValue. I don't consider it a total waste of time, as I've learned a lot about how serde & serde_json work internally, and it's been fun to dig into that.

> Thanks a lot for all of your work and your time :)!

You're welcome! Got into this because I needed it for a personal project of mine, and even though it's been a diversion from that project, it's been fun and educational!

@kelnos (Contributor Author) commented Mar 15, 2024

Decided to do a new PR after all; see #284.

@kelnos kelnos closed this Mar 15, 2024