
Slow data channels? #310

Open
cw-rterm opened this issue Jan 1, 2023 · 31 comments

cw-rterm commented Jan 1, 2023

First of all, great job on this library. As a long-time WebRTC and JavaScript/TypeScript fan, I have been searching for something like this for a long time. A library with no external dependencies: amazing!

One thing I noticed is that the data channels are quite slow, at least compared to the wrtc node library. Am I doing something wrong or is this a known issue?

I ran a quick 30-second test uploading an 800 MB file; neither library uploaded the whole file in that time (although I am on fiber here). The results are as follows:

  • wrtc: 70 MB in 30 seconds
  • werift: 3.4 MB in 30 seconds

I wouldn't consider wrtc fast, although it does perform quite a bit better than werift in my quick test here.

The setup:

  • Digital Ocean server and macOS client (both on 1 Gbps internet connections)
  • I have an "agent" on the server and a CLI on my Mac, both running wrtc or werift on each side.
  • The file is on my Mac, being sent to the Digital Ocean server (SFO2); I am in Seattle, WA (not too far away)

The code:

  • To upload the file, I send chunks of various sizes. With the wrtc library I have to be quite careful, because it crashes if the buffer fills up.
  • wrtc is fastest when I send roughly 5000-byte chunks to the data channel (which I have always found strange).
  • werift seems to prefer 65000-byte chunks (although 1 MB chunks performed about the same).
  • I stop submitting to the buffer once the data channel buffer exceeds 16438932 bytes, and start again when it drops below that number.
  • On the server side, I monitor events straight off the data channel to rule out any issues on that end. The server writes to disk using a Node stream; based on the logs, it takes a few milliseconds to write and then waits for the next event.
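The throttling described in the bullets above can be sketched with the standard bufferedamountlow mechanism. This is a minimal sketch, not the actual test code: the interface is a small stand-in for the parts of RTCDataChannel it uses, and the threshold and chunk size are the figures from this setup.

```typescript
// Minimal shape of the channel this sketch relies on (a subset of the
// standard RTCDataChannel API).
interface DataChannelLike {
  bufferedAmount: number;
  bufferedAmountLowThreshold: number;
  send(chunk: Uint8Array): void;
  addEventListener(
    type: "bufferedamountlow",
    listener: () => void,
    options?: { once: boolean }
  ): void;
}

// Values from the setup described above.
const HIGH_WATER_MARK = 16_438_932;
const CHUNK_SIZE = 65_000;

async function sendFile(channel: DataChannelLike, data: Uint8Array): Promise<void> {
  channel.bufferedAmountLowThreshold = HIGH_WATER_MARK;
  for (let offset = 0; offset < data.length; offset += CHUNK_SIZE) {
    if (channel.bufferedAmount > HIGH_WATER_MARK) {
      // Pause until the channel drains below the threshold.
      await new Promise<void>((resolve) =>
        channel.addEventListener("bufferedamountlow", resolve, { once: true })
      );
    }
    channel.send(data.subarray(offset, offset + CHUNK_SIZE));
  }
}
```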

Thanks again for all the work on this library. Let me know if you do have any suggestions or if I can clarify more on the setup/code.


tsightler commented Jan 12, 2023

I don't know the answer to your question, but my project uses werift via a dependency on another project and, since the switch to werift, I've struggled to determine if it is fast enough. The streams I'm processing are about 3Mbps, and it seems like it can barely keep up. Still, I think that's better than the results you posted.

Basically, I've found werift to be incredibly reliable and broadly compatible, but performance is an entirely different story. I've tried to spend some time with the code and, honestly, I didn't see many places where it could be improved, but I'm not a very experienced JavaScript developer and I'm in way over my head. The overall packet path is pretty straightforward, though. When I ran some performance traces, the biggest overhead seemed to be the cost of decrypting SRTP with CTR encryption, because it requires a call to createDecipheriv() for every packet, which just seems costly no matter what.

In summary, it's an amazing project and I've learned a ton from it, but I'm just not sure TypeScript/JavaScript is fast enough to be a streaming engine for hard-core use. That said, between my project and other projects built on the same dependency there are literally 10,000+ users, so I guess that alone says it's good enough for that case, even if just barely. On the other hand, if I needed a higher bitrate, I don't think it would work, except perhaps on higher-end hardware, and even there I struggle to get much performance. On Linux, increasing the UDP buffer at the OS level helped a good bit, but that's not always possible (for example, when running in Docker or without privileges). I really wish Node.js had native WebRTC functionality.

koush (Collaborator) commented Jan 12, 2023

For what it's worth, I'm using werift to handle multiple 4K video streams (16 Mbps) with no issue (that path uses RTP). However, the data channels do seem slow. I believe this is due to either the SCTP implementation or SCTP itself (even reference implementations run at around 5% of TCP's performance). I haven't looked into it much yet.

The RTP path also creates a decipher per packet, but that doesn't seem to slow things down. It runs great on even a Pi 4.

https://demo.scrypted.app/#/demo

@tsightler

Yep, I'm pretty familiar with Scrypted and have run it myself. I think we might just have different definitions of "runs great"; I would call it "runs sufficiently", probably as good as can be done in JS. I'm probably just comparing it with the JavaScript RTP implementations I used prior to WebRTC, where the same streams used 50-65% less CPU for the same bitrate. The traces I took comparing the two were pretty clear that the difference was almost exclusively additional decryption overhead. Compared to something native, werift isn't a great performer, though that's not really a fault of werift itself.


koush commented Jan 12, 2023

The (de)cipher functions all use OpenSSL/native code under the hood, so I'm a little skeptical that's actually the case.

I'm not saying there aren't performance issues (there are), but I think diagnosing them as being in the crypto step is unlikely to be right.

@tsightler

Fair enough; I didn't claim that was definitely the issue. It was just the only obvious difference in the traces compared with the previous RTP code I was using: a lot more time was spent in the decrypt function, which seemed to explain the significantly higher overall CPU usage (to be fair, the previous code used a completely different cipher that was less complex in general).

I understand that native code is still used, of course (in my day job I work with C++ code that encrypts data at hundreds of MB/s), but any time you have to create a new decipher object for every item being decrypted, versus reusing an existing one, it's quite noticeable (some synthetic benchmarks show the latter to be up to 100x faster). So even though it only happens perhaps a few dozen times per second, it's still measurable. I believe this is clearly the cause of the extra CPU overhead, though perhaps not that impactful to the overall performance of the pipeline. Admittedly, I didn't keep digging, since I didn't really have any control over it and have just tried to optimize my side of the code as much as possible.
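The per-packet createDecipheriv cost is easy to measure in isolation. Here is a rough micro-benchmark sketch (not werift code; absolute numbers will vary by machine) comparing a fresh AES-128-CTR decipher per 1200-byte packet against one long-lived decipher fed the same total bytes:

```typescript
import { createDecipheriv, randomBytes } from "node:crypto";

const key = randomBytes(16);
const iv = randomBytes(16);
const packet = randomBytes(1200); // a typical SRTP payload size
const N = 10_000;

// Variant A: a fresh decipher per packet, as the SRTP path does today.
let t = process.hrtime.bigint();
for (let i = 0; i < N; i++) {
  const d = createDecipheriv("aes-128-ctr", key, iv);
  d.update(packet);
}
const perPacket = process.hrtime.bigint() - t;

// Variant B: one long-lived decipher fed the same total bytes.
t = process.hrtime.bigint();
const reusedDecipher = createDecipheriv("aes-128-ctr", key, iv);
for (let i = 0; i < N; i++) reusedDecipher.update(packet);
const reused = process.hrtime.bigint() - t;

console.log(`per-packet: ${perPacket / 1_000_000n} ms, reused: ${reused / 1_000_000n} ms`);
```

Note that variant B is not directly usable for SRTP, since every packet needs its own counter/IV, which is exactly why the Node API forces the per-packet allocation.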


koush commented Jan 12, 2023

@tsightler incidentally, are you primarily sending or receiving data with werift?


tsightler commented Jan 12, 2023

Receiving. Basically, my use case is via ring-client-api: receiving streams from Ring and piping them out via RTSP. Overall, werift works fairly well for this, but CPU use is pretty high relative to the bitrate of the streams being processed, and I don't find the performance amazing.


koush commented Jan 12, 2023

Ah, gotcha. I implemented the non-BUNDLE transport support in werift for ring-client-api, and have an RTSP gateway for Ring in Scrypted as well. My primary werift use case is sending streams with Scrypted, so I never did a performance pass on receiving WebRTC streams. A quick glance at decryptRtp for CTR-encrypted sessions (which Ring uses) suggests some optimizations could be made:

  decryptRtp(cipherText: Buffer, rolloverCounter: number): [Buffer, RtpHeader] {
    const header = RtpHeader.deSerialize(cipherText);

    // this isn't necessary, allocate what is needed immediately or concat below.
    let dst = Buffer.from([]);
    dst = growBufferSize(dst, cipherText.length - this.authTagLength);

    cipherText = cipherText.subarray(0, cipherText.length - this.authTagLength);

    // unnecessary
    cipherText.subarray(0, header.payloadOffset).copy(dst);

    // this allocates per packet buffer that is immediately tossed
    const counter = this.generateCounter(
      header.sequenceNumber,
      rolloverCounter,
      header.ssrc,
      this.srtpSessionSalt
    );
    const cipher = createDecipheriv(
      "aes-128-ctr",
      this.srtpSessionKey,
      counter
    );
    // use subarray which avoids a copy
    const payload = cipherText.slice(header.payloadOffset);
    const buf = cipher.update(payload);
    // no copy needed, use concat below
    buf.copy(dst, header.payloadOffset);

    // dst can be a concat rather than a series of copy calls.
    return [dst, header];
  }
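Applying those inline comments, the rewrite could look roughly like the sketch below. This is a hedged, standalone illustration: decryptRtpPayload is a hypothetical free function taking the session state as parameters, whereas werift's actual method lives on the session object and also returns the parsed RtpHeader.

```typescript
import { createDecipheriv } from "node:crypto";

// Standalone sketch of the suggested rewrite: the header bytes and the
// decrypted payload are joined with a single Buffer.concat instead of
// growBufferSize plus per-region copy calls, and subarray is used
// throughout to avoid copies.
function decryptRtpPayload(
  cipherText: Buffer,
  payloadOffset: number,
  authTagLength: number,
  key: Buffer,
  counter: Buffer
): Buffer {
  // Drop the auth tag without copying.
  const trimmed = cipherText.subarray(0, cipherText.length - authTagLength);
  const cipher = createDecipheriv("aes-128-ctr", key, counter);
  return Buffer.concat([
    trimmed.subarray(0, payloadOffset), // header passes through unencrypted
    cipher.update(trimmed.subarray(payloadOffset)), // decrypt the payload
  ]);
}
```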

ring-client-api also uses rxjs to deliver the packets to the API consumer, which likely adds 2 to 20 layers to the call stack per packet. I don't use it for video processing for that reason and leverage werift directly. https://github.com/dgreif/ring/blob/7fdeafdd853b55be59b7b729a3fb79c33485b58f/packages/ring-client-api/streaming/streaming-session.ts#L154


koush commented Jan 12, 2023

For what it's worth, I found Ring's WebRTC implementation underwhelming and unreliable compared to SIP, so I added SIP back in a forked ring-client-api. I suspect Amazon is running a SIP-to-WebRTC gateway on their backend. I still use WebRTC for Ring Edge devices, as that is the only option (and it provides a local transport).


tsightler commented Jan 12, 2023

I've recently been playing with using werift directly for exactly that reason (rxjs), but so far I haven't found it to be a huge part of the overall performance issue. I've actually considered re-implementing SIP on my side of the code as well, but instead I split the werift streams into dedicated worker threads so that they can at least be scheduled across all cores independent of the main thread. That has managed similar-ish overall performance, just with much higher overall CPU usage.

However, I did some hackish tests with Golang and Pion, and the performance was so much better, and so much lighter on CPU and RAM, that it makes me think I just need to spend some time learning it well enough to write something usable. But as this is just a side hacking project, I'm not sure when, or if, I'll ever get that much time.


koush commented Jan 12, 2023

Taking a look at the js profiler, the werift nack loop is unnecessarily running 200 times per second, even when there is no packet loss. The top anonymous hit:

[profiler screenshot]

  private start() {
    if (this.nackLoop) {
      return;
    }
    this.nackLoop = setInterval(async () => {
      try {
        await this.sendNack();
      } catch (error) {
        log("failed to send nack", error);
      }
    }, 5);
  }


koush commented Jan 12, 2023

but instead I actually split the werift streams into dedicated worker threads so that they can at least be scheduled across all cores independent of the main thread

Using separate worker threads is what I do in Scrypted as well, it schedules nicer with multiple cores and multiple streams.


koush commented Jan 12, 2023

Fixed the nack handler to use a resetting timeout when a nack actually needs to happen.

[profiler screenshot]
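The commit itself isn't reproduced here, but the change described (a one-shot timeout that is armed only when a nack is actually needed, instead of a fixed 5 ms interval) can be sketched like this. NackScheduler and its method names are illustrative, not werift's actual API:

```typescript
// Sketch of on-demand nack scheduling: instead of an interval that fires
// whether or not packets were lost, arm a one-shot timeout when a gap in
// sequence numbers is observed, and let it re-arm via onPacketLost.
class NackScheduler {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private readonly sendNack: () => void,
    private readonly retryMs = 5
  ) {}

  // Call when a missing sequence number is detected.
  onPacketLost(): void {
    if (this.timer) return; // a nack is already scheduled
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.sendNack();
    }, this.retryMs);
  }

  // Call when the gap is filled before the nack fires.
  cancel(): void {
    if (this.timer) clearTimeout(this.timer);
    this.timer = undefined;
  }
}
```

With no packet loss, nothing runs at all, which is what removes the constant 200 (or 50) wakeups per second from the profile.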

@cw-rterm (Author)

I will be honest, most of this is over my head, but I do spend my days as a TypeScript/JavaScript dev, so I'll follow along where I can. I am an avid fan of everything WebRTC, though!

The one optimization I was able to find (by poking around the code base) was adjusting the USERDATA_MAX_LENGTH constant in werift-webrtc/packages/sctp/src/sctp.ts. It was set to 1200, and adjusting it to 5500 gave me a good boost in speed (about a 3-4x improvement), but still quite far from the performance of the wrtc node library (I can't wait to get rid of the wrtc library and its node-gyp dependencies, sigh).

My theory on why updating the constant worked is as follows. In my app I grab a file off the local computer and read it with the fs module. I break the binary into 5500-byte chunks and send them to the data channel. When USERDATA_MAX_LENGTH was 1200, werift was breaking those packets up again before actually sending the file over the network.

Interestingly enough, if I adjusted the USERDATA_MAX_LENGTH constant too high, the "performance boost" went away.
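That theory reduces to simple arithmetic: each application-level send produces ceil(chunkSize / USERDATA_MAX_LENGTH) SCTP fragments. A small illustration (the function name is mine, not werift's; the real constant lives in packages/sctp/src/sctp.ts):

```typescript
// Fragments produced per application-level send, given the SCTP
// user-data limit.
function sctpFragmentsPerSend(appChunkSize: number, userDataMax: number): number {
  return Math.ceil(appChunkSize / userDataMax);
}

console.log(sctpFragmentsPerSend(5500, 1200)); // 5: each app chunk is re-fragmented
console.log(sctpFragmentsPerSend(5500, 5500)); // 1: each app chunk maps to one DATA chunk
```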

Thanks again for all the work here, truly impressive stuff!


koush commented Jan 12, 2023

wrtc node library (I can't wait to get rid of the wrtc library and its node-gyp dependencies (sigh))

I have a wrtc fork that contains prebuilts for every platform so you can avoid the full webrtc build process. There was some work to get it merged into the official repository but it seems abandoned.

https://github.com/koush/node-webrtc
node-webrtc/node-webrtc#714

Performance is very good, although it always uses libav under the hood (it does not support straight packet transfer). And there were some bugs with receiving streams (sending worked fine), which is why I abandoned it in favor of this library. You are correct: the node-gyp build is a nightmare.

node-webrtc/node-webrtc#714 (comment)


cw-rterm commented Jan 13, 2023

Oh wow, very nice. The two platforms I would really like (ideally I can get there with this package) are Alpine Linux and Apple Silicon/ARM (M1/M2, etc.). wrtc worked under Rosetta, but that doesn't quite feel right.

I should also note that even wrtc data channel speeds top out around 3 Mbps, at least for me. Pion and others (some ORTC libraries) report much higher numbers. I attempted to get Pion working with WASM, but that didn't quite go as planned; I didn't put much behind it, since after I found this library it felt like a better fit for my specific project/use case.

Yes, I do hope to move to this package; it actually solved a lot of issues. For example, with wrtc the peer that opens the data channel never gets the open event for that channel. There are other bugs I've had to work around over the last few years/months.

@cw-rterm

Fixed the nack handler to use a resetting timeout when a nack actually needs to happen.

[profiler screenshot]

Is there any way to try this out or get a code sample to adjust things locally?

@tsightler

I had hacked around a little with this wrtc build:
https://www.npmjs.com/package/@cubicleai/wrtc

But I don't really want to do all the work to move, because I've also found it quite buggy and it seems mostly abandoned. And seeing that werift mostly works now, just not with ideal performance, it's hard to be motivated to implement something else entirely.

@tsightler

Taking a look at the js profiler, the werift nack loop is unnecessarily running 200 times per second, even when there is no packet loss. The top anonymous hit:

Interesting find. In the code I have installed (0.17.5) it appears to be only 50 times per second (not that that's good), and the 5 ms value appears to be a very recent change. Here's what it looks like in 0.17.5:

  private start() {
    if (this.nackLoop) {
      return;
    }
    this.nackLoop = setInterval(async () => {
      try {
        await this.sendNack();
      } catch (error) {
        log("failed to send nack", error);
      }
    }, 20);
  }

It was commit 5aded34, just two weeks ago, where it was changed to 5. It clearly should still be fixed, but I guess that's why I hadn't noticed it in any profiles I had run.

@cw-rterm

Oh nice! Where is 0.17.5? I only see 0.17.0 in releases/tags. Do I need to look through the commit history and the package.json files?

[screenshot of the GitHub releases page]

@tsightler

The current version in the develop branch on GitHub is 0.17.6; see https://github.com/shinyoshiaki/werift-webrtc/blob/develop/packages/webrtc/package.json

It's also published via NPM, you can see all NPM published versions here:
https://www.npmjs.com/package/werift?activeTab=versions

The ring-client-api code is currently locked on 0.17.5 so, even though 0.17.6 is out, I still have the slightly older version installed.


koush commented Jan 13, 2023

I'm working off the primary development branch of the fork; the 5 ms interval is there.

Not sure what's in the published release.

Here's what's in my perf improvements branch thus far. koush@c80d487

@tsightler

Thanks very much @koush, I really appreciate the work to optimize werift as much as reasonably possible.


koush commented Jan 13, 2023

The other issue I've found confirms that werift spends a lot of time in the actual encryption calls, which are synchronous and could be made asynchronous and multithreaded (but Node does not provide an API for this). AES-GCM in SRTP lends itself well to multithreading since it's not stream based. Keyframes can be 300 KB or more at times, which means 200 encrypt invocations. While moving this into a worker pool would smooth out main-thread contention, it won't reduce the actual CPU time used.
I suspect that marshaling from JS to Node's native code and OpenSSL 200 times uses a good chunk of CPU time. Doing this as one bulk transaction would likely improve things, but that is not possible without a custom native dependency. I'm also unsure whether Node's OpenSSL build has hardware acceleration enabled for the EVP interface.
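As a sketch of that worker-pool direction (hedged: this is not werift code, and it uses AES-128-CTR rather than GCM for brevity), batching the cipher work into a worker thread looks roughly like this. The structured-clone marshaling per batch is exactly the overhead being debated above:

```typescript
import { Worker } from "node:worker_threads";

// The worker body is inlined as a string and started with eval: true so the
// sketch is self-contained; a real implementation would keep a pool of
// long-lived workers instead of spawning one per batch.
const workerSrc = `
  const { parentPort } = require("node:worker_threads");
  const { createCipheriv } = require("node:crypto");
  parentPort.on("message", ({ key, packets }) => {
    const out = packets.map(({ iv, payload }) =>
      createCipheriv("aes-128-ctr", Buffer.from(key), Buffer.from(iv))
        .update(Buffer.from(payload))
    );
    parentPort.postMessage(out);
  });
`;

function encryptBatch(
  key: Buffer,
  packets: { iv: Buffer; payload: Buffer }[]
): Promise<Buffer[]> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSrc, { eval: true });
    worker.once("message", (bufs: Uint8Array[]) => {
      resolve(bufs.map((b) => Buffer.from(b)));
      void worker.terminate();
    });
    worker.once("error", reject);
    worker.postMessage({ key, packets });
  });
}
```

As noted, this moves the work off the main thread without reducing total CPU time, and each batch pays the clone cost for the key, IVs, and payloads.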

@tsightler

Well, at least I wasn't totally off base! I had considered pre-allocating decrypt objects asynchronously in another worker thread, since the counter used for the IV is predictable. I thought that might at least improve on the current inline allocation, but I questioned whether it would really help since, as you noted, it wouldn't lower the total amount of processing.


koush commented Jan 13, 2023

pre-allocating decrypt objects asynchronously in another worker thread since the counter used for the IV is predictable

I believe the OpenSSL EVP functions allow resetting the IV (i.e., reusing the crypto session), but that's not exposed in the Node API, unfortunately.

@cw-rterm

I've been messing around with (what I think is) the send/receive code today (/packages/sctp/src/sctp.ts), mainly adding logging to see what, if anything, I might find. The one thing I did notice is that the queue sends a "burst" and waits until it finishes before sending the next frame of packets. Is this required or expected, or is there a way to add to the queue gradually as packets are sent (rather than waiting for an entire batch to finish)?


koush commented Jan 14, 2023

I'm not familiar with the SCTP implementation details, but there definitely seems to be a 3-second sleep in there (this looks like SCTP's T3-rtx retransmission timer; RFC 4960 specifies an initial RTO of 3 seconds).

this.timer3Handle = setTimeout(this.timer3Expired, this.rto * 1000);


Zubnix commented Feb 26, 2023

I'm still experiencing quite horrible data channel performance with the currently released version. Looking at the implementation, I noticed excessive use of async/await, stemming from an (effectively needless) async send in the UDP ICE transport. Investigating further, I found that timer3Handle (and its related usage) for some reason makes the whole send stream stutter heavily.

As a test, I removed the async/awaits from the SCTP implementation and completely removed timer3Handle. In my own project and subjective testing, sending is now faster than node-webrtc or node-datachannel. Probably because this strips out SCTP's absolutely horrible congestion control and resend mechanism, I guess (which I don't need for my use case anyway).

Diff: develop...Zubnix:werift-webrtc:better-datachannel-perf in case people are interested.

@cw-rterm

Amazing! Can't wait to try out the update


cw-rterm commented Mar 3, 2023

I tried to do this today but ran into all kinds of problems installing the module. It seems that even npm install werift is broken for me.

[screenshot of the npm install error]

To try the patch I ran npm i git+https://github.com/shinyoshiaki/werift-webrtc.git#c18af7100212c8b94ee469b780e4328f33fdb438

It's likely something simple on my end, but do I need to do anything in the node_modules/werift folder after this? yarn and/or yarn build?

Trying to run yarn build produces its own errors:

[screenshot of the yarn build errors]

Anyway, I look forward to trying it out once I get these sorted. If anyone has tips, I would appreciate it!
