
Connection was interrupted while the page was loading #439

Open
jwr opened this issue Oct 31, 2023 · 24 comments

jwr commented Oct 31, 2023

I am looking for help/advice, because I can't track this problem down. Some of my users started running into problems and can't log into the application. I can't reproduce the issue and it's relatively rare: so far two reports out of several thousand worldwide users. It doesn't seem to be browser-related, because these users see problems in both Chrome and Firefox, and on Windows and Linux.

The visible symptom is that there are these messages in the Firefox console:

The connection to wss://... was interrupted while the page was loading.

Here is a screenshot of the Firefox console, from a user. This shows the login flow: there is a 101 request establishing a socket connection (this seems to succeed), then a login POST request, which normally results in the client calling sente/chsk-reconnect! and a new websocket being established. But I have no idea why there are four websocket connection attempts in the screenshot, nor why the errors happen.

[Screenshot: Firefox console during the failed login flow]

And this is the normal, expected flow:
[Screenshot: Firefox console during the normal flow]

This is rare and not reproducible for me. So far the reports are from users with noticeable latencies (450ms and 1.2s RTT). But I can't narrow it down further. The initial server->client message is fairly large for these users, but not the largest (and I've already increased the :max-ws parameter in http-kit). I increased the sente timeout values for messages (though I don't think I can control any timeouts on sente/chsk-reconnect!).

The same failure happens in Chrome, although I don't have a console screenshot.

Any hints or ideas would be much appreciated.

ptaoussanis self-assigned this Oct 31, 2023
ptaoussanis (Member) commented:

@jwr Hi Jan, sorry to hear about the trouble!

Thanks for the detailed report, the screenshots are helpful.

No obvious ideas come to mind yet, though I'm happy to dig more. In the meantime, you mentioned:

Some of my users started running into problems and can't log into the application.

So it sounds like this is a recent development. Has anything possibly relevant changed recently? E.g. updated Sente, http-kit, Ring middleware, etc. If so, that could help narrow down where to look.

jwr (Author) commented Oct 31, 2023

That's the first thing I checked. Absolutely nothing changed server-side recently, and that includes not just the app and libraries, but also the entire server stack. I even rebooted the servers just to be sure. There were no recent changes in the app (ClojureScript) either.

The things that were changing that I know of: user database sizes (the database gets sent after the initial connect, so that can be a factor) as these keep increasing constantly, and obviously user machines (system updates, browser updates, etc).

The database sizes were the first thing I suspected, but one of the users is able to log in and work normally on one machine, but not on several others.

As for browsers and operating systems, these are entirely out of my control and they do change frequently. But if users report problems on multiple machines and browsers, including some that were not updated recently, I would probably look elsewhere.

In other words, trying to track down what changed doesn't lead me anywhere, so I tried understanding where the error message comes from and what produces it. Unfortunately, that search wasn't fruitful either: most other reports I found were old and/or unrelated.

One thing that seems to appear is a 45s interval between retries. I am not sure where that comes from. Could sente have internal timeouts that come into play? But then again, some of my users sometimes load data longer than 45s, so I would have seen this problem earlier. By this point I even started suspecting networks and firewalls along the way.

The only approach I can think of right now is trying to understand "how could this possibly happen".

ptaoussanis (Member) commented:

Absolutely nothing changed server-side recently, and that includes not just the app and libraries, but also the entire server stack. I even rebooted the servers just to be sure. There were no recent changes in the app (ClojureScript) either.

Okay, great - thanks for confirming. That's the ideal situation and should help tree-shake possibilities :-)

One potential explanation that comes to mind then is that something's changed with browser behaviour, but would need to dig further to advise on the likeliest causes.

Do I understand correctly that this isn't head-on-fire urgent? If so, I'll aim to investigate further tomorrow and will update you.

jwr (Author) commented Oct 31, 2023

Thank you for offering to help! No, this is not a total showstopper, because it doesn't affect all connections, just a select few.

I don't think changes in browser behavior are to blame. I gathered information from one of the affected users and:

Location A:
Linux, Firefox 118.0.2: doesn't work
Windows 10, Firefox 116.0.3: doesn't work

Location B:
Linux, Firefox 102.0: doesn't work

Location C (network connectivity via Starlink):
Windows 10, Firefox 118.0.2: works fine
Windows 10, Edge 118.0.2088.76: works fine

This would seem to indicate something network-related or timing-related. But I don't even understand what the error message means: "connection interrupted while the page was loading" — does this mean a connection was established and then interrupted? Was the interruption unexpected or caused by a timeout in the browser? I'm completely baffled.

jwr (Author) commented Oct 31, 2023

Thanks to the kindness of my users, I now have traces of a normal page/app reload and a failed one, in Chrome. There are no console messages appearing in Chrome, but something happens to the websocket, too.

This is what should normally happen:

[Screenshot: Starlink - Reload - Worked]

(the long wait is normal: loading a database can easily take a minute; it seems to have taken about 46s in this case)

Now, this is how a failed page/app reload looks:

[Screenshot: Cellular - Reload - Failed]

The only difference between these two screenshots is the network. Both were taken on the same (Windows) PC with Chrome, minutes apart: the failed one was over a cellular hotspot connection, and the working one was over a Starlink connection.

jwr (Author) commented Oct 31, 2023

After some more debugging, deploying versions with extended logging, and a number of tests, it appears that it is Sente that is killing the websocket connections, specifically because of a ws-ping timeout. On one hand, I feel stupid, because that sounds obvious. On the other hand, my mental model of how ws-ping works was different (I thought pings get sent only on an idle connection, not while waiting for a response), and I thought ws-ping had a default 5s timeout.

I still don't understand why this happens, where the (roughly) 45s timeout comes from (I relied on sente defaults, and I can't find a 45s value anywhere, only 5s), why this issue doesn't affect many more users, or why it recently started affecting the two users that reported it. But I do know right now that adding an explicit :ws-ping-timeout-ms 360000 parameter to my sente/make-channel-socket-client! call makes things work for these two users.
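
For reference, a minimal sketch of what that workaround looks like (the endpoint path and CSRF token below are placeholders, not my actual setup):

```clojure
;; ClojureScript sketch; assumes [taoensso.sente :as sente] is required.
;; "/chsk" and ?csrf-token are placeholders for the app's real values.
(def chsk-client
  (sente/make-channel-socket-client!
    "/chsk" ?csrf-token
    {:type               :ws
     :ws-ping-timeout-ms 360000})) ; 6 minutes, vs. the 5s client default
```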

ptaoussanis (Member) commented:

Hi @jwr, thanks for all the extra info - will investigate now and come back to you 👍

ptaoussanis (Member) commented Nov 1, 2023

Will address all your questions; in the meantime, just want to confirm a few details.

  • The most important problem is that some users could not log in, correct?

  • As part of normal operation, your clients may make a request that'll take the server ~45+ seconds to reply to. Is that correct?

  • This slow request has been getting slower over time due to increased payload size and/or server-side work related to the request. Is that correct?

  • What's the expected cause of the slowness? Network transfer because the payload is very large, or because the server needs to spend time preparing the payload?

  • Roughly how large is the payload?

  • Does the slow request happen automatically right after login, or close to it? How close?

Edit to add:

  • Are you running http-kit behind a proxy of some kind (e.g. nginx, etc.)? If so, what? And have you ruled out possible timeouts or other config on the proxy's side?

Thanks!

danielsz (Collaborator) commented Nov 1, 2023

Without getting into the thick of it, I would like to point the finger at carrier-grade NAT, a practice by ISPs that allows them to share small pools of public addresses among many end users. Some time ago, I was observing constant interruptions of websocket connections. I called my ISP and they told me that they were aware of the problem, explained it was due to carrier-grade NAT, and upgraded my subscription. No more carrier-grade NAT; now I get assigned a public IP address that is not shared (still dynamic though). There were no more interruptions with websocket connections.
In brief, I would first work out with the affected users whether their ISPs implement carrier-grade NAT and take it from there.

ptaoussanis (Member) commented:

@danielsz Hi Daniel, thanks for the extra data point.

Could I ask you to please create another issue describing your experience in more detail? For example - did you see this specifically affect Sente, how did it manifest, etc.? I've not heard of the phenomenon before so any pointers you can give would be handy.

In principle even if it's caused by an ISP - I'd still consider something like that to be a Sente bug since Sente needs to be able to ~gracefully work around problems it's likely to encounter in the real world.

danielsz (Collaborator) commented Nov 1, 2023

Sure, I'll try but that was a long time ago and I can't reproduce the problem to check additional details.

ptaoussanis (Member) commented:

@danielsz Understood, but however little you can remember would be helpful - and it'd be nice to at least have a dedicated issue open so that if anyone else encounters the same thing we can start collecting experiences in one place.

jwr (Author) commented Nov 1, 2023

  • The most important problem is that some users could not log in, correct?

Yes. Although I would not focus on the "log in" part too much — for example the Chrome traces above were for reloads of an authenticated session, so "logging in" did not factor into it.

  • As part of normal operation, your clients may make a request that'll take the server ~45+ seconds to reply to. Is that correct?

Yes. Specifically, when the app loads and there is an authenticated session (so, after log in, or if the session is already authenticated), there will be a "data load" request with a large response, and that can take anywhere from single seconds to more than a minute.

  • This slow request has been getting slower over time due to increased payload size and/or server-side work related to the request. Is that correct?

Yes. Although it is even slower for some other users. So, it isn't like these users crossed a threshold, and the others did not.

  • What's the expected cause of the slowness? Network transfer because the payload is very large, or because the server needs to spend time preparing the payload?

Both, really.

  • Roughly how large is the payload?

That's hard to estimate right now — my guess would be hundreds of kB to single megabytes.

  • Does the slow request happen automatically right after login, or close to it? How close?

Right after login, or when initializing the app, requesting data is one of the first things the app does.

  • Are you running http-kit behind a proxy of some kind (e.g. nginx, etc.)? If so, what? And have you ruled out possible timeouts or other config on the proxy's side?

Yes, it's behind nginx, and I've looked at the timeouts there, but cannot find anything that would be applicable.

Also, from what I understand from the logs after improving logging, it seems that it is Sente that is closing the connection. That's what the "Client ws-ping to server timed-out, will cycle WebSocket now" message would indicate, right?

As to @danielsz's comment, there might be something to it. I had another customer who reported the same problem several weeks ago. I couldn't help him much, but he contacted his ISP, and they changed something, which caused things to work for him again. That would fit the "ISP NAT breaking websocket connections" hypothesis. I don't think this is what we're looking at in these specific two cases, but I think in general this is something that can happen.

EDIT: Also, I've been told that my app does not work in China, from behind their firewall. I haven't investigated this.

ptaoussanis (Member) commented Nov 1, 2023

Edit to add: I sent this before seeing your latest response, please ignore anything irrelevant.

@jwr Hi Jan, to update from my side:

  • I've looked through your report in detail, and refreshed my memory on the relevant code.
  • My current leading hunch is that there's something going on related to that slow (~45s) request.
  • A few ideas come to mind for how that could be problematic in certain cases. I've asked you some questions that should help me narrow it down.
  • Daniel has also floated an idea described here, though I don't believe there's currently evidence to suggest that as a cause in this particular case. I can dig into this possibility more if/when we eliminate other possible causes like the slow request.
  • I'll pause my investigation for now, but will monitor for your response to my questions and continue when I hear back.

In the meantime, to answer a few of your own questions:

One thing that seems to appear is a 45s interval between retries. I am not sure where that comes from. Could sente have internal timeouts that come into play?

I can't think of any obvious source of a 45s interval if you're using Sente's defaults.

The relevant Sente timers I can think of would be:

Server-side:

  • :ws-kalive-ms (default: 25s)
  • :ms-allow-reconnect-before-close-ws (default: 2.5s)
  • :ws-ping-timeout-ms (default: disabled)

Client-side:

  • :ws-kalive-ms (default: 20s)
  • :ws-ping-timeout-ms (default: 5s)

The general logic of the ping behaviour is:

Server-side:

  • Setup a keep-alive loop for each new connection opened
  • Check every ws-kalive-ms if there's been connection activity
  • Noop if yes
  • Otherwise attempt to send server->client ping
  • If ws-ping-timeout-ms is non-nil, allow ws-ping-timeout-ms time for a pong, otherwise close connection

Client-side:

  • Setup a keep-alive loop for each new connection opened
  • Check every ws-kalive-ms if there's been connection activity
  • Noop if yes
  • Otherwise attempt to send client->server ping
  • Allow ws-ping-timeout-ms time for a pong, otherwise close connection

In other words both the server and client will:

  • Monitor for regular connection activity
  • When there's prolonged idleness, try to ping the other side
  • Disconnect if no pong response
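
For orientation, here's a sketch of where the server-side options above are passed (a minimal example assuming Sente's standard http-kit adapter; the values shown just restate the defaults):

```clojure
(require '[taoensso.sente :as sente]
         '[taoensso.sente.server-adapters.http-kit :refer [get-sch-adapter]])

;; Sketch only; these values restate the defaults listed above.
(def chsk-server
  (sente/make-channel-socket-server!
    (get-sch-adapter)
    {:ws-kalive-ms                        25000   ; 25s
     :ms-allow-reconnect-before-close-ws   2500   ; 2.5s
     :ws-ping-timeout-ms                     nil})) ; disabled by default
```

The client-side analogues (:ws-kalive-ms, :ws-ping-timeout-ms) are passed in the opts map of sente/make-channel-socket-client! in the same way.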

Now it's possible you're seeing 45s as a result of some interaction of other timers - but nothing intentional/obvious comes to mind, so I'd first want to rule out timers from other layers in your stack (e.g. nginx, etc.).

On the other hand, my mental model of how ws-ping works was different (I thought they get sent only on an idle connection, not while waiting for a response)

Pings are only sent when idle, but when a ping is sent and gets no response, that's taken as a signal that something's wrong with the connection.

I.e. pings are used to distinguish between:

  1. The connection is healthy and just hasn't had any activity
  2. The connection is broken, and that's why it seems it hasn't had any activity

Note that the treat-missing-pong-as-disconnected logic is only currently enabled by default for client->server pings, not server->client pings.

The latter was only added in a recent version of Sente, and for reasons explained here I didn't want to enable this by default yet. It can be enabled manually.

But I do know right now that adding an explicit :ws-ping-timeout-ms 360000 parameter to my sente/make-channel-socket-client! call makes things work for these two users.

Interesting. Just to confirm:

  • Are we still talking about failure to log in here? I.e. with the default :ws-ping-timeout-ms the users cannot log in, but with :ws-ping-timeout-ms 360000 they can?

  • If the default client is killing connections due to pong failure, that seems to imply that the server is failing to reply to a pong request in the default 20s. Is it possible that your server is overloaded from handling the slow requests and therefore pong responses are being delayed? You mentioned that you're using http-kit - what version, and how many server threads do you have configured?

ptaoussanis (Member) commented:

Great, thanks for the answers 👍

My current sorted hunches would be:

  1. http-kit is getting starved of available thread workers by the long response handler, leading to client->server pings going unanswered, leading to client-initiated disconnects.

  2. nginx has some sort of hidden timeout that's interfering.

We could rule out (1) based on your http-kit server version and/or config (notably thread count).

And we could rule out (2) if you could maybe share the relevant parts of your nginx config. (Feel free to email if there's anything in there you'd rather not post publicly).

If we can rule out both, I'll continue down the chain.

jwr (Author) commented Nov 2, 2023

I use http-kit 2.7.0, mostly with defaults. The only parameters to http-kit/run-server are increased :max-body and vastly increased :max-ws (I normally use 64MB, I increased this to 128MB for testing now). As for nginx, that could indeed be a factor. But I think we should concentrate on the client side and sente, because I managed to reproduce the issue locally.

I can reproduce the problem when running my app locally (http-kit only, no nginx proxying, single client connection) and connecting with Chrome with network throttling set to "Fast 3G". Here are the relevant logs, edited for clarity, note that the log includes both server-side and client-side:

2023-11-02T16:04:59.926Z INFO 2023-11-02T16:04:59.899Z INFO [taoensso.sente:1138] - Client chsk now open
2023-11-02T16:05:00.302Z DEBUG [events:1066] - event :db/setup-changefeed
2023-11-02T16:05:42.830Z DEBUG 2023-11-02T16:05:39.297Z DEBUG [taoensso.sente:1608] - Client will send ws-ping to server: {:ms-since-last-activity 38941, :timeout-ms 5000}
2023-11-02T16:05:42.854Z DEBUG [events:1129] - :chsk/ws-ping
2023-11-02T16:05:44.329Z DEBUG 2023-11-02T16:05:44.304Z DEBUG [taoensso.sente:1618] - Client ws-ping to server timed-out, will cycle WebSocket now
2023-11-02T16:05:44.902Z INFO 2023-11-02T16:05:44.892Z INFO [taoensso.sente:1138] - Client chsk now open
2023-11-02T16:05:45.132Z DEBUG [events:1066] - event :db/setup-changefeed
2023-11-02T16:06:24.334Z DEBUG 2023-11-02T16:06:24.314Z DEBUG [taoensso.sente:1608] - Client will send ws-ping to server: {:ms-since-last-activity 39151, :timeout-ms 5000}
2023-11-02T16:06:24.351Z DEBUG [events:1129] - :chsk/ws-ping
2023-11-02T16:06:29.344Z DEBUG 2023-11-02T16:06:29.320Z DEBUG [taoensso.sente:1618] - Client ws-ping to server timed-out, will cycle WebSocket now
2023-11-02T16:06:29.933Z INFO 2023-11-02T16:06:29.914Z INFO [taoensso.sente:1138] - Client chsk now open
2023-11-02T16:06:30.163Z DEBUG [events:1066] - event :db/setup-changefeed

What seems to be happening is that the socket gets opened, and my software immediately sends a :db/setup-changefeed event. That gets processed asynchronously server-side (e.g. the setup-changefeed event itself only gets a short OK response). The processing takes several seconds here, and the data is sent in a single message (using send-fn, as a server->client message) to the client over the throttled network connection. So, after about 6-7 seconds the server started sending the message to the client.

After about 40s (note the client-side timestamps can be different from server-side ones) sente sends a ws-ping message, which is received on the server. And 5s later, once the timeout-ms elapses, sente terminates the websocket connection and reopens it again. Presumably, the ws-pong response was generated on the server, but could not make it to the client within those 5 seconds, because the websocket connection was still busy transmitting the large server->client message that was sent several seconds after the first :db/setup-changefeed event.

So, the 45s interval comes from the sum of ms-since-last-activity and timeout-ms (roughly 39s + 5s ≈ 44s in the log above).

After that, the cycle repeats — and the data load never completes, because it never has the chance to arrive in full.

There is still much that I do not understand here. I don't understand the sente concept of an 'idle connection'. And my mental model of a sente connection and pings was incorrect (though to be honest I never gave it much thought): I thought of a sente connection like a TCP connection, where "activity" is defined as any data bytes being sent or received. In other words, I thought a sente connection that is receiving data would be "active".

I also do not understand why this only came up recently. I have many users with much longer load times. Somehow this interaction does not always come into play.

I think with the current behavior of ws-ping and my usage patterns (requests with potentially large responses over slow networks), I can't use the keepalive mechanism at all. I can't think of a reasonable timeout value here, other than a large one like several minutes, which I think defeats the purpose of the mechanism.

I hope this moves us forward! I also hope some of this can result in an improvement to Sente for everyone.

ptaoussanis (Member) commented:

Hi Jan,

I use http-kit 2.7.0, mostly with defaults. The only parameters to http-kit/run-server are increased :max-body and vastly increased :max-ws (I normally use 64MB, I increased this to 128MB for testing now).

That will be a problem if you've got slow synchronous handlers. http-kit 2.7.0 only allocates 4 worker threads by default, and so can easily become starved of threads in this case.

If that happens, it won't be able to respond to client ping requests - causing clients to disconnect.

Would suggest you set http-kit server's :thread option to something like 64, or more (depending on your core count). Alternatively, the current 2.8 beta will try to select a reasonable default based on core count, or use virtual threads if you're on Java 21+.
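
For illustration, a minimal sketch of that config (the handler name and port are placeholders):

```clojure
(require '[org.httpkit.server :as http-kit])

;; Sketch: my-ring-handler stands in for the app's actual Ring handler.
;; run-server returns a function that stops the server when called.
(def stop-server!
  (http-kit/run-server my-ring-handler
                       {:port   8080
                        :thread 64})) ; worker threads for synchronous handlers
```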

But I think we should concentrate on the client side and sente, because I managed to reproduce the issue locally.

👍

After about 40s (note the client-side timestamps can be different from server-side ones) sente sends a ws-ping message, which is received on the server. And 5s later, once the timeout-ms elapses, sente terminates the websocket connection and reopens it again.

This is the problem that I'm pointing out above. Your http-kit server should be able to reply with a pong if it's not thread-starved.

[...] but could not make it to the client within those 5 seconds, because the websocket connection was still busy transmitting the large server->client message that was sent several seconds after the first :db/setup-changefeed event.

Unless your payload is very large and your connection very slow, thread starvation seems like a much more likely cause to me. I.e. my hunch is that your slow Ring handler isn't spending the majority of its time on WebSocket IO but on preparing the response.

Would recommend some simple profiling to be sure. Tufte is one option, but some simple ad-hoc (System/currentTimeMillis) timing should suffice in your case.
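
Something along these lines would do (a sketch; build-changefeed-payload and send-payload! are hypothetical stand-ins for your own steps):

```clojure
;; Ad-hoc timing sketch to split response time into "prepare" vs "send" phases.
;; build-changefeed-payload and send-payload! are hypothetical stand-ins.
(defn timed-changefeed-setup! [ring-req]
  (let [t0      (System/currentTimeMillis)
        payload (build-changefeed-payload ring-req) ; server-side work
        t1      (System/currentTimeMillis)
        result  (send-payload! payload)             ; hand-off to WebSocket IO
        t2      (System/currentTimeMillis)]
    (println "prepare-ms:" (- t1 t0) "send-ms:" (- t2 t1))
    result))
```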

I also do not understand why this only came up recently. I have many users with much longer load times. Somehow this interaction does not always come into play.

One possible explanation would be that as your concurrent user count and/or Ring handler costs have increased, you're running into thread starvation more often.

Slow or flakey connections may be especially sensitive since they'll have the additional network delay to contend with.

I think with the current behavior of ws-ping and my usage patterns (requests with potentially large responses over slow networks), I can't use the keepalive mechanism at all. I can't think of a reasonable timeout value here, other than a large one like several minutes, which I think defeats the purpose of the mechanism.

I don't believe that your usage pattern should be a problem for the ping behaviour. You might want to tweak the client-side :ws-ping-timeout-ms to something like 10s for an extra safety margin in the case of very large payloads. But if my hunch is correct, that shouldn't even be necessary.

My advice would be to try bumping http-kit server's :thread option to at least 64 and/or adding some profiling to your slow handler/s to better understand what proportion of response time is actually network IO (I suspect not much, but it'd be nice to confirm).

Please let me know how that goes.

jwr (Author) commented Nov 2, 2023

Well, now that I have the problem reproducible, testing this hypothesis is easy. I added :thread 64 to the http-kit/run-server option map. It did not change anything in the behavior.

I would be surprised if it did: right now I am testing in a local setup, so there is a dedicated http-kit server with a single Chrome client. That single client downloads some static content and then opens a single websocket connection. There are no other clients connecting and no other traffic. I wouldn't expect that to lead to thread starvation.

Looking back at how ws-ping works, I am not sure how we can expect the ping response to make it back in time to the client, if the network is slow and the websocket connection is busy transmitting a large amount of data. If the response is stuck behind, say, several megabytes of data, and transmitting that data takes longer than 5s, it is going to time out, right?

I'm looking at this line in Sente: https://github.com/taoensso/sente/blob/a51a54a6d0372e7284e0c322b2c75e3804dbe1f8/src/taoensso/sente.cljc#L1511C25-L1511C63 — it seems that this (reset! udt-last-comms_ (enc/now-udt)) will happen upon receiving a complete message. In my case, the server->client message is large, so the client sits there and thinks that the connection is inactive. It isn't, it's just that a single large message is being transmitted. If I understand correctly, that's what causes Sente to send a ws-ping after about 40s of "inactivity", and the response to that ping gets delayed on its way back, behind all the data that is being transmitted, but which Sente hasn't yet heard about, at least as far as udt-last-comms_ is concerned. After 5s, Sente kills the connection. Does this explanation make sense?

ptaoussanis (Member) commented:

Hi Jan, your explanation makes sense - thanks for all the work debugging 👍 I realise this is time away from your business, so probably frustrating.

I am testing in a local setup, so there is a dedicated http-kit server with a single Chrome client. That single client downloads some static content and then opens a single websocket connection. There are no other clients connecting and no other traffic. I wouldn't expect that to lead to thread starvation.

👍 Though I'll note that even a single webpage can issue multiple HTTP requests to different endpoints. Since we're talking about only 4 threads, it's not too difficult to get starved if there are expensive endpoints being hit.

Looking back at how ws-ping works, I am not sure how we can expect the ping response to make it back in time to the client, if the network is slow and the websocket connection is busy transmitting a large amount of data. If the response is stuck behind, say, several megabytes of data, and transmitting that data takes longer than 5s, it is going to timeout, right?

A lot depends on how large the data and how slow the connection. If possible, it'd really be helpful to get some real numbers. Could you maybe check on the payload size in your tests demonstrating the problem? (Again, assuming I'm not missing some difficulty in checking that number).

As an example, let's say a typical payload is 2MB and we're on a 1Mbit/sec connection. That'd mean ~16 secs to do the transfer.

Will that cause a disconnection? It depends on when the request is sent.

The worst case with default options is:

  • The connection has been idle for 23 seconds
  • The large transfer is started at t=24 seconds, and will take 16 seconds to complete
  • The client issues ws-ping at t=25 seconds and expects a pong by t=30 seconds

I'd expect that to disconnect since the transfer will be in flight during the precise period that a pong is expected.

But if the payload is 10MB on the same connection, then it doesn't even matter when the request is sent - since the transfer time (80 seconds) will certainly overlap the pong window and lead to a disconnect.

If you are potentially talking about payloads of this kind of size (and/or connections this slow), then that definitely sounds like the source of trouble. My first recommendation in that case would be to move the large payloads off Sente entirely.

The big benefit of Sente/WebSockets is the ability to easily have ~bidirectional real-time comms. It had actually not crossed my mind before that someone might use a Sente channel for large data transfers so I hadn't considered the implications.

It might work, to a point - but your example does highlight one of the issues. You might be able to try tuning the timeouts, etc. - but even in the best case you'd still ultimately be tying up your WebSocket channel for no benefit.

I'd recommend instead using your Sente channel only for small data (max transfer of a few seconds), and for signalling. E.g. the server could signal to the client that it should request payload X via Ajax, then the client can make that a separate request and leave Sente's channel open for notifications, etc.

My own applications always use a mix of Sente and Ajax, since with Ajax you also have all the usual benefits of response caching, etc.

Does that make sense? Would that be viable in your case? If not, please let me know why and I'll consider alternative ideas.

If it's any help, there's a convenient Ajax util in Sente for this.
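
A sketch of what that signal-then-fetch pattern could look like on the client (the event shape, URL, and load-snapshot! below are hypothetical app-level names):

```clojure
;; ClojureScript sketch; assumes [taoensso.sente :as sente] is required.
;; The server pushes a small [:data/ready {:url ...}] event, and the client
;; then fetches the actual payload over a normal Ajax request.
(defn on-data-ready
  [{:keys [url]}] ; e.g. {:url "/api/changefeed-snapshot"}
  (sente/ajax-lite url
    {:method :get :timeout-ms 120000}
    (fn [resp]
      (when (:success? resp)
        (load-snapshot! (:?content resp)))))) ; load-snapshot! is app-specific
```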

I'll note: Sente's documentation definitely should make it clear to avoid large data transfers. I'm really sorry about the oversight! I'll get the documentation updated tomorrow.

As an aside: I would strongly recommend keeping the higher http-kit thread count, since that's undoubtedly going to lead to trouble at some stage even if it wasn't the cause of the trouble here.

ptaoussanis (Member) commented:

Hi Jan, some updates:

  • I've opened an issue re: large data transfers.
  • I've added documentation warning about the limitation and explaining the suggested workaround.

Next time I'm on batched Sente work, I'll pursue the other items on the checklist.

jwr (Author) commented Nov 4, 2023

To provide some context: I don't really have the option of sending data via different channels. The whole point of using Sente in my application is to tie the client application to RethinkDB changefeeds. Here is a somewhat simplified explanation: when a user logs in, a changefeed is set up in the database for that user's data. That changefeed receives the initial data and then all subsequent changes. That changefeed is also tied to the Sente websocket connection. This needs to be transactional: you get all the data as of a certain point in time, and then get all the changes to that data. There is no way to safely and correctly do this in two separate operations.

Of course things are much more involved than what I described (multiple changefeeds per user, etc.), but the general concept holds.

I am working on a rewrite that will use FoundationDB instead of RethinkDB. Given all the downsides of websockets I plan to stop using them altogether in the future. FoundationDB lets me implement similar safe transactional changefeed functionality using a distributed database, but also without the burden of persistent database changefeed connections or persistent websockets. Polling architectures are generally simpler and more resilient, so that's what I plan to move to.

In the meantime, I will keep increased ws-ping timeouts, and also look at splitting the large data message into smaller ones, which should be doable. This will result in more frequent updates to Sente's activity concept (udt-last-comms_).
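
To make the splitting idea concrete, a rough server-side sketch (chsk-send! is Sente's server-side send-fn; the event id and chunk size are made up):

```clojure
;; Sketch: send a large snapshot as a series of smaller Sente messages so that
;; each chunk arrives as a complete message and counts as connection activity.
(defn send-snapshot-in-chunks!
  [chsk-send! uid rows]
  (let [chunks (partition-all 500 rows) ; chunk size is illustrative
        total  (count chunks)]
    (doseq [[idx chunk] (map-indexed vector chunks)]
      (chsk-send! uid [:db/snapshot-chunk {:index idx
                                           :total total
                                           :rows  (vec chunk)}]))))
```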

I keep thinking that what would solve the problem right away would be a way to update udt-last-comms_ based on network activity (e.g. bytes received) instead of after a complete message. Is that possible?

ptaoussanis (Member) commented:

Thanks for the extra info. I'm not familiar with RethinkDB so can't comment in detail, but it does sound unfortunate that you seem to have such limited control over how data is sent.

Given all the downsides of websockets

Just to make sure we're on the same page: besides the unsuitability for large data transfers, what downsides do you have in mind?

In the meantime, I will keep increased ws-ping timeouts, and also look at splitting the large data message into smaller ones, which should be doable.

Well, if it's possible to split the large data message into smaller ones, that would certainly help the present issue.

Polling architectures are generally simpler and more resilient, so that's what I plan to move to.

I'm not entirely clear on what you're comparing here, but in case it's relevant - just double-checking that you're aware that you can also disable WebSockets in Sente and just run it over long-polling?

I keep thinking that what would solve the problem right away would be a way to update udt-last-comms_ based on network activity (e.g. bytes received) instead of after a complete message. Is that possible?

Not with a WebSocket as far as I'm aware, but I haven't looked into it in detail.

jwr (Author) commented Nov 6, 2023

I'm not familiar with RethinkDB so can't comment in detail, but that does sound unfortunate that you seem to have such limited control over how data is sent.

I do have control over how data is sent (I'm the one sending it), but I have to worry about correctness. Writing chat apps is easy, writing ERP apps less so :-) The key here is that the client needs to get the full data (up to a certain point in db-transactional-time) and then a stream of subsequent changes. On the server, that's a single "establish a changefeed" database operation. I can either map it roughly 1:1 to a client websocket connection by just dumping the data over the connection to the clients, or maintain a (costly and complex) system for caching that information and providing it to clients over AJAX calls.

The RethinkDB changefeed system solves a difficult problem really well, and together with Sente was a good solution in my case for more than 8 years now. Unfortunately, RethinkDB did not become fashionable (unlike the substantially worse MongoDB which it was often compared to), and it doesn't get much development anymore. That's why I'm planning to migrate to another database.

Another way to approach this kind of problem is a bi-temporal database (get db state to a specified point in time, then poll for changes afterwards). Or any database with a data model that allows for detecting changes after a point in time in a correct way, which is what I'm working towards with FoundationDB. But my current database does not allow me to ask for changes up to a point in time and then get updates after that point in time in a performant and transactionally correct way.

Just to make sure we're on the same page: besides the unsuitability for large data transfers, what downsides do you have in mind?

Multiple things:

  1. Websockets are a second-class citizen in the web world. In-browser developer tools treat them as totally opaque, nginx added support, but it isn't always clear which configuration options (such as timeouts) apply to them. Few people use them, and it's difficult to debug problems. They are also only half-implemented: for example RFC 6455 specifies the ping/pong control messages (https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.2), but they don't seem to be implemented or supported anywhere.
  2. Browser authors don't care about websockets. See for example https://bugzilla.mozilla.org/show_bug.cgi?id=858538 and https://bugzilla.mozilla.org/show_bug.cgi?id=896666 for an 11-year-old problem that has been affecting me directly, with no resolution in sight. Nobody cares.
  3. Since they aren't popular, buggy firewalls and proxies sometimes interfere.
  4. Websockets maintain a lot of state on both sides of the connection. Stateless solutions are always simpler.
  5. Apps with authentication quickly get complex. It is necessary to maintain auth status for two separate entities, one of which (the websocket) also changes state asynchronously. This is relatively manageable for a simple app with a login, but gets complex quickly if you also want a demo with autologin and anonymous connections. I have an entire "connection engine" which I've rewritten several times over the years, and I haven't managed to make it any simpler, or more reliable. And to this day 403 responses to socket establishment are not handled properly in my app and I don't have the time to dive in and debug this (I thought sente/make-channel-socket-client would return nil, but it doesn't).
  6. EDIT: Forgot to add: lack of compression support. My app could really benefit from data compression, but I can't just use the transparent compression that nginx does for everything else. Using transit improves things slightly, but compression would change a lot.

This is why, as I'm redesigning the data model to take advantage of the incredible features that FoundationDB offers, I am also making sure that I will be able to move to a simple polling model. If the cost of checking for database changes is nearly 0, polling an endpoint is a great solution and would let me get rid of a lot of complexity. Yes, I am aware that I can use Sente with long-polling only, but much as I like Sente, I'd rather not use it if I can, like with every other piece of code. It wouldn't bring many advantages in that case.

Now, trying to slowly wrap this up:

  • I increased the values for :ws-kalive-ms and :ws-ping-timeout-ms.
  • I increased the http-kit thread count, which was an excellent suggestion (thank you!) and while it probably wasn't a factor here, is a good idea in any case.
  • I took a close look at nginx proxy timeouts, which were not a factor here, but deserved a closer look. Ended up adjusting proxy_read_timeout and keepalive_time. It isn't entirely clear if keepalive_timeout applies to websockets; my conclusion was that it doesn't.
  • I looked at splitting the data into smaller chunks, which is possible, but I don't think it's worth the effort: the only goal would be to work around Sente's ping mechanism limitations.
  • In the long run, I plan to stop using websockets anyway, so I'd rather not invest a lot of effort into adapting to their limitations.
  • I still don't fully understand why this problem started appearing recently and did not affect more of my users.
  • I do not see a bug in Sente to be "fixed". A warning about large data transfers interfering with the ping mechanism is fine.
  • One thing that came to mind and could be relatively straightforward to implement would be to make :ws-ping-timeout-ms (and perhaps :ws-kalive-ms) dynamically adjustable. I would use a large timeout value initially, then adjust it after the initial data load, thus still keeping the keepalive mechanism useful and responsive.

ptaoussanis (Member) commented Nov 6, 2023

Hi Jan, thanks for the detailed and thoughtful reply - that definitely helps me understand if there's anything I can improve on Sente's end (or if there's anything constructive I can suggest).

In short: your current plan sounds reasonable to me given what I understand of your architecture and objectives.

https://bugzilla.mozilla.org/show_bug.cgi?id=858538

This is good to know about, thanks for the link!

I increased the http-kit thread count, which was an excellent suggestion (thank you!) and while it probably wasn't a factor here, is a good idea in any case.

In case it's helpful while you're redesigning things, I'll note that http-kit isn't particularly well suited to large data transfers in general. One example: its IO is single threaded, so there's a limit to how much data can ultimately be served by a single instance.

In my experience that limit is rarely hit in real-world applications, but it can happen if you have high user loads with high IO per user. This is most likely in cases where you have large servers (>16 core) where it's possible to produce sustained heavy IO that dominates over CPU load.

But a lot depends on the particular load type and application architecture. I've recently added a very thorough benchmark suite to http-kit which may be informative if this is ever something you'd want to explore.

Ended up adjusting proxy_read_timeout and keepalive_time

My main concerns would be proxy_read_timeout, proxy_send_timeout, proxy_connect_timeout, and proxy_buffering. I too believe keepalive_time shouldn't be relevant for WebSocket connections.

One thing that came to mind and could be relatively straightforward to implement would be to make :ws-ping-timeout-ms (and perhaps :ws-kalive-ms) dynamically adjustable

In principle that'd be easy to do, but would push a fair amount of incidental complexity to application authors that I'd prefer to avoid. If there were a desire to officially support large payloads through Sente, I think my first inclination would be to add auto chunking on Sente's side.

But ultimately:

much as I like Sente, I'd rather not use it if I can, like with every other piece of code. It wouldn't bring many advantages in that case.

I'm 100% in support of this conclusion. If you're not benefitting from the specific advantages that Sente offers, then far better to remove it from your stack. The less software you can run the better 👍

Now, trying to slowly wrap this up

Feel free to close if you're satisfied with your current workaround 👍

And feel free to ping any time if you have other questions or if there's some other way that I can assist.
