
Websocket failing to communicate after 20 seconds #435

Open
butzopower opened this issue Sep 2, 2023 · 9 comments

Comments

@butzopower

I'm completely at a loss as to this behavior and wondering if there's any insight into this issue.

Using sente 1.19.2 on the server with a JavaScript client, I've been having issues where an open websocket either fails to send messages, or the server fails to receive/process them, after about 20 seconds. When using sente on the client, this usually manifests as a ping failing to get a pong response, after which the client reconnects because ws-ping-timeout-ms is reached.

But I've now taken sente on the client completely out of the picture and am using the plain browser WebSocket API in the Chrome console:

socket = new WebSocket("wss://XXXX/chsk?client-id=XXXX&csrf-token=XXXX");
socket.onmessage = function(e) { console.log(e.data); };
socket.onerror = function(e) { console.warn(e); };
setInterval(function() {
    const message = `[[:chsk/ws-ping] "${Date.now()}"]`;
    console.log(`sending message: ${message}`);
    socket.send(message);
}, 1000);

And still the same behavior happens: for about 20 seconds the pings receive responses, then nothing, until eventually the socket closes:

VM1657:2 [[:chsk/handshake ["butzopower" nil]]]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608078835"]
VM1657:2 ["pong" "1693608078835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608079835"]
VM1657:2 ["pong" "1693608079835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608080836"]
VM1657:2 ["pong" "1693608080836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608081837"]
VM1657:2 ["pong" "1693608081837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608082835"]
VM1657:2 ["pong" "1693608082835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608083834"]
VM1657:2 ["pong" "1693608083834"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608084834"]
VM1657:2 ["pong" "1693608084834"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608085835"]
VM1657:2 ["pong" "1693608085835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608086837"]
VM1657:2 ["pong" "1693608086837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608087837"]
VM1657:2 ["pong" "1693608087837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608088836"]
VM1657:2 ["pong" "1693608088836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608089836"]
VM1657:2 ["pong" "1693608089836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608090835"]
VM1657:2 ["pong" "1693608090835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608091837"]
VM1657:2 ["pong" "1693608091837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608092833"]
VM1657:2 ["pong" "1693608092833"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608093836"]
VM1657:2 ["pong" "1693608093836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608094835"]
VM1657:2 ["pong" "1693608094835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608095836"]
VM1657:2 ["pong" "1693608095836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608096837"]
VM1657:2 ["pong" "1693608096837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608097833"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608098834"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608099837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608100834"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608101832"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608102833"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608103832"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608104837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608105837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608106836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608107836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608108837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608109837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608110837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608111837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608112837"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608113835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608114836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608115835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608116835"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608117834"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608118836"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608119832"]
VM1657:6 sending message: [[:chsk/ws-ping] "1693608120833"]
VM1657:7 WebSocket is already in CLOSING or CLOSED state.
VM1657:6 sending message: [[:chsk/ws-ping] "1693608121833"]
VM1657:7 WebSocket is already in CLOSING or CLOSED state.
VM1657:6 sending message: [[:chsk/ws-ping] "1693608122835"]
VM1657:7 WebSocket is already in CLOSING or CLOSED state.

I've been able to replicate this consistently without fail; it always happens after roughly the same amount of time, 20 seconds +/- 1 second.

What's even weirder to me is that if I use the same request headers in Postman, I never run into this issue: I can continue to ping the server indefinitely and always get a pong response back. So I'm also at a loss as to how this might relate to behavior on the server side of things.

I've switched all current users to forced :ajax connections while this issue persists, and aside from the classic issues with long polling, things have been mostly working. So the problem does seem isolated to websocket connections specifically.

@ptaoussanis ptaoussanis self-assigned this Sep 2, 2023
@ptaoussanis
Member

@butzopower Hi Brian-

Using sente 1.19.2 for server with a JavaScript client

Could you please clarify what you mean by "with a JavaScript client"?

Are you able to reproduce the problem with the reference example project?

Thanks!

@butzopower
Author

@ptaoussanis Sorry, I meant that the application client is using sente via cljs.

@butzopower
Author

Unfortunately I haven't been able to reproduce this issue locally at all; everything just works fine. What's bizarre is that it does work in production, but only for a very short window, and it fails only in the browser, not in Postman.

@ptaoussanis
Member

I'm finding it a bit difficult to understand your issue as described, sorry.

It'd be helpful to start with more of a high-level picture before delving into Postman or manually-constructed JS requests.

- What's the issue that you're seeing, exactly? You're seeing an unexpectedly large number of automated Sente reconnections, but only in production? How were you alerted to this? Is there relevant log output you could share?

- Is production otherwise behaving as expected?

- Assuming the issue occurs only in production, is there anything significantly different between your dev and production environments? E.g. does production involve a proxy or something similar?

- Is there any way to run the reference example project in some kind of production-like environment to see if the issue is reproducible there?

@butzopower
Author

Weirdly, I determined this was likely due to the ISP filtering traffic (oddly, after 20 seconds), as I was able to corroborate the issue with (and only with) users on the same ISP. I've confirmed the ISP has since stopped filtering, and everything is back to normal.

Sorry for the misreport; it was a very weird (and annoying :)) scenario to be debugging.

@ptaoussanis
Member

@butzopower Hi Brian, thanks for the update.

Could you please share some more details about what kind of filtering the ISP was doing? I'd like to understand if Sente could do anything better in situations like this.

Thanks

@ptaoussanis ptaoussanis reopened this Oct 3, 2023
@ptaoussanis
Member

@butzopower Hi Brian, just pinging about this - would appreciate your feedback on this if possible. Thanks!

@butzopower
Author

@ptaoussanis Sorry for such a delayed update on this.

The ISP I was on was doing some kind of filtering on the CloudFlare IPs for our application in production - it required hitting the IP directly over HTTP to actually see any notification that there was an issue.

How this manifested in Sente was much weirder: my client would establish a connection with the production server via wss and would be able to send and receive data. Then, after roughly 20 seconds, any outbound messages to the server seemed to stop sending. Inbound broadcasts (i.e. messages that weren't replies to outbound ones) still came in, until either Sente closed the connection due to the ping/pong keepalive failing, or Chrome determined on its own that the websocket connection had died (the latter usually happened when enough broadcasts came through to keep resetting the ping/pong timer). We'd configured Sente to reconnect in this case, so it would, and it would again work for 20 seconds or so, and the cycle would just repeat.

I'm really not sure what the ISP was doing in this case; it seems like it might have simply been deciding to terminate a protocol it didn't understand after 20 seconds (while immediately terminating protocols it did understand).

In terms of discovering the underlying issue: once I was able to reproduce it with a plain WebSocket, without Sente, I realized I needed to look further out, and I eventually happened to discover the cause when hitting the IPs directly. I'm not sure how this kind of anomaly could have been detected more easily.

To make matters even more confusing, I was able to maintain an unbroken connection to the server when using Postman's WS client - I have no idea why, or what about the way it creates connections led to them not being terminated by the ISP.

I think this was mainly a perfect storm of unfortunate circumstances, not least of which was this anomaly leading to confusing reports to the rest of the people trying to debug! I did get to learn a lot about some of the inner workings of Sente along the way, which is still time well spent I think :) but I'm not sure what additional debugging the library could have provided when the issue was so far removed from it.

@ptaoussanis
Member

ptaoussanis commented Nov 12, 2023

@butzopower No worries Brian, thanks so much for the comprehensive info - that's very helpful 🙏 I'll look into this further next time I'm doing batched work on Sente 👍
