refactor(snownet): only send relay candidates if hole-punching fails #4268
Conversation
Still need to test this properly.
This bump includes a fix for a panic on unknown interfaces (algesten/str0m#493). That panic is what is currently blocking #4268.
Force-pushed from 691f8d5 to 144ed32
Performance Test Results: TCP / UDP
Force-pushed from 144ed32 to 2de7ade
Force-pushed from 2de7ade to ebbd9f9
I'm probably not understanding something, but can't we send relay candidates at the start and just wait 2 seconds to use them if hole-punching fails? I.e. do this in parallel rather than adding 2 seconds to every connection that's stuck on a relay.
When that happens, it'll likely be for an entire org or a user's home office or something.
The difference here is only the RTT of sending the candidate to the portal, right? I'd expect that to be on the order of 50 ms or less in total. Is that really worth it? Or what do you mean by "use" them? As soon as I send a relay candidate to the gateway, it will test its connectivity and might nominate it, which might lead to #4290 if the direct connection fails somehow. I can temporarily hold it on the gateway side, but that isn't any different from holding it on the client side, modulo the RTT of the websocket connection.
Do what in parallel? Note that most connections take a second to be established (a good chunk of which is spent on setup with the portal, such as requesting gateways). We can reduce the delay to 1s, but that could lead to some connections still testing relay candidates even though the direct connection is about to succeed. Also note the comments in #4164.
Force-pushed from ebbd9f9 to 129e175
Ah, got it. Candidates have already been gathered; this just pauses the final step.
Force-pushed from 129e175 to 7280b23
@conectado This is ready for review. I tried to make meaningful commits.
for connection in self.connections.established.values_mut() {
    // Reset the creation timestamp on re-connect so a fresh
    // relay-candidate grace period applies to the (new) connection.
    connection.created_at = now;
}
This is kind of critical: upon re-connect, we need to reset this timestamp to ensure we have a new grace period and don't immediately emit all (new) relay candidates. Once #4585 is merged, I'll try to add a unit test for this.
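To make the intended behaviour concrete, here is a small self-contained model of that reset; the `Connection` struct, `may_emit_relay_candidates` method, and `GRACE_PERIOD` constant are illustrative stand-ins, not the actual snownet types:

```rust
use std::time::{Duration, Instant};

// Illustrative stand-ins, not the actual snownet implementation.
const GRACE_PERIOD: Duration = Duration::from_secs(2);

struct Connection {
    created_at: Instant,
}

impl Connection {
    /// Relay candidates may only be emitted once the grace period has elapsed.
    fn may_emit_relay_candidates(&self, now: Instant) -> bool {
        now.duration_since(self.created_at) >= GRACE_PERIOD
    }
}

fn main() {
    let start = Instant::now();
    let mut conn = Connection { created_at: start };

    // 3s into the connection, relay candidates would be allowed ...
    assert!(conn.may_emit_relay_candidates(start + Duration::from_secs(3)));

    // ... but a re-connect resets `created_at`, so a fresh grace period
    // applies and relay candidates are held back again.
    conn.created_at = start + Duration::from_secs(3);
    assert!(!conn.may_emit_relay_candidates(start + Duration::from_secs(4)));
}
```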
I don't understand how this fixes #4290. I remember that we have priorities set where we always pick a direct connection over a relayed one, and str0m should test all candidate pairs regardless of the order in which they arrive?
Code looks good, though I'm not sure whether this will feel a bit less responsive when a direct connection isn't possible. Could 2 seconds be too high a timeout?
I can remove the "resolves" again and close the other one as not reproducible if you think that is more correct? In theory, ICE should always pick the best one, yes. I don't know why we sometimes don't do that. What I do know is that we currently do a lot of unnecessary work, like repeatedly allocating channels on relays and testing their connectivity, when in most cases a direct connection is possible. We can optimise this in the future by eagerly detecting and remembering our NAT status; see #4060.
FWIW, I haven't seen this happen ever since we fixed the Gateway and Phoenix Channel ref bugs. That could have fixed this by virtue of ensuring all candidates arrive reliably. Another issue I haven't seen reproduce since the above was fixed is #4058. This PR may be fixing an issue that no longer exists.
I've removed the 2nd "resolves". With that taken into account, this PR is primarily an optimisation to avoid wasteful allocation of channels on relays. Not only does that create a lot of chatter in the logs, it also puts an unnecessary limit on how many connections a single gateway can handle, because there is no way of expiring a channel binding and we currently bind dozens of them on every connection. With this in place, I would also like to remove the following optimisation:
firezone/rust/connlib/snownet/src/node.rs Lines 205 to 220 in 31eec1a
This currently makes assumptions about which candidates are allowed to talk to each other. I do think it is sound, but I would like to not make assumptions about network topologies. I think this optimisation makes much more sense: we should first try to hole-punch, and if that doesn't succeed, try the relays.
yeah :)
Done!
Yeah, doing this time-based isn't ideal. I'd like to work on #4060, which would allow us to not pay this time penalty when we've previously detected that we are behind a restrictive NAT.
Those will get seeded in once the connection is accepted.
Force-pushed from 7280b23 to f601ed2
I think the tests are currently failing for the same reason as algesten/str0m#496. The non-roaming party doesn't know that the candidate was invalidated and thus keeps sending messages to it.
I obviously don't know the full story here, but from a bird's-eye perspective this appears to work against the spirit of the ICE agent. The whole point of trickle ICE is to be able to start connectivity checks early, before the complete enumeration of NICs and relays has finished. It's a "throw it all at the wall and see what sticks" kind of approach where everything is sent as soon as it's found. Putting in a deliberate delay appears a bit counter to that idea.
I am aware that we are bending ICE here a bit, and I am open to other solutions. The side-effect of testing connectivity over a relay candidate is that we need to bind a channel (our TURN server doesn't support DATA indications). Not just one channel, though: for each remote candidate that we receive, we need to allocate a channel on each allocation! This multiplies up to dozens if not hundreds of channels for each connection as soon as you use a few relays (for redundancy, for example). It seems somewhat wasteful to do that, and it might present a bottleneck for how many connections we can establish per second, because there are only 16k channels per allocation and they only expire after 10 minutes (plus a 5-minute grace period before they can be rebound). A node must therefore use at most ~17 channels per connection to be able to establish 1 connection per second. There might be other ways to optimise this, but it seems easiest to first try hole-punching and, if that doesn't succeed for a while, send relay candidates. In the future, I am planning to be more clever about this by detecting the NAT status early (#4060) and thus avoiding this time penalty for relayed connections.
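As a back-of-envelope check of those numbers (assuming the standard TURN channel-number range 0x4000–0x7FFF, i.e. 16,384 numbers per allocation, each occupied for the 10-minute lifetime plus the 5-minute grace period):

```rust
fn main() {
    let channels_per_allocation = 16_384_f64; // channel numbers 0x4000..=0x7FFF
    let occupied_secs = (10.0 + 5.0) * 60.0; // 10 min lifetime + 5 min grace = 900 s

    // Long-run sustainable channel consumption per allocation: ~18.2/s.
    let channels_per_sec = channels_per_allocation / occupied_secs;

    // At ~17 channels per connection, that caps out near 1 connection/s.
    println!(
        "{channels_per_sec:.1} channels/s -> {:.2} connections/s at 17 channels each",
        channels_per_sec / 17.0
    );
}
```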
Previously, we sent all candidates as soon as we discovered them. Whilst that may appear great for initial connectivity, it is fairly wasteful on both ends of the connection. In almost all cases, we are able to hole-punch a connection, making the exchange of most candidates and the resulting binding of channels completely unnecessary.
To achieve this, we wait for 2 seconds before signalling relay candidates to the other party. In case we nominate a socket within those two seconds, we discard all those relay candidates because they are unnecessary.
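A minimal, self-contained sketch of that buffering behaviour; the `PendingRelayCandidates` type and its method names are illustrative, not the actual snownet implementation:

```rust
use std::time::{Duration, Instant};

// Illustrative model only.
const HOLE_PUNCH_TIMEOUT: Duration = Duration::from_secs(2);

struct PendingRelayCandidates {
    created_at: Instant,
    buffered: Vec<String>, // relay candidates not yet signalled
    nominated: bool,       // whether hole-punching already succeeded
}

impl PendingRelayCandidates {
    fn add(&mut self, candidate: String) {
        self.buffered.push(candidate);
    }

    /// Called when a socket is nominated: the direct connection works,
    /// so the buffered relay candidates are unnecessary and get dropped.
    fn on_nominated(&mut self) {
        self.nominated = true;
        self.buffered.clear();
    }

    /// Relay candidates to signal to the other party at `now`: nothing
    /// until the hole-punching timeout elapses without a nomination.
    fn drain_due(&mut self, now: Instant) -> Vec<String> {
        if self.nominated || now.duration_since(self.created_at) < HOLE_PUNCH_TIMEOUT {
            return Vec::new();
        }
        std::mem::take(&mut self.buffered)
    }
}

fn main() {
    let start = Instant::now();
    let mut pending = PendingRelayCandidates {
        created_at: start,
        buffered: Vec::new(),
        nominated: false,
    };
    pending.add("relay 203.0.113.1:3478".to_owned());

    // Within the 2s window, nothing is signalled yet.
    assert!(pending.drain_due(start + Duration::from_secs(1)).is_empty());

    // After 2s without a nomination, the relay candidates go out.
    assert_eq!(pending.drain_due(start + Duration::from_secs(2)).len(), 1);

    // If hole-punching succeeds within the window, the buffer is discarded.
    let mut pending2 = PendingRelayCandidates {
        created_at: start,
        buffered: vec!["relay 203.0.113.1:3478".to_owned()],
        nominated: false,
    };
    pending2.on_nominated();
    assert!(pending2.drain_due(start + Duration::from_secs(3)).is_empty());
}
```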
Resolves: #4164.