refactor(snownet): only send relay candidates if hole-punching fails #4268
Conversation
Still need to test this properly.
This bump includes a fix for a panic on unknown interfaces (algesten/str0m#493). That panic is what is currently blocking #4268.
Force-pushed from 691f8d5 to 144ed32
Performance Test Results: TCP / UDP
Force-pushed from 144ed32 to 2de7ade
Force-pushed from 2de7ade to ebbd9f9
I'm probably not understanding something, but can't we send relay candidates at the start and just wait 2 seconds to use them if hole-punching fails? I.e. do this in parallel rather than adding 2 seconds to every connection that's stuck on a relay.
When that happens, it'll likely be for an entire org or a user's home office or something.
The difference here is only the RTT of sending the candidate to the portal, right? I'd expect that to be on the order of 50 ms or less in total. Is that really worth it? Or what do you mean by "use" them? As soon as I send a relay candidate to the gateway, it will test its connectivity and might nominate it, which might lead to #4290 if the direct connection fails somehow. I can temporarily hold it on the gateway side, but that isn't any different from holding it on the client side, modulo the RTT of the websocket connection.
Do what in parallel? Note that most connections take a second to be established (a good chunk of which is spent on setup with the portal, such as requesting gateways). We can reduce the delay to 1s, but that could lead to some connections still testing relay candidates even though the direct connection is about to succeed. Also note the comments in #4164.
Force-pushed from ebbd9f9 to 129e175
Ah, got it. Candidates have already been gathered; this just pauses the final step.
Force-pushed from 129e175 to 7280b23
@conectado This is ready for review. I tried to make meaningful commits.
for connection in self.connections.established.values_mut() {
    // Reset the creation timestamp on re-connect so a fresh
    // relay-candidate grace period applies to the (new) connection.
    connection.created_at = now;
}
This is kind of critical: upon re-connect, we need to reset this timestamp to ensure we have a new grace period and don't immediately emit all (new) relay candidates. Once #4585 is merged, I'll try to add a unit test for this.
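To make the intended behaviour concrete, here is a small self-contained model of that reset; the `Connection` struct, `may_emit_relay_candidates` method, and `GRACE_PERIOD` constant are illustrative stand-ins, not the actual snownet types:

```rust
use std::time::{Duration, Instant};

// Illustrative stand-ins, not the actual snownet implementation.
const GRACE_PERIOD: Duration = Duration::from_secs(2);

struct Connection {
    created_at: Instant,
}

impl Connection {
    /// Relay candidates may only be emitted once the grace period has elapsed.
    fn may_emit_relay_candidates(&self, now: Instant) -> bool {
        now.duration_since(self.created_at) >= GRACE_PERIOD
    }
}

fn main() {
    let start = Instant::now();
    let mut conn = Connection { created_at: start };

    // 3s into the connection, relay candidates would be allowed ...
    assert!(conn.may_emit_relay_candidates(start + Duration::from_secs(3)));

    // ... but a re-connect resets `created_at`, so a fresh grace period
    // applies and relay candidates are held back again.
    conn.created_at = start + Duration::from_secs(3);
    assert!(!conn.may_emit_relay_candidates(start + Duration::from_secs(4)));
}
```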
I don't understand how this fixes #4290. I remember that we have priorities set where we always pick a direct connection over a relayed one, and str0m should test all candidate pairs regardless of the order in which they arrive?
Code looks good, though I'm not sure whether this will feel a bit less responsive when a direct connection isn't possible. Could 2 seconds be too high a timeout?
I can remove the "resolves" again and close the other one as not reproducible if you think that is more correct? In theory, ICE should always pick the best one, yes. I don't know why we sometimes don't do that. What I do know is that we currently do a lot of unnecessary work, like repeatedly allocating channels on relays and testing their connectivity, when in most cases a direct connection is possible. We can optimise this in the future by eagerly detecting and remembering our NAT status; see #4060.
FWIW, I haven't seen this happen ever since we fixed the Gateway and Phoenix Channel ref bugs. That could have fixed this by virtue of ensuring all candidates arrive reliably. Another issue I haven't seen reproduce since the above was fixed is #4058. This PR may be fixing an issue that no longer exists.
I've removed the 2nd "resolves". With that taken into account, this PR is primarily an optimisation to avoid wasteful allocation of channels on relays. Not only does that create a lot of chatter in the logs, it also puts an unnecessary limit on how many connections a single gateway can handle, because there is no way of expiring a channel binding and we currently bind dozens of them on every connection. With this in place, I would also like to remove the following optimisation:
firezone/rust/connlib/snownet/src/node.rs Lines 205 to 220 in 31eec1a
This currently makes assumptions about which candidates are allowed to talk to each other. I do think it is sound, but I would like to not make assumptions about network topologies. I think this optimisation makes much more sense: we should first try to hole-punch, and if that doesn't succeed, try the relays.
yeah :)
Done!
Yeah, doing this time-based isn't ideal. I'd like to work on #4060, which would allow us to not pay this time penalty when we've previously detected that we are behind a restrictive NAT.
Those will get seeded in once the connection is accepted.
Force-pushed from 7280b23 to f601ed2
I think the tests are currently failing for the same reason as algesten/str0m#496. The non-roaming party doesn't know that the candidate was invalidated and thus keeps sending messages to it.
I obviously don't know the full story here, but from a bird's-eye perspective this appears to work against the spirit of the ICE agent. The whole point of trickle ICE is to be able to start connectivity checks early, before the complete enumeration of NICs and relays has finished. It's a "throw it all at the wall and see what sticks" kind of approach where everything is sent as soon as it's found. Putting in a deliberate delay appears a bit counter to that idea.
I am aware that we are bending ICE here a bit, and I am open to other solutions. The side-effect of testing connectivity over a relay candidate is that we need to bind a channel (our TURN server doesn't support DATA indications). Not just one channel, though: for each remote candidate that we receive, we need to allocate a channel on each allocation! This multiplies up to dozens if not hundreds of channels for each connection as soon as you use a few relays (for redundancy, for example). It seems somewhat wasteful to do that, and it might present a bottleneck for how many connections we can establish per second, because there are only 16k channels per allocation and they only expire after 10 minutes (plus a 5-minute grace period before they can be rebound). A node must therefore use at most ~17 channels per connection to be able to establish 1 connection per second. There might be other ways to optimise this, but it seems easiest to first try hole-punching and, if that doesn't succeed for a while, send relay candidates. In the future, I am planning to be more clever about this by detecting the NAT status early (#4060) and thus avoiding this time penalty for relayed connections.
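As a back-of-envelope check of those numbers (assuming the standard TURN channel-number range 0x4000–0x7FFF, i.e. 16,384 numbers per allocation, each occupied for the 10-minute lifetime plus the 5-minute grace period):

```rust
fn main() {
    let channels_per_allocation = 16_384_f64; // channel numbers 0x4000..=0x7FFF
    let occupied_secs = (10.0 + 5.0) * 60.0; // 10 min lifetime + 5 min grace = 900 s

    // Long-run sustainable channel consumption per allocation: ~18.2/s.
    let channels_per_sec = channels_per_allocation / occupied_secs;

    // At ~17 channels per connection, that caps out near 1 connection/s.
    println!(
        "{channels_per_sec:.1} channels/s -> {:.2} connections/s at 17 channels each",
        channels_per_sec / 17.0
    );
}
```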
Previously, we sent all candidates as soon as we discovered them. Whilst that may appear great for initial connectivity, it is fairly wasteful on both ends of the connection. In almost all cases, we are able to hole-punch a connection, making the exchange of most candidates and the resulting binding of channels completely unnecessary.
To achieve this, we wait for 2 seconds before signalling relay candidates to the other party. In case we nominate a socket within those two seconds, we discard all those relay candidates because they are unnecessary.
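A minimal, self-contained sketch of that buffering behaviour; the `PendingRelayCandidates` type and its method names are illustrative, not the actual snownet implementation:

```rust
use std::time::{Duration, Instant};

// Illustrative model only.
const HOLE_PUNCH_TIMEOUT: Duration = Duration::from_secs(2);

struct PendingRelayCandidates {
    created_at: Instant,
    buffered: Vec<String>, // relay candidates not yet signalled
    nominated: bool,       // whether hole-punching already succeeded
}

impl PendingRelayCandidates {
    fn add(&mut self, candidate: String) {
        self.buffered.push(candidate);
    }

    /// Called when a socket is nominated: the direct connection works,
    /// so the buffered relay candidates are unnecessary and get dropped.
    fn on_nominated(&mut self) {
        self.nominated = true;
        self.buffered.clear();
    }

    /// Relay candidates to signal to the other party at `now`: nothing
    /// until the hole-punching timeout elapses without a nomination.
    fn drain_due(&mut self, now: Instant) -> Vec<String> {
        if self.nominated || now.duration_since(self.created_at) < HOLE_PUNCH_TIMEOUT {
            return Vec::new();
        }
        std::mem::take(&mut self.buffered)
    }
}

fn main() {
    let start = Instant::now();
    let mut pending = PendingRelayCandidates {
        created_at: start,
        buffered: Vec::new(),
        nominated: false,
    };
    pending.add("relay 203.0.113.1:3478".to_owned());

    // Within the 2s window, nothing is signalled yet.
    assert!(pending.drain_due(start + Duration::from_secs(1)).is_empty());

    // After 2s without a nomination, the relay candidates go out.
    assert_eq!(pending.drain_due(start + Duration::from_secs(2)).len(), 1);

    // If hole-punching succeeds within the window, the buffer is discarded.
    let mut pending2 = PendingRelayCandidates {
        created_at: start,
        buffered: vec!["relay 203.0.113.1:3478".to_owned()],
        nominated: false,
    };
    pending2.on_nominated();
    assert!(pending2.drain_due(start + Duration::from_secs(3)).is_empty());
}
```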
Resolves: #4164.