p2p: Deprecate TTFB, RESP_TIMEOUT, introduce rate limiting recommendations #3767

Open
wants to merge 3 commits into dev
Conversation

arnetheduck
Contributor


As part of the discussions surrounding EIP-7594 (peerdas), it was highlighted that during sampling and/or data requests, the sampler does not have timing information for when a samplee will have data available. It is desirable not to introduce a deadline, since this artificially introduces latency for the typical scenario where data becomes available earlier than an agreed-upon deadline.

Similarly, when a client issues a request for blocks, it often does not know the rate limiting policy of the serving end and must either pessimistically rate limit itself or run the risk of being disconnected for spamming the server - outcomes which lead to unnecessarily slow syncing as well as a testnet mess of peer scoring and disconnection issues.

This PR solves both problems by:

  • removing the time-to-first-byte and response timeouts, allowing requesters to optimistically queue requests - to date, the timeouts have never been fully implemented in clients
  • introducing a hard limit on the number of concurrent requests that a client may issue, per protocol
  • introducing a recommendation for rate limiting that allows optimal bandwidth usage without protocol changes or additional messaging roundtrips

On the server side, an "open" request does not consume significant resources while it sits idle, meaning that letting the server manage resource allocation by slowing down data serving is safe, as long as concurrency is adequately limited.
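
To make that concrete, here is a minimal sketch of a server-side per-protocol concurrency cap; the `MAX_CONCURRENT_REQUESTS` value, the `handle_request`/`serve` names, and the async framing are illustrative assumptions, not part of this PR:

```python
from collections import defaultdict

# Illustrative cap; the actual per-protocol limit would be set by the spec text.
MAX_CONCURRENT_REQUESTS = 2

open_requests: dict[tuple[str, str], int] = defaultdict(int)

async def handle_request(peer_id: str, protocol_id: str, serve) -> bytes:
    """Serve one req/resp stream while enforcing the concurrency cap."""
    key = (peer_id, protocol_id)
    if open_requests[key] >= MAX_CONCURRENT_REQUESTS:
        # Exceeding the hard limit is misbehaviour; this check replaces the
        # old timeouts as the bound on the number of open streams.
        raise ConnectionError(f"{peer_id}: too many open {protocol_id} streams")
    open_requests[key] += 1
    try:
        # An open-but-idle request costs little; the server may delay
        # `serve()` (e.g. behind a rate limiter) rather than time out.
        return await serve()
    finally:
        open_requests[key] -= 1
```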

On the client side, clients must already be prepared to handle slow servers, and they can simply apply their existing strategies to both uncertainty and rate-limiting scenarios (how long to wait before timing out, what to do with "slow peers").

Token / leaky buckets are a classic option for rate limiting, with desirable properties both when we're sending requests to many clients concurrently (good burst performance) and when the requestees are busy (long-term resource usage stays in check and clients are served fairly).
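
For concreteness, a minimal token-bucket sketch; the class name, the refill arithmetic, and the idea of charging per byte are illustrative assumptions rather than anything this PR mandates:

```python
import time

class TokenBucket:
    """Classic token bucket: allows bursts up to `capacity` while capping the
    long-term rate at `rate` tokens per second (tokens could be bytes or requests)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so initial bursts are cheap
        self.updated = time.monotonic()

    def delay_for(self, cost: float) -> float:
        """Seconds to withhold a response costing `cost` tokens."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        self.tokens -= cost  # may go negative: the deficit is the wait
        return max(0.0, -self.tokens / self.rate)
```

A server could keep one bucket per peer, charge each response by its byte size, and sleep for `delay_for(size)` before streaming it - the client never has to learn the server's policy for this to work.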

@ppopth
Member

ppopth commented May 21, 2024

As part of the discussions surrounding EIP-7594 (peerdas), it was highlighted that during sampling and/or data requests, the sampler does not have timing information for when a samplee will have data available. It is desirable not to introduce a deadline, since this artificially introduces latency for the typical scenario where data becomes available earlier than an agreed-upon deadline.

That is the reason why I proposed passive sampling at #3717. Would you mind checking it out?


If any of these timeouts fire, the requester SHOULD reset the stream and deem the req/resp operation to have failed.
If a timeout happens on the requesting side, they SHOULD reset the stream.
Member

What kind of timeout can still happen? I think there is nothing left because you already removed TTFB_TIMEOUT and RESP_TIMEOUT. Or did you leave this line in case we want to re-enable the timeouts in the future?

Contributor Author

as a consumer, you can implement whatever timeouts you want (in fact, you probably should) - this merely says what you should do to inform the server that you're no longer interested

@ppopth
Member

ppopth commented May 21, 2024

Is there any rationale for having the timeouts in the first place? Having two constants, TTFB_TIMEOUT and RESP_TIMEOUT, rather than just a single constant like TIMEOUT, quite surprised me.


Broadly, the requesting side does not know the capacity / limit of each server but can derive it from the rate of responses for the purpose of selecting the next peer for a request.

Because the server withholds the response until capacity is available, a client can optimistically send requests without risking running into negative scoring situations or sub-optimal rate polling.
Member

Is there a DoS vector here where clients can fill up the server's buffer?

Contributor Author

the limit of open streams per protocol id replaces the previous mechanism (timeouts)
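
On the requesting side, the quoted recommendation to derive capacity "from the rate of responses" could look roughly like this; the EWMA smoothing and all names here are hypothetical, not from the PR:

```python
class PeerThroughput:
    """Tracks a smoothed bytes-per-second estimate for each peer."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # smoothing factor for the moving average
        self.rates: dict[str, float] = {}

    def record(self, peer_id: str, num_bytes: int, seconds: float) -> None:
        rate = num_bytes / max(seconds, 1e-9)
        prev = self.rates.get(peer_id, rate)
        # Exponentially weighted moving average evens out bursts.
        self.rates[peer_id] = (1 - self.alpha) * prev + self.alpha * rate

    def next_peer(self, candidates: list[str]) -> str:
        # Prefer the peer that has been serving responses fastest.
        return max(candidates, key=lambda p: self.rates.get(p, 0.0))
```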

@arnetheduck
Contributor Author

Is there any rationale for having the timeouts in the first place?

to avoid too many concurrently open streams - it's a dubious mechanism at best to have in a spec - what timeout strategy is correct depends on how the consumer uses the request and/or the nature of the request

Having two constants, TTFB_TIMEOUT and RESP_TIMEOUT, rather than just a single constant like TIMEOUT, quite surprised me.

ttfb is a way to get rid of stalling peers more quickly - imagine a request taking 10s to stream (because it's big and bandwidth is low) - ttfb allows closing the request more quickly if the responding end doesn't answer at all.
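
For reference, the deprecated two-stage scheme behaves roughly like this; a sketch assuming an asyncio StreamReader-like API, with the budget arithmetic simplified:

```python
import asyncio
import time

TTFB_TIMEOUT = 5    # seconds to wait for the first byte
RESP_TIMEOUT = 10   # seconds for the complete response transfer

async def read_response(stream) -> bytes:
    start = time.monotonic()
    # Stage 1: a peer that never answers at all is dropped after just 5s ...
    first = await asyncio.wait_for(stream.read(1), timeout=TTFB_TIMEOUT)
    # Stage 2: ... while a large-but-flowing response gets the rest of the 10s.
    remaining = RESP_TIMEOUT - (time.monotonic() - start)
    rest = await asyncio.wait_for(stream.read(), timeout=remaining)
    return first + rest
```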

@arnetheduck
Contributor Author

passive sampling

happy to, when we get to that point in our implementation, though this PR is useful independently (given how TTFB is not actually used correctly today)

@nisdas
Contributor

nisdas commented May 22, 2024

This PR would be pretty useful for us as we are seeking to revamp our rate limiting strategy. Currently, with more topics carrying large messages (blocks, blobs, and soon columns), it becomes messier to determine how to rate limit peers. Instead of trying to guess the other peer's rate limits, it is much cleaner for us to simply wait (or terminate early if you do not want to wait). Every client has a different rate limiting strategy; rather than trying to cater to the most pessimistic one, this allows each client to go at its own pace. Would be interested in other clients' thoughts on this.

@@ -194,8 +195,8 @@ This section outlines configurations that are used in this spec.
 | `EPOCHS_PER_SUBNET_SUBSCRIPTION` | `2**8` (= 256) | Number of epochs on a subnet subscription (~27 hours) |
 | `MIN_EPOCHS_FOR_BLOCK_REQUESTS` | `MIN_VALIDATOR_WITHDRAWABILITY_DELAY + CHURN_LIMIT_QUOTIENT // 2` (= 33024, ~5 months) | The minimum epoch range over which a node must serve blocks |
 | `MAX_CHUNK_SIZE` | `10 * 2**20` (=10485760, 10 MiB) | The maximum allowed size of uncompressed req/resp chunked responses. |
-| `TTFB_TIMEOUT` | `5` | The maximum duration in **seconds** to wait for first byte of request response (time-to-first-byte). |
-| `RESP_TIMEOUT` | `10` | The maximum duration in **seconds** for complete response transfer. |
+| `TTFB_TIMEOUT` | N/A | TTFB should remain disabled. |
Member

Why not completely remove both of them rather than marking them as disabled?
