
Degrade provider handling quality gracefully under load #730

Draft · wants to merge 3 commits into base: master

Conversation

@Stebalien (Member) commented Jul 22, 2021

This PR contains two changes to scale provider handling:

  1. It drops inbound provider records when we're under heavy load.
  2. When under load, it only returns provider records to clients when the DHT node is in the provider record's "closest" bucket.

Ideally we'd have some form of parallel provider-record retrieval from the datastore, but this is still a good first step.

fixes #675
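To make point 1 concrete, here is a minimal sketch of the drop-under-load idea: a bounded queue in front of the provider store with a non-blocking enqueue. The types, queue shape, and import path below are illustrative assumptions, not necessarily what this PR implements.

```go
// Illustrative sketch only: exact types and queue sizing are assumptions,
// not necessarily what this PR implements.
package providers

import (
	"context"
	"errors"

	"github.com/libp2p/go-libp2p-core/peer" // import path as of the go-libp2p-core era
)

var errQueueFull = errors.New("provider queue full, dropping inbound record")

type addReq struct {
	key  []byte
	prov peer.ID
}

// providerQueue sits in front of the provider store. Its channel is bounded:
// the size is a memory/burst bound, not a throughput knob.
type providerQueue struct {
	newProviders chan addReq
}

// AddProviderNonBlocking enqueues an inbound provider record, or sheds it if
// the queue is full (i.e. the node is overloaded), instead of blocking the
// RPC handler the way a plain AddProvider would.
func (q *providerQueue) AddProviderNonBlocking(ctx context.Context, key []byte, prov peer.ID) error {
	select {
	case q.newProviders <- addReq{key: key, prov: prov}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	default:
		return errQueueFull
	}
}
```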

That way, overloaded nodes can drop provides.
And return when the process is closing. This will help speed up the main loop a bit.
Let's assume there is one provider (or none) for a record and everyone is looking for 10 providers.

- Before, peers would wait for all peers along the path to return this one provider. However, because there's only one provider, peers won't be able to short-circuit anyway.
- Now, peers will go to the end of the path before waiting.

This may make some queries slower, but it attempts to give "priority" to
peers that actually _need_ responses as opposed to peers that are
"optimistically" waiting for responses.
@aschmahmann (Contributor) commented Jul 22, 2021

Still need to come back to take a look more carefully, but a few thoughts:

  1. I don't think dropping requests is really a reasonable thing to do unless:
    a. You have some metric you can track telling you how many requests you're dropping.
    b. You have an option that allows you to increase the amount of resources you throw at the problem (i.e. Allow the ProviderManager to have more parallelism #729).
  2. If we're concerned about this, do we need to worry about fairness and have some sort of round-robin like we do for Bitswap?
    a. Maybe not necessary to fix right now, but this is a server-side change, so upgrades here are pretty slow.
  3. How are we (people who design these networks) going to determine the correct parameters here? Is the idea that we keep the behavior the same by default, but allow some experimentation on busy nodes to determine what their loads tend to look like, before making any more general server-side changes?

@Stebalien (Member, Author)

> I don't think dropping requests is really a reasonable thing to do unless:

1.b. #729 (comment)

And yes, you're right, we need to track this.
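A minimal sketch of what that tracking could look like, using nothing but the standard library's expvar package; the metric name and helper are hypothetical, and the real code would presumably hook into whatever metrics stack the DHT already uses:

```go
package providers

import "expvar"

// droppedProvides counts inbound provider records shed under load.
// It shows up under /debug/vars once the expvar HTTP handler is mounted.
// The metric name is a placeholder, not an existing metric in this repo.
var droppedProvides = expvar.NewInt("dht_dropped_provider_records")

// recordDroppedProvide would be called on the drop path of AddProviderNonBlocking.
func recordDroppedProvide() {
	droppedProvides.Add(1)
}
```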

> If we're concerned about this, do we need to worry about fairness and have some sort of round-robin like we do for Bitswap?

In theory? But I still feel like this change is a strict improvement. It'll only kick in when overloaded anyway.

> How are we (people who design these networks) going to determine the correct parameters here?

Really, I don't think there's too much to tune here.

  1. The limit on the inbound provider queue size is more based on memory than anything. It helps with bursts, but won't actually increase throughput (much).
  2. The limits on the get side should probably be tuned based on the expected latency. But that limit will only kick in if we're (a) under load and (b) not "responsible" for the record, so I'm not really concerned.
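To make those two knobs concrete, here is a hypothetical configuration block; the names and values are placeholders for illustration, not the PR's actual constants:

```go
package providers

import "time"

const (
	// maxPendingProvides bounds the inbound provider queue. It is effectively
	// a memory bound that absorbs bursts; raising it does not raise throughput.
	maxPendingProvides = 1024

	// getProvidersDeadline bounds how long we keep trying to serve provider
	// records while overloaded and NOT "responsible" for the key (i.e. the key
	// is outside our closest bucket). It should track expected client latency.
	getProvidersDeadline = 5 * time.Second
)
```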

```diff
@@ -366,7 +390,10 @@ func (dht *IpfsDHT) handleAddProvider(ctx context.Context, p peer.ID, pmes *pb.M
 // add the received addresses to our peerstore.
 dht.peerstore.AddAddrs(pi.ID, pi.Addrs, peerstore.ProviderAddrTTL)
 }
-dht.ProviderManager.AddProvider(ctx, key, p)
+err := dht.ProviderManager.AddProviderNonBlocking(ctx, key, p)
```
Review comment (Contributor):

Why not do a mybucket style check here as well?

Reply (Member Author):

As far as I understand, we're always in the bucket here, right?
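The "my bucket" style check being discussed boils down to asking whether the key shares a long enough XOR prefix with our own ID to land in the routing table's deepest bucket. A self-contained sketch of that idea, hand-rolled here for illustration (the library's kbucket helpers would normally do this):

```go
package kademlia

import (
	"crypto/sha256"
	"math/bits"
)

// commonPrefixLen returns the number of leading bits shared by the Kademlia
// identifiers of a key and a peer ID (both are SHA-256 hashes in this DHT).
func commonPrefixLen(key, selfID []byte) int {
	a := sha256.Sum256(key)
	b := sha256.Sum256(selfID)
	cpl := 0
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return cpl + bits.LeadingZeros8(x)
		}
		cpl += 8
	}
	return cpl
}

// keyInDeepestBucket reports whether the key would fall into the routing
// table's deepest ("my") bucket, given how many buckets are currently in use.
func keyInDeepestBucket(key, selfID []byte, bucketsInUse int) bool {
	return commonPrefixLen(key, selfID) >= bucketsInUse-1
}
```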

@aschmahmann (Contributor)

> In theory? But I still feel like this change is a strict improvement. It'll only kick in when overloaded anyway.

👍

> The limit on the inbound provider queue size is more based on memory than anything. It helps with bursts, but won't actually increase throughput (much).

Ok, I think you've reasonably convinced me. Since the datastore is batching, it seems reasonable to expect that puts should be quick here (as long as they're not blocked on the event queue by gets). In that case, if your datastore is slow, there's not much to be done about anything else here (aside from queues for bursts like you mentioned).

However, since Gets can block Puts we still need to do some estimation on queue size (and parallelism) required for what we expect an average node to need. I suspect this won't be so awful given that the existing networks are pretty functional without this, so we mostly need some conservative estimates and let power users like infra providers tune more accurately over time.

@Stebalien (Member, Author)

> However, since Gets can block Puts we still need to do some estimation on queue size (and parallelism) required for what we expect an average node to need. I suspect this won't be so awful given that the existing networks are pretty functional without this, so we mostly need some conservative estimates and let power users like infra providers tune more accurately over time.

What about having get workers? Basically, we could:

  1. Quickly check the cache (inline).
  2. If not there, pass the task off to a get worker.
  3. Get the result back and cache it.

Doing this without blocking the main event loop is going to be a bit tricky, but doable.
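A rough sketch of that shape, with the main event loop checking its cache inline and falling back to a small pool of get workers on a miss. Everything here (getReq, startGetWorkers, the fetch/cached callbacks) is a hypothetical illustration rather than this repo's actual code:

```go
package providers

import (
	"context"

	"github.com/libp2p/go-libp2p-core/peer"
)

// getReq is handed from the main event loop to a get worker so that one slow
// datastore read cannot back up puts (or other gets) behind it.
type getReq struct {
	key  []byte
	resp chan []peer.ID
}

type providerManager struct {
	getWork chan getReq
}

// startGetWorkers launches n workers that perform the (possibly slow)
// datastore reads off the main event loop.
func (pm *providerManager) startGetWorkers(ctx context.Context, n int, fetch func([]byte) []peer.ID) {
	for i := 0; i < n; i++ {
		go func() {
			for {
				select {
				case req := <-pm.getWork:
					provs := fetch(req.key) // step 2: do the read off the event loop
					select {
					case req.resp <- provs: // step 3: hand the result back to be cached
					case <-ctx.Done():
						return
					}
				case <-ctx.Done():
					return
				}
			}
		}()
	}
}

// getProviders runs on the main event loop: step 1 checks the cache inline;
// only on a miss is the request handed off to a worker.
func (pm *providerManager) getProviders(ctx context.Context, key []byte, cached func([]byte) ([]peer.ID, bool)) <-chan []peer.ID {
	out := make(chan []peer.ID, 1)
	if provs, ok := cached(key); ok {
		out <- provs
		close(out)
		return out
	}
	select {
	case pm.getWork <- getReq{key: key, resp: out}:
	case <-ctx.Done():
		close(out)
	}
	return out
}
```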

@petar mentioned this pull request Jul 30, 2021
@petar changed the title from "Scale provider handling" to "Degrade provider handling quality gracefully under load" on Jul 30, 2021
@mvdan commented Aug 10, 2021

There are a bunch of servers using this library to subscribe to a topic and publish messages to one another, and from time to time there are huge goroutine spikes that straight up take the process down due to memory usage:
[image: goroutine spike graph]

I'm fairly certain that I'm experiencing the same issue that Steven is trying to tackle here. All of those goroutines get stuck on the "select" trying to get their incoming request handled, in calls like (*ProviderManager).AddProvider and (*ProviderManager).GetProviders.

It's not yet clear why these spikes in requests suddenly happen, but at the very least I want the processes not to be taken down so easily by a spike of 50-100k requests. So I think this pull request would help.

@Stebalien do you need help with reviews or testing? I'm not an expert in go-libp2p or this particular library, but I'm happy to help where I can or run this branch for a few days, if you think it's ready enough to give it a go.

@Stebalien (Member, Author)

The one thing I still wanted was "get" workers. At the moment, gets are serial and puts can easily get backed up on a single slow get.

@aschmahmann marked this pull request as draft on September 15, 2021
@aschmahmann (Contributor)

@petar when performance of large DHT nodes (e.g. the hydras) comes up on your radar again, could you take a look at this and decide whether it's ready to merge as-is, or what changes need to be made?

mvdan added a commit to mvdan/go-dvote that referenced this pull request Sep 21, 2021
This pulls in libp2p/go-libp2p-kad-dht#730, via a branch based shortly after v0.12.2.

Essentially, this handles priority for requests a bit better,
and drops unimportant requests if they come in too fast.
This should prevent kad-dht from using tons of memory.

Updates vocdoni#243.
p4u pushed a commit to vocdoni/vocdoni-node that referenced this pull request Sep 21, 2021
jordipainan pushed a commit to vocdoni/vocdoni-node that referenced this pull request Sep 23, 2021
Successfully merging this pull request may close these issues:
Drop inbound provider records when overloaded