Handling bursty/batch/bulk data dumps #19896

srstrickland · 2024-02-17T19:03:17Z

srstrickland
Feb 17, 2024

Vector is currently handling log shipping for our "realtime" environments, where logs are streamed out from running processes. But we have a number of other "offline" processes which generate logs in a disconnected environment that then need to be uploaded in batch. It seems fairly trivial to set up an S3 / SQS ingestion mechanism for this where users or applications just dump logfiles to S3 (and in fact I've already built & verified this), but I'm worried about the bursty nature of this data and how it might affect the other realtime things.

Vector is built to be fast, and for most applications "as fast as possible" or "as fast as the sinks will accept" is the answer. But I'm not finding many ways to slow things down, "smooth out" data delivery, or control flows from different sources (in an attempt to be "fair"). What happens if someone drops a 1TB log file to s3? It'll be read as fast as possible, and very possibly starve the data coming from the (arguably more important) realtime applications. And generally this data is going to be very bursty, in stark contrast to the rest of the system.

And to be fair, there are things I can build outside of vector to help with this (like an ingestion pipeline to kafka with intentional rate limits built by hand), and I could build a completely separate pipeline with different resource allocations (which might be a good idea anyways), but I just wanted to see if anyone else has had similar use cases. Fairness and flow control might be generally useful, and if I can keep everything inside vector, it makes things a lot simpler operationally.

Backpressure & adaptive concurrency are natural for vector, so it got me thinking along a few lines:

If VRL had some kind of sleep mechanism, I could sort of manually control flow through a remap transform by doing something like:

# approximately every 10000 items, sleep for 1s
if (random_float(0.0, 1.0) < 0.0001) { sleep(1) }

Maybe sources could expose a configuration for some kind of speed limit (either bytes/sec or events/sec)?
I initially thought the throttle transform might be a good home for this, with a "block" option, but seems really tricky because throttle is (optionally) key-based, and blocking per key seems... impossible. But maybe some rate_limit transform which just operates on the entire stream?

Any other ideas? Thoughts? Am I way off base?

srstrickland · 2024-02-18T01:52:04Z

srstrickland
Feb 18, 2024
Author

Had another idea, thought I would share. Create a configmap:

[default]
s3 =
  max_concurrent_requests = 10
  max_bandwidth = 10MB/s

(and whatever other config params might be useful here)

And set the env var: AWS_CONFIG_FILE=/path/to/config/above

I don't know if this will be honored by vector, but I think it would, assuming that vector is using AWS libs.

If this works, it's probably good enough for my use case for now. But I wonder if this is something that would be worth building into vector as a general feature? 🤔

2 replies

jszwedko Feb 20, 2024
Maintainer

It is also possible to apply backpressure by limiting concurrency in sinks (depending on the sink). For example, if you have an http sink you can configure request.concurrency to a low value to limit the throughput. This back-pressure will propagate back up to the source, in this case the aws_s3 source. The AWS SDK config options you found sound like they'd fit this case even better though, if they work.

srstrickland Feb 20, 2024
Author

Initial attempts to limit the s3 bandwidth using the mechanism above don't seem to be doing anything (same speed reading a 20mb file no matter what I put in the max_bandwidth field)... and I found this issue which leads me to believe the rust s3 sdk doesn't honor these configurations.

The problem is that my sinks are shared, I don't want to slow down all my sources (just the batch ones). I suppose I could wire up some dummy sink (reading only from the batch sources) and limit its concurrency, but this seems like a pretty roundabout mechanism... I have to sort of "invent" a sink, and play around with the concurrency / batch settings until I get "close" to what I need.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling bursty/batch/bulk data dumps #19896

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Handling bursty/batch/bulk data dumps #19896

srstrickland Feb 17, 2024

Replies: 1 comment · 2 replies

srstrickland Feb 18, 2024 Author

jszwedko Feb 20, 2024 Maintainer

srstrickland Feb 20, 2024 Author

srstrickland
Feb 17, 2024

Replies: 1 comment 2 replies

srstrickland
Feb 18, 2024
Author

jszwedko Feb 20, 2024
Maintainer

srstrickland Feb 20, 2024
Author