Handling bursty/batch/bulk data dumps #19896
Unanswered
srstrickland asked this question in Q&A
Replies: 1 comment 2 replies
-
Had another idea, thought I would share: create a ConfigMap with the relevant settings (and whatever other config params might be useful here) and set the env var accordingly. I don't know if this will be honored by Vector, but I think it would be, assuming that Vector is using the AWS libs. If this works, it's probably good enough for my use case for now. But I wonder if this is something that would be worth building into Vector as a general feature? 🤔
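As a sketch of the idea (the original ConfigMap contents and env var were not preserved here, so the key names below are hypothetical placeholders): the AWS SDKs do honor standard env vars like `AWS_RETRY_MODE=adaptive`, which enables client-side rate limiting — whether Vector's bundled AWS client respects them is exactly the open question above.

```yaml
# Hypothetical sketch -- key names chosen for illustration, not taken
# from the original comment. AWS_RETRY_MODE / AWS_MAX_ATTEMPTS are
# standard AWS SDK env vars; whether Vector honors them is untested.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector-aws-tuning
data:
  AWS_RETRY_MODE: "adaptive"   # adaptive mode adds client-side rate limiting
  AWS_MAX_ATTEMPTS: "3"
---
# Container-spec fragment for the Vector pod: surface the ConfigMap
# values as environment variables.
env:
  - name: AWS_RETRY_MODE
    valueFrom:
      configMapKeyRef:
        name: vector-aws-tuning
        key: AWS_RETRY_MODE
```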
-
Vector is currently handling log shipping for our "realtime" environments, where logs are streamed out from running processes. But we have a number of other "offline" processes which generate logs in a disconnected environment that then need to be uploaded in batch. It seems fairly trivial to set up an S3 / SQS ingestion mechanism for this where users or applications just dump logfiles to S3 (and in fact I've already built & verified this), but I'm worried about the bursty nature of this data and how it might affect the other realtime things.
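For reference, the S3 + SQS ingestion described above can be wired up with Vector's `aws_s3` source roughly like this (region, queue URL, and source name are placeholders; the S3 bucket must be configured to publish object-created notifications to the SQS queue):

```yaml
# Sketch of an S3-via-SQS batch-log source in Vector.
# Names and the queue URL are placeholders for illustration.
sources:
  s3_batch_logs:
    type: aws_s3
    region: us-east-1
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/000000000000/log-dumps"
```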
Vector is built to be fast, and for most applications "as fast as possible" or "as fast as the sinks will accept" is the answer. But I'm not finding many ways to slow things down, "smooth out" data delivery, or control flows from different sources (in an attempt to be "fair"). What happens if someone drops a 1TB log file to S3? It'll be read as fast as possible, and very possibly starve the data coming from the (arguably more important) realtime applications. And generally this data is going to be very bursty, in stark contrast to the rest of the system.
And to be fair, there are things I can build outside of vector to help with this (like an ingestion pipeline to kafka with intentional rate limits built by hand), and I could build a completely separate pipeline with different resource allocations (which might be a good idea anyways), but I just wanted to see if anyone else has had similar use cases. Fairness and flow control might be generally useful, and if I can keep everything inside vector, it makes things a lot simpler operationally.
Backpressure & adaptive concurrency are natural for Vector, so it got me thinking along a few lines:

- A `sleep` mechanism: if remap had one, I could sort of manually control flow through a remap transform by sleeping between events.
- The `throttle` transform might be a good home for this, with a "block" option, but seems really tricky because `throttle` is (optionally) key-based, and blocking per key seems... impossible. But maybe some `rate_limit` transform which just operates on the entire stream?

Any other ideas? Thoughts? Am I way off base?
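For what it's worth, the existing `throttle` transform can already cap an entire stream when no `key_field` is set — though it drops events over the threshold rather than blocking, which is the gap described above. A minimal sketch (source name and numbers are placeholders):

```yaml
# Sketch: cap the batch pipeline at ~1000 events/sec across the whole
# stream (no key_field, so the limit applies globally). Note that
# events over the threshold are DROPPED, not blocked/backpressured.
transforms:
  smooth_batch:
    type: throttle
    inputs: ["s3_batch"]   # placeholder source name
    threshold: 1000
    window_secs: 1
```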