Add concurrent batching to the `file` sink #20394
Comments
Hi @fpytloun, thanks for filing this! I think I'm missing what the specific request is, though. Is it to generally improve the throughput of the `file` sink?
Hello @jszwedko, I was thinking of this as a bug: this component, which should be very simple, can easily apply backpressure (even when using tmpfs) and throttle more complex components (Kafka, Elasticsearch, ClickHouse, etc.) that can handle much higher throughput. I think batching and concurrency would fix it, and I currently don't know of any workaround. I could possibly switch to the `aws_s3` sink, which would consume a lot of memory due to its in-memory batching, but probably about the same amount as I already use for tmpfs with the `file` sink.
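To make the request concrete, here is a minimal sketch of what "concurrent batching" could mean for a partitioned file sink: accumulate events into per-partition batches, then flush each batch on its own worker so compression and I/O for one file cannot stall the others. This is purely illustrative and is not Vector's actual sink code; `flush_concurrently` and the byte-counting stand-in for the gzip/write step are hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

// Sketch: flush each partition's batch on its own thread so that
// gzip/IO work for one output file cannot block the others.
// (Illustrative only -- not Vector's actual implementation.)
fn flush_concurrently(batches: HashMap<String, Vec<String>>) -> usize {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|(path, lines)| {
            thread::spawn(move || {
                // A real sink would compress `lines` and append them to
                // the file at `path`; here we just sum payload bytes so
                // the sketch stays runnable without touching the disk.
                let _ = &path;
                lines.iter().map(|l| l.len()).sum::<usize>()
            })
        })
        .collect();
    // Join all writers and report the total bytes "written".
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let mut batches = HashMap::new();
    batches.insert("app-a.log".to_string(), vec!["hello".to_string()]);
    batches.insert("app-b.log".to_string(), vec!["world!".to_string()]);
    println!("flushed {} bytes", flush_concurrently(batches));
}
```

The key property is that a slow partition (e.g. one compressing a large batch) no longer serializes the whole flush, which is what lets the sink keep up with high-cardinality partitioning instead of applying backpressure upstream.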
Makes sense, thanks for confirming. I'll adjust this issue to be about adding concurrent batching to the `file` sink.
A note for the community
Problem
I can see 100% utilization on the `file` sink, which then applies backpressure and slows down the whole pipeline. I am using tmpfs, so disk speed is not the bottleneck, but high-cardinality partitioning could be. It seems the `file` sink does not batch concurrently and therefore applies backpressure quickly (especially with gzip compression).
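For illustration, a configuration along these lines reproduces the scenario: a templated `path` yields many concurrently open files (one per partition key), combined with gzip compression. The source name and path template below are hypothetical, not my actual config; the field names follow Vector's `file` sink options.

```toml
[sinks.file_out]
type = "file"
inputs = ["kafka_in"]                              # hypothetical upstream source
path = "/var/log/out/{{ tenant }}/{{ app }}.log"   # high-cardinality partitioning
compression = "gzip"                               # compression amplifies the stall
encoding.codec = "json"
```

With enough distinct `tenant`/`app` combinations, serial flushing of these partitions is where the sink saturates.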
Configuration
Version
0.37.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response