Add concurrent batching to the file sink #20394

Open
fpytloun opened this issue Apr 29, 2024 · 3 comments
Labels
domain: performance (Anything related to Vector's performance) · sink: file (Anything `file` sink related) · type: enhancement (A value-adding code change that enhances its existing functionality)

Comments

@fpytloun
Contributor

fpytloun commented Apr 29, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I can see 100% utilization on the file sink, which then applies backpressure and slows down the whole pipeline. I am using tmpfs, so the disk is not a bottleneck, but high-cardinality partitioning could be. It seems that the file sink does not batch concurrently and therefore applies backpressure quickly (especially with gzip compression).
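
(The 100% figure presumably refers to Vector's internal `utilization` gauge for this sink. As a minimal, untested sketch for anyone reproducing this, the metric can be exposed via the internal_metrics source and prometheus_exporter sink, using the exporter's default address; `vector top` shows the same data:

[sources.vector_metrics]
      type = "internal_metrics"

[sinks.vector_metrics_out]
      type = "prometheus_exporter"
      inputs = ["vector_metrics"]
      address = "0.0.0.0:9598"    # default exporter address; adjust as needed
)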

Configuration

[sinks.out_kafka_access_file]
      type = "file"
      #inputs = ["throttle_kafka_access_tenant"]
      inputs = ["remap_kafka_access"]
      compression = "gzip"
      encoding.except_fields = ["_index", "_topic", "_topic_template", "_partition", "_offset", "_throttle_key", "_hash", "_alert", "_keep", "_sd", "_source", "_syslog_severity", "_file_suffix", 'kubernetes.labels."pod-template-hash"', "@source_type", "@metadata"]
      encoding.codec = "json"
      framing.method = "newline_delimited"
      # ._file_suffix = to_int(to_int(now()) / 300)
      path = "/var/lib/vector/s3sync/out_kafka_access_file/topics/{{ _topic }}/year=%Y/month=%m/day=%d/hour=%H/${HOSTNAME}.pa2-par-gc-int-ves-io_{{ _file_suffix }}.json.gz"
      idle_timeout_secs = 30
      buffer.type = "memory"
      buffer.max_events = 3000    # default 500 with memory buffer

Version

0.37.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@fpytloun fpytloun added the type: bug (A code related bug) label Apr 29, 2024
@jszwedko
Member

Hi @fpytloun ,

Thanks for filing this! I think I'm missing what the specific request is though. Is it to generally improve the throughput of the file sink? Or to make some specific modification to it? I can think of a few modifications that could be made:

@jszwedko jszwedko added the type: enhancement, sink: file, and domain: performance labels and removed the type: bug label Apr 29, 2024
@fpytloun
Contributor Author

fpytloun commented May 1, 2024

Hello @jszwedko, so I was thinking of this as a bug, because this component, which should be very simple, can easily apply backpressure (even when using tmpfs) and limit more complex components (Kafka, Elasticsearch, ClickHouse, etc.) that can handle much higher throughput.

I think batching and concurrency are what would fix it, and I currently don't know of any workaround. Possibly I could use aws_s3 now, which will consume a lot of memory due to in-memory batching, but probably about the same amount as the tmpfs I am already using with the file sink.
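
For illustration, a rough sketch of what that aws_s3 workaround could look like (the bucket name and the batch/request values below are placeholders, not a tested configuration):

[sinks.out_kafka_access_s3]
      type = "aws_s3"
      inputs = ["remap_kafka_access"]
      bucket = "my-log-archive"    # placeholder bucket name
      key_prefix = "topics/{{ _topic }}/year=%Y/month=%m/day=%d/hour=%H/"
      compression = "gzip"
      encoding.codec = "json"
      framing.method = "newline_delimited"
      batch.max_bytes = 10000000       # batches are accumulated in memory before upload
      batch.timeout_secs = 300
      request.concurrency = "adaptive" # requests are issued concurrently

Unlike the file sink, this batches in memory and issues concurrent requests, which is exactly the behaviour that is missing here.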

@jszwedko
Member

jszwedko commented May 1, 2024

> Hello @jszwedko, so I was thinking of this as a bug, because this component, which should be very simple, can easily apply backpressure (even when using tmpfs) and limit more complex components (Kafka, Elasticsearch, ClickHouse, etc.) that can handle much higher throughput.
>
> I think batching and concurrency are what would fix it, and I currently don't know of any workaround. Possibly I could use aws_s3 now, which will consume a lot of memory due to in-memory batching, but probably about the same amount as the tmpfs I am already using with the file sink.

Makes sense, thanks for confirming. I'll adjust this issue to be about adding concurrent batching to the sink.
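
Purely as an illustration of the request (none of these options exist on the file sink today), the enhancement could surface settings similar to other sinks, for example:

[sinks.out_kafka_access_file]
      type = "file"
      # ...existing options as in the configuration above...
      batch.max_events = 3000           # hypothetical: accumulate events per partitioned path
      batch.timeout_secs = 30           # hypothetical: flush incomplete batches after a timeout
      request.concurrency = "adaptive"  # hypothetical: encode/compress/write batches concurrently

How batching would interact with per-path partitioning and idle_timeout_secs is an open design question.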

@jszwedko jszwedko changed the title from "File sink 100% utilization, no concurrency?" to "Add concurrent batching to the file sink" May 1, 2024