Add concurrent batching to the file sink #20394

Open
fpytloun opened this issue Apr 29, 2024 · 3 comments
Labels
domain: performance (Anything related to Vector's performance) · sink: file (Anything `file` sink related) · type: enhancement (A value-adding code change that enhances its existing functionality)

Comments

@fpytloun
Contributor

fpytloun commented Apr 29, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I can see 100% utilization on the file sink, which then applies backpressure and slows down the whole pipeline. I am using tmpfs, so the disk is not a bottleneck, but high-cardinality partitioning could be. It seems that the file sink does not batch concurrently and therefore applies backpressure quickly (especially with gzip compression).
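
(The 100% figure presumably refers to Vector's internal `utilization` gauge for this sink. As a minimal, untested sketch for anyone reproducing this, the metric can be exposed via the internal_metrics source and prometheus_exporter sink, using the exporter's default address; `vector top` shows the same data:

[sources.vector_metrics]
      type = "internal_metrics"

[sinks.vector_metrics_out]
      type = "prometheus_exporter"
      inputs = ["vector_metrics"]
      address = "0.0.0.0:9598"    # default exporter address; adjust as needed
)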

Configuration

[sinks.out_kafka_access_file]
      type = "file"
      #inputs = ["throttle_kafka_access_tenant"]
      inputs = ["remap_kafka_access"]
      compression = "gzip"
      encoding.except_fields = ["_index", "_topic", "_topic_template", "_partition", "_offset", "_throttle_key", "_hash", "_alert", "_keep", "_sd", "_source", "_syslog_severity", "_file_suffix", 'kubernetes.labels."pod-template-hash"', "@source_type", "@metadata"]
      encoding.codec = "json"
      framing.method = "newline_delimited"
      # ._file_suffix = to_int(to_int(now()) / 300)
      path = "/var/lib/vector/s3sync/out_kafka_access_file/topics/{{ _topic }}/year=%Y/month=%m/day=%d/hour=%H/${HOSTNAME}.pa2-par-gc-int-ves-io_{{ _file_suffix }}.json.gz"
      idle_timeout_secs = 30
      buffer.type = "memory"
      buffer.max_events = 3000    # default 500 with memory buffer

Version

0.37.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@fpytloun fpytloun added the type: bug (A code related bug) label Apr 29, 2024
@jszwedko
Member

Hi @fpytloun ,

Thanks for filing this! I think I'm missing what the specific request is though. Is it to generally improve the throughput of the file sink? Or to make some specific modification to it? I can think of a few modifications that could be made:

@jszwedko jszwedko added the type: enhancement, sink: file, and domain: performance labels and removed the type: bug label Apr 29, 2024
@fpytloun
Contributor Author

fpytloun commented May 1, 2024

Hello @jszwedko, so I was thinking of this as a bug, because this component, which should be very simple, can easily apply backpressure (even when using tmpfs) and limit more complex components (Kafka, Elasticsearch, ClickHouse, etc.) that can handle much higher throughput.

I think batching and concurrency are what would fix it, and I currently don't know of any workaround. Possibly I could use aws_s3 now, which will consume a lot of memory due to in-memory batching, but probably about the same amount as the tmpfs I am already using with the file sink.
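
For illustration, a rough sketch of what that aws_s3 workaround could look like (the bucket name and the batch/request values below are placeholders, not a tested configuration):

[sinks.out_kafka_access_s3]
      type = "aws_s3"
      inputs = ["remap_kafka_access"]
      bucket = "my-log-archive"    # placeholder bucket name
      key_prefix = "topics/{{ _topic }}/year=%Y/month=%m/day=%d/hour=%H/"
      compression = "gzip"
      encoding.codec = "json"
      framing.method = "newline_delimited"
      batch.max_bytes = 10000000       # batches are accumulated in memory before upload
      batch.timeout_secs = 300
      request.concurrency = "adaptive" # requests are issued concurrently

Unlike the file sink, this batches in memory and issues concurrent requests, which is exactly the behaviour that is missing here.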

@jszwedko
Member

jszwedko commented May 1, 2024

> Hello @jszwedko, so I was thinking of this as a bug, because this component, which should be very simple, can easily apply backpressure (even when using tmpfs) and limit more complex components (Kafka, Elasticsearch, ClickHouse, etc.) that can handle much higher throughput.
>
> I think batching and concurrency are what would fix it, and I currently don't know of any workaround. Possibly I could use aws_s3 now, which will consume a lot of memory due to in-memory batching, but probably about the same amount as the tmpfs I am already using with the file sink.

Makes sense, thanks for confirming. I'll adjust this issue to be about adding concurrent batching to the sink.
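
Purely as an illustration of the request (none of these options exist on the file sink today), the enhancement could surface settings similar to other sinks, for example:

[sinks.out_kafka_access_file]
      type = "file"
      # ...existing options as in the configuration above...
      batch.max_events = 3000           # hypothetical: accumulate events per partitioned path
      batch.timeout_secs = 30           # hypothetical: flush incomplete batches after a timeout
      request.concurrency = "adaptive"  # hypothetical: encode/compress/write batches concurrently

How batching would interact with per-path partitioning and idle_timeout_secs is an open design question.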

@jszwedko jszwedko changed the title from "File sink 100% utilization, no concurrency?" to "Add concurrent batching to the file sink" May 1, 2024