
Vector Pod is not recovering from brief network disruptions after restarting CNI Pod #20337

Open
hanley-patrick opened this issue Apr 18, 2024 · 5 comments
Labels
provider: aws (Anything `aws` service provider related) · type: bug (A code related bug)

Comments

@hanley-patrick

hanley-patrick commented Apr 18, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector is running in Kubernetes. There is an S3 source that is notified of new objects by polling SQS; the sink is the local filesystem.

Each node in the Kubernetes cluster runs a CNI networking pod as an agent to handle pod-to-pod and pod-to-internet connectivity. When the CNI pod running on the same host as Vector is restarted, the Vector pod has a brief blip in network connectivity (expected), but then it never recovers (unexpected). It stops polling SQS and reading from S3, and nothing in the logs indicates there is any problem.

I am able to reproduce this with trace logging enabled. Before restarting the CNI pod, I see a lot of log messages for the SQS ReceiveMessage operation. After I restart the CNI pod, there are no more such log messages, and no errors either.

I can monitor the packets being routed through the CNI pod, and I don't see any coming from the Vector pod. The CNI pod does not appear to be blocking or dropping packets; it looks like Vector just isn't trying anymore.

Restarting the Vector pod fixes it: Vector starts working properly again and receives the SQS messages it missed while it was stalled.

Have you seen anything like this before?

Configuration

sinks:
  logs_s3_to_fs:
    encoding:
      codec: text
    inputs:
    - set_log_file_and_message
    path: '{{ filename }}'
    type: file
sources:
  my_s3_source:
    region: us-east-1
    sqs:
      delete_message: true
      poll_secs: 5
      queue_url: my-queue-url
    strategy: sqs
    type: aws_s3
transforms:
  fetch_nested_log_message:
    inputs:
    - remap_to_json
    source: source logic
    type: remap
  format_log_message:
    inputs:
    - fetch_nested_log_message
    source: |-
      source logic
    type: remap
  remap_to_json:
    inputs:
    - my_s3_source
    source: source logic
    type: remap
  set_log_file_and_message:
    inputs:
    - format_log_message
    source: |-
      source logic
    type: remap

Version

0.37.1

Debug Output

One error appears in the logs around the time the CNI pod restarted:
2024-04-18T19:10:34.363359Z ERROR source{component_kind="source" component_id=my-s3-source component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message. message_id=1234-asdf-qwert error=Failed to fetch s3://my-s3-bucket/my-key: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true



When healthy, there are lots of debug messages involving SQS. Then these log messages stop abruptly, with no indication why:
2024-04-18T19:10:54.702897Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `read_after_signing` interceptors
2024-04-18T19:10:54.702929Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `modify_before_transmit` interceptors
2024-04-18T19:10:54.702955Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `read_before_transmit` interceptors
2024-04-18T19:10:54.702973Z DEBUG source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'transmit' phase
2024-04-18T19:10:54.703093Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: hyper::client::pool: take? ("https", sqs.us-east-1.amazonaws.com): expiration = Some(90s)
2024-04-18T19:10:54.703114Z DEBUG source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: hyper::client::pool: reuse idle connection for ("https", sqs.us-east-1.amazonaws.com)
2024-04-18T19:10:54.703183Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=POST, body=Some(Unknown)
2024-04-18T19:10:54.703210Z TRACE hyper::proto::h1::encode: sized write, len = 165
2024-04-18T19:10:54.703301Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1810 buf.len=165
2024-04-18T19:10:54.703324Z TRACE hyper::proto::h1::dispatch: no more write body allowed, user body is_end_stream = false
2024-04-18T19:10:54.703420Z DEBUG hyper::proto::h1::io: flushed 1975 bytes
2024-04-18T19:10:54.703468Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-04-18T19:10:55.529050Z TRACE vector: Beep.
2024-04-18T19:10:56.531717Z TRACE vector: Beep.
2024-04-18T19:10:57.526838Z TRACE vector: Beep.
2024-04-18T19:10:57.528221Z DEBUG sink{component_kind="sink" component_id=logs_s3_to_fs component_type=file}: vector::utilization: utilization=0.000000032650971543317326
2024-04-18T19:10:58.526398Z TRACE vector: Beep.
2024-04-18T19:10:59.526552Z TRACE vector: Beep.
2024-04-18T19:11:00.527036Z TRACE vector: Beep.
2024-04-18T19:11:01.527976Z TRACE vector: Beep.
2024-04-18T19:11:02.527171Z TRACE vector: Beep.



The "Vector Utilization" and "Vector Beep" messages repeat indefinitely. No more messages about my Source, or SQS or S3 appear.

Example Data

No response

Additional Context

Vector is running in Kubernetes, with a CNI networking pod on each node.

References

No response

hanley-patrick added the type: bug label on Apr 18, 2024
@jszwedko
Member

Thanks @hanley-patrick, that is some odd behavior. How long is "forever"? I wonder if we are missing some timeout configuration on the AWS SDK operations that could be used to time out and retry. I'm not seeing us use https://docs.rs/aws-config/latest/aws_config/timeout/struct.TimeoutConfig.html, but I'm unsure what the defaults are (if any).
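
For context, here is a minimal sketch of how explicit timeouts could be passed when loading the SDK config, assuming aws-config 1.x (BehaviorVersion) and the aws_sdk_sqs client. This is not Vector's actual code, just an illustration of the TimeoutConfig linked above:

use std::time::Duration;

use aws_config::{timeout::TimeoutConfig, BehaviorVersion};

#[tokio::main]
async fn main() {
    // Hypothetical timeout values: a single hung HTTP request should error out
    // instead of stalling the SQS polling loop indefinitely.
    let timeouts = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(5))            // TCP/TLS connect
        .operation_attempt_timeout(Duration::from_secs(30)) // one attempt, including the long poll
        .operation_timeout(Duration::from_secs(120))        // all retries combined
        .build();

    let sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(timeouts)
        .load()
        .await;

    let sqs = aws_sdk_sqs::Client::new(&sdk_config);
    // With the timeouts above, this call returns an error (and can be retried)
    // rather than waiting forever on a dead pooled connection.
    let _response = sqs
        .receive_message()
        .queue_url("my-queue-url")
        .wait_time_seconds(5)
        .send()
        .await;
}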

@hanley-patrick
Author

I first noticed this problem on a Thursday. When I checked the logs (the log level was INFO at the time), the last log message was from Monday, when the CNI pod restarted. Three days went by without any messages in the logs, and no messages were being polled from SQS. So it seems like Vector gets permanently stuck in a bad state, as if it loses its connection and never tries to reconnect.

When I turn on debug/trace logging, I see a lot of messages (like the "Beep" and utilization statistics) but nothing related to the S3 source. So Vector itself is still running and seems healthy; it has just given up on its S3 source.

@jszwedko
Member

Gotcha, yeah, I see. I'm not sure if it would necessarily fix it or not, but I think a first step would be to try to add timeouts to our use of the AWS SDK.

jszwedko added the provider: aws label on Apr 19, 2024
@jennie-ju

I'm wondering if it's possibly related to hyperium/hyper#2312. From the logs, it seems like hyper just hangs. Given that restarting Vector fixes the issue, it would make sense if hyper is hanging on to a stale connection.

Adding AWS timeouts as a first step makes sense. If that doesn't fix things, are there other potential workarounds that could be applied from the Vector side?
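
For illustration only, here is a sketch of one possible mitigation along those lines, assuming hyper 0.14's client API (this is not how Vector currently builds its HTTP clients): enable TCP keepalive on the connector and shorten the pool's idle timeout so stale connections are evicted rather than reused.

use std::time::Duration;

use hyper::client::HttpConnector;
use hyper::{Body, Client};

fn build_client() -> Client<HttpConnector, Body> {
    // TCP keepalive helps the OS notice that the peer is gone after the
    // network blip; a short pool_idle_timeout drops idle (possibly dead)
    // connections instead of handing them back to the next request.
    let mut connector = HttpConnector::new();
    connector.set_keepalive(Some(Duration::from_secs(30)));

    Client::builder()
        .pool_idle_timeout(Duration::from_secs(20))
        .build(connector)
}

Whether that helps depends on whether the pool actually detects the closed socket, so the AWS SDK timeouts discussed above still seem like the more direct guard.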

@hanley-patrick
Author

@jszwedko this issue looks similar to the bug described in #20017. Notably, that bug report mentions "this causes vector to stop polling from sqs (loop still waiting for a future to end before requesting again)", which is similar to what I am seeing.

It looks like a potential fix is in progress (#20120).
