
Vector Pod is not recovering from brief network disruptions after restarting CNI Pod #20337

Open
hanley-patrick opened this issue Apr 18, 2024 · 5 comments
Labels
provider: aws (Anything `aws` service provider related) · type: bug (A code related bug)

Comments

@hanley-patrick

hanley-patrick commented Apr 18, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector is running in Kubernetes. There is an S3 source that is notified of new objects by polling SQS; the sink is the local filesystem.

Each node in the Kubernetes cluster runs a CNI networking pod as an agent to handle pod-to-pod and pod-to-internet connectivity. When the CNI pod running on the same host as Vector is restarted, the Vector pod has a brief blip in network connectivity (expected), but then it never recovers (unexpected). It stops polling SQS and reading from S3, and nothing in the logs indicates there is any problem.

I am able to reproduce this with trace logging enabled. Before restarting the CNI pod, I see a lot of log messages for the SQS ReceiveMessage operation. After I restart the CNI pod, there are no more such log messages, and no errors either.

I can monitor the packets being routed through the CNI pod, and I don't see any coming from the Vector pod. The CNI pod does not appear to be blocking or dropping packets; it looks like Vector just isn't trying anymore.

Restarting the Vector pod fixes it: Vector starts working properly again and receives the SQS messages it missed while it was stalled.

Have you seen anything like this before?

Configuration

sinks:
  logs_s3_to_fs:
    encoding:
      codec: text
    inputs:
    - set_log_file_and_message
    path: '{{ filename }}'
    type: file
sources:
  my_s3_source:
    region: us-east-1
    sqs:
      delete_message: true
      poll_secs: 5
      queue_url: my-queue-url
    strategy: sqs
    type: aws_s3
transforms:
  fetch_nested_log_message:
    inputs:
    - remap_to_json
    source: source logic
    type: remap
  format_log_message:
    inputs:
    - fetch_nested_log_message
    source: |-
      source logic
    type: remap
  remap_to_json:
    inputs:
    - my_s3_source
    source: source logic
    type: remap
  set_log_file_and_message:
    inputs:
    - format_log_message
    source: |-
      source logic
    type: remap

Version

0.37.1

Debug Output

One error appears in the logs around the time the CNI pod restarted:
2024-04-18T19:10:34.363359Z ERROR source{component_kind="source" component_id=my-s3-source component_type=aws_s3}: vector::internal_events::aws_sqs::s3: Failed to process SQS message. message_id=1234-asdf-qwert error=Failed to fetch s3://my-s3-bucket/my-key: dispatch failure error_code="failed_processing_sqs_message" error_type="parser_failed" stage="processing" internal_log_rate_limit=true



When healthy, there are lots of debug messages involving SQS. Then these log messages stop abruptly, with no indication why:
2024-04-18T19:10:54.702897Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `read_after_signing` interceptors
2024-04-18T19:10:54.702929Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `modify_before_transmit` interceptors
2024-04-18T19:10:54.702955Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime::client::interceptors: running `read_before_transmit` interceptors
2024-04-18T19:10:54.702973Z DEBUG source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: aws_smithy_runtime_api::client::interceptors::context: entering 'transmit' phase
2024-04-18T19:10:54.703093Z TRACE source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: hyper::client::pool: take? ("https", sqs.us-east-1.amazonaws.com): expiration = Some(90s)
2024-04-18T19:10:54.703114Z DEBUG source{component_kind="source" component_id=my-s3-source component_type=aws_s3}:invoke{service=sqs operation=ReceiveMessage}:try_op:try_attempt: hyper::client::pool: reuse idle connection for ("https", sqs.us-east-1.amazonaws.com)
2024-04-18T19:10:54.703183Z TRACE encode_headers: hyper::proto::h1::role: Client::encode method=POST, body=Some(Unknown)
2024-04-18T19:10:54.703210Z TRACE hyper::proto::h1::encode: sized write, len = 165
2024-04-18T19:10:54.703301Z TRACE hyper::proto::h1::io: buffer.flatten self.len=1810 buf.len=165
2024-04-18T19:10:54.703324Z TRACE hyper::proto::h1::dispatch: no more write body allowed, user body is_end_stream = false
2024-04-18T19:10:54.703420Z DEBUG hyper::proto::h1::io: flushed 1975 bytes
2024-04-18T19:10:54.703468Z TRACE hyper::proto::h1::conn: flushed({role=client}): State { reading: Init, writing: KeepAlive, keep_alive: Busy }
2024-04-18T19:10:55.529050Z TRACE vector: Beep.
2024-04-18T19:10:56.531717Z TRACE vector: Beep.
2024-04-18T19:10:57.526838Z TRACE vector: Beep.
2024-04-18T19:10:57.528221Z DEBUG sink{component_kind="sink" component_id=logs_s3_to_fs component_type=file}: vector::utilization: utilization=0.000000032650971543317326
2024-04-18T19:10:58.526398Z TRACE vector: Beep.
2024-04-18T19:10:59.526552Z TRACE vector: Beep.
2024-04-18T19:11:00.527036Z TRACE vector: Beep.
2024-04-18T19:11:01.527976Z TRACE vector: Beep.
2024-04-18T19:11:02.527171Z TRACE vector: Beep.



The "Vector Utilization" and "Vector Beep" messages repeat indefinitely. No more messages about my Source, or SQS or S3 appear.

Example Data

No response

Additional Context

Vector is running in Kubernetes, with a CNI networking pod on each node.

References

No response

hanley-patrick added the type: bug label on Apr 18, 2024
@jszwedko
Member

Thanks @hanley-patrick, that is some odd behavior. How long is "forever"? I wonder if we are missing some timeout configuration on the AWS SDK operations that could be used to time out and retry. I'm not seeing us use https://docs.rs/aws-config/latest/aws_config/timeout/struct.TimeoutConfig.html, but I'm unsure what the defaults are (if any).
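
For context, here is a minimal sketch of how explicit timeouts could be passed when loading the SDK config, assuming aws-config 1.x (BehaviorVersion) and the aws_sdk_sqs client. This is not Vector's actual code, just an illustration of the TimeoutConfig linked above:

use std::time::Duration;

use aws_config::{timeout::TimeoutConfig, BehaviorVersion};

#[tokio::main]
async fn main() {
    // Hypothetical timeout values: a single hung HTTP request should error out
    // instead of stalling the SQS polling loop indefinitely.
    let timeouts = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(5))            // TCP/TLS connect
        .operation_attempt_timeout(Duration::from_secs(30)) // one attempt, including the long poll
        .operation_timeout(Duration::from_secs(120))        // all retries combined
        .build();

    let sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(timeouts)
        .load()
        .await;

    let sqs = aws_sdk_sqs::Client::new(&sdk_config);
    // With the timeouts above, this call returns an error (and can be retried)
    // rather than waiting forever on a dead pooled connection.
    let _response = sqs
        .receive_message()
        .queue_url("my-queue-url")
        .wait_time_seconds(5)
        .send()
        .await;
}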

@hanley-patrick
Author

I first noticed this problem on a Thursday. When I checked the logs (the log level was INFO at the time), the last log message was from Monday, when the CNI pod restarted. Three days went by without any messages in the logs, and no messages were being polled from SQS. So it seems like Vector gets permanently stuck in a bad state, as if it loses its connection and never tries to reconnect.

When I turn on debug/trace logging, I see a lot of messages (like the "Beep" and utilization statistics) but nothing related to the S3 source. So Vector itself is still running and seems healthy; it has just given up on its S3 source.

@jszwedko
Member

Gotcha, yeah, I see. I'm not sure if it would necessarily fix it or not, but I think a first step would be to try to add timeouts to our use of the AWS SDK.

jszwedko added the provider: aws label on Apr 19, 2024
@jennie-ju

I'm wondering if it's possibly related to hyperium/hyper#2312. From the logs, it seems like hyper just hangs. Given that restarting Vector fixes the issue, it would make sense if hyper is hanging on to a stale connection.

Adding AWS timeouts as a first step makes sense. If that doesn't fix things, are there other potential workarounds that could be applied from the Vector side?
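
For illustration only, here is a sketch of one possible mitigation along those lines, assuming hyper 0.14's client API (this is not how Vector currently builds its HTTP clients): enable TCP keepalive on the connector and shorten the pool's idle timeout so stale connections are evicted rather than reused.

use std::time::Duration;

use hyper::client::HttpConnector;
use hyper::{Body, Client};

fn build_client() -> Client<HttpConnector, Body> {
    // TCP keepalive helps the OS notice that the peer is gone after the
    // network blip; a short pool_idle_timeout drops idle (possibly dead)
    // connections instead of handing them back to the next request.
    let mut connector = HttpConnector::new();
    connector.set_keepalive(Some(Duration::from_secs(30)));

    Client::builder()
        .pool_idle_timeout(Duration::from_secs(20))
        .build(connector)
}

Whether that helps depends on whether the pool actually detects the closed socket, so the AWS SDK timeouts discussed above still seem like the more direct guard.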

@hanley-patrick
Author

@jszwedko this issue looks similar to the bug described in #20017. Notably, that bug report mentions "this causes vector to stop polling from sqs (loop still waiting for a future to end before requesting again)", which is similar to what I am seeing.

It looks like a potential fix is in progress (#20120).
