Vector Pod is not recovering from brief network disruptions after restarting CNI Pod #20337
Comments
Thanks @hanley-patrick, that is some odd behavior. How long is forever? I wonder if we are missing some timeout configuration on the AWS SDK operations that could be used to time out and retry. I'm not seeing us using https://docs.rs/aws-config/latest/aws_config/timeout/struct.TimeoutConfig.html but I'm unsure what the defaults are (if any).
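Independent of the SDK-level `TimeoutConfig` linked above, the general shape being suggested (bound each attempt with a deadline, then retry instead of waiting forever) can be sketched with just the Rust standard library. This is an illustration only, not Vector's actual code: `call_with_timeout` and its parameters are made-up names for the sketch.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Illustrative sketch: run a (possibly hanging) operation with a per-attempt
// timeout and a bounded number of retries. A real fix would configure the AWS
// SDK's TimeoutConfig instead; this only demonstrates the timeout-and-retry
// shape that the comment above proposes.
fn call_with_timeout<T, F>(op: F, timeout: Duration, max_attempts: u32) -> Option<T>
where
    T: Send + 'static,
    F: Fn() -> T + Send + Clone + 'static,
{
    for _ in 0..max_attempts {
        let (tx, rx) = mpsc::channel();
        let op = op.clone();
        // Run the attempt off-thread so a hang cannot block the caller.
        thread::spawn(move || {
            let _ = tx.send(op());
        });
        if let Ok(result) = rx.recv_timeout(timeout) {
            return Some(result);
        }
        // Attempt timed out (e.g. a stale/hung connection); abandon it and retry.
    }
    None
}
```

With this shape, a request stuck on a dead connection is abandoned after `timeout` rather than parking the polling loop indefinitely.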
I first noticed this problem on a Thursday. When I checked the logs (the log level was INFO at the time), the last log message was from Monday (when the CNI pod restarted). Three days went by without any messages in the logs, and no messages were being polled from SQS. So it seems like Vector gets permanently stuck in a bad state, as if it loses the connection and never tries to reconnect. When I turn on debug/trace logging, I see a lot of messages (like the Vector beep and utilization statistics) but nothing related to the S3 source. So Vector itself is still running and seems healthy; it has just given up on its S3 source.
Gotcha, yeah, I see. I'm not sure if it would necessarily fix it or not, but I think a first step would be to try to add timeouts to our use of the AWS SDK.
I'm wondering if it's possibly related to hyperium/hyper#2312. From the logs, it seems like hyper just hangs. Given that restarting Vector fixes the issue, it would make sense if hyper is hanging on to a stale connection. Adding AWS timeouts as a first step makes sense. If that doesn't fix things, I'm not sure if there are other potential workarounds that could be done from the Vector side?
@jszwedko this issue looks similar to the bug described in #20017. Notably, that bug report mentioned "this causes vector to stop polling from sqs (loop still waiting for a future to end before requesting again)", which is similar to what I am seeing. It looks like a potential fix is in progress (#20120).
Problem
Vector is running in Kubernetes. There is an S3 source, which is notified of new objects by polling SQS. The sink is the local filesystem. Each node in the Kubernetes cluster has a CNI networking pod running as an agent to handle pod-to-pod and pod-to-internet connectivity. When the CNI pod running on the same host as Vector is restarted, the Vector pod has a brief blip in network connectivity (expected), but then it never recovers (unexpected). It stops polling SQS and reading from S3. Nothing in the logs indicates there is any problem.
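For concreteness, the topology described above (an `aws_s3` source notified via SQS, writing to a `file` sink) might be configured roughly as below. The component names, region, queue URL, and path are placeholders for illustration, not the actual configuration from this report:

```toml
[sources.s3_logs]
type = "aws_s3"
region = "us-east-1"  # placeholder region

[sources.s3_logs.sqs]
# Placeholder queue URL: S3 event notifications arrive here.
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"

[sinks.local_files]
type = "file"
inputs = ["s3_logs"]
path = "/var/log/vector/s3-%Y-%m-%d.log"  # placeholder path
encoding.codec = "json"
```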
I am able to reproduce this with trace logging enabled. Before restarting the CNI pod, I see a lot of log messages that say ReceivingMessage from SQS. But after I restart the CNI pod, there are no log messages, and no errors either.
I can monitor the packets being routed through the CNI pod, and I don't see any coming from the Vector pod. The CNI pod does not appear to be blocking or dropping any packets or connectivity; it looks like Vector just isn't trying anymore.
Restarting the Vector pod fixes it: Vector starts working properly again and receives the SQS messages it missed while it was stuck.
Have you seen anything like this before?
Version
0.37.1
Additional Context
Vector is running in Kubernetes, with a CNI networking pod on each node.