We might benefit from a "network should continue working in cases that resemble syn_cookie or syn_flood" test... #120979

Open
jayunit100 opened this issue Oct 3, 2023 · 13 comments
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@jayunit100
Member

What happened?

There are kernel bugs, such as https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.4/+bug/1981658, that we've found on some Ubuntu versions and that are only visible when a node is pushed to the point where it switches the way it processes packets to the syn_cookie kernel path.

Now, looking at https://access.redhat.com/solutions/30453, it doesn't look like we really want this to be a normal scenario:

        Note, that syncookies is fallback facility.
        It MUST NOT be used to help highly loaded servers to stand
        against legal connection rate. If you see SYN flood warnings
        in your logs, but investigation shows that they occur
        because of overload with legal connections, you should tune 
        another parameters until this warning disappear.
        See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.

Status
I ran our existing e2es (iperf, networkpolicies, and sig-network tests) at high concurrency and wasn't able to force a synflood/syncookie fallback at any point, but I know of clusters where kubelets have failed due to that path in the kernel.

Goal
It would be nice if we had a sig-network e2e that simulated this. Feel free to close if one of the existing tests already does this.

This may or may not be possible. I think the way to do this would be to fire off lots of TCP connections and somehow keep them open for a long time (e.g. maybe by serving very large packets?).

What did you expect to happen?

Kubernetes sig-net or similar e2es would be able to simulate overloaded services/endpoints where non-normal TCP connection handling starts happening.

How can we reproduce it (as minimally and precisely as possible)?

Not sure, that's the purpose of this test :). But I suppose we might be able to reproduce SYN floods if there are too many TCP requests coming in at a given time, in parallel (see the RH article linked above).

Anything else we need to know?

No response


@jayunit100 jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2023
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 3, 2023
@jayunit100
Member Author

/sig network

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 3, 2023
@jayunit100
Member Author

/good-first-issue

Since this is experimental, I'm thinking it would be a good exploratory issue, even if all we did was write a shell script or something that simulated this as a blog post and didn't commit it to core k/k.

@k8s-ci-robot
Contributor

@jayunit100:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.


@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 3, 2023
@jayunit100
Member Author

/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Oct 3, 2023
@MeenuyD

MeenuyD commented Oct 3, 2023

Hello @jayunit100, I would like to work on this issue.

@jayunit100
Member Author

OK, I suppose this would be experimental for now. If you're able to build a cluster on Ubuntu kernel version 5.4.0-123, that might allow you to simulate this type of corner case.

We also need to hear from others whether they agree such a test would be worthwhile and, if so, whether k/k is the right place for it.

@aojea
Member

aojea commented Oct 3, 2023

The problem is that the test will depend a lot on the environment, and you may hit other bottlenecks or impact other things. To be reliable, this can only run in a very specific and controlled environment.

@jayunit100
Member Author

Yeah, in my initial attempt I hit IP exhaustion before I could hit a stack trace :). I'm OK closing this issue if folks feel like it's not reproducible.

I felt like maybe there's a clever way to do this that I haven't thought of, though?

@uablrek
Contributor

uablrek commented Oct 10, 2023

In Go, the backlog parameter to listen(2) is taken from /proc/sys/net/core/somaxconn. A way to provoke overflow in Go might be to set this value and tcp_max_syn_backlog to a low value and use a deliberately slow TCP server.

Still, as @aojea says, this depends a lot on the environment.

@jayunit100
Member Author

jayunit100 commented Oct 27, 2023

Yes. I don't think we need to go down to the level of reproducing it reliably. But it feels like it would be nice to have a network flood test of some sort. Maybe there's a combination of e2es that can already do that as is?

I'm OK leaving this open or closing it. I think leaving it open until the bot closes it is fine, in case someone comes up with an idea for it.

@anshikavashistha

@jayunit100 May I work on this issue?

@jayunit100
Member Author

jayunit100 commented Feb 23, 2024

Sure, but see the earlier comments, as it will be hard to reproduce.

See if you can detect a SYN flood on your Ubuntu box by running something locally.

If you can, then the next step would be to containerize it and see if you can get it to trigger when hitting a pod instead of localhost.

I think it would be a good experiment to explore either way, and to write up the results here.

Lars has an idea that sounds like a good first pass.
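For the local detection step, one possible approach (an assumption, not something the thread settled on) is to watch the TcpExt counters that Linux exposes in /proc/net/netstat, such as SyncookiesSent and ListenOverflows, before and after generating load. The sketch below parses that file in Go; `parseNetstat` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNetstat extracts the counters for one section (for example
// "TcpExt:") from the contents of /proc/net/netstat. The file comes in
// pairs of lines: a header line with counter names, then a line of values.
func parseNetstat(data, prefix string) map[string]uint64 {
	out := map[string]uint64{}
	lines := strings.Split(strings.TrimSpace(data), "\n")
	for i := 0; i+1 < len(lines); i += 2 {
		names := strings.Fields(lines[i])
		values := strings.Fields(lines[i+1])
		if len(names) == 0 || names[0] != prefix ||
			len(values) != len(names) || values[0] != prefix {
			continue
		}
		for j := 1; j < len(names); j++ {
			v, err := strconv.ParseUint(values[j], 10, 64)
			if err != nil {
				continue // skip anything that is not an unsigned counter
			}
			out[names[j]] = v
		}
	}
	return out
}

func main() {
	data, err := os.ReadFile("/proc/net/netstat")
	if err != nil {
		fmt.Println("cannot read /proc/net/netstat (Linux only):", err)
		return
	}
	counters := parseNetstat(string(data), "TcpExt:")
	// Counters that move when the listen queue overflows or cookies are used.
	for _, name := range []string{
		"SyncookiesSent", "SyncookiesRecv", "SyncookiesFailed",
		"ListenOverflows", "ListenDrops",
	} {
		fmt.Printf("%s=%d\n", name, counters[name])
	}
}
```

Running this once before and once after the load and comparing the values gives a simple pass/fail signal: if SyncookiesSent or ListenOverflows increased, the node reached the fallback path. The same check could later run inside a container on the node hosting the target pod.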

Projects
None yet
Development

No branches or pull requests

6 participants