We might benefit from a "network should continue working in cases that resemble syn_cookie or syn_flood" test... #120979

jayunit100 · 2023-10-03T11:03:43Z

What happened?

There are kernel bugs such as https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.4/+bug/1981658, which we've found on Ubuntu versions, that are only visible when a node is pushed to the point where it "switches" the way it processes packets to the syn_cookie kernel path.

Looking at https://access.redhat.com/solutions/30453, it doesn't look like we really want this to be a normal scenario:

        Note, that syncookies is fallback facility.
        It MUST NOT be used to help highly loaded servers to stand
        against legal connection rate. If you see SYN flood warnings
        in your logs, but investigation shows that they occur
        because of overload with legal connections, you should tune 
        another parameters until this warning disappear.
        See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
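
To see where a given node stands before trying to provoke the fallback, the tunables named in the note above can be read straight from /proc/sys. This is a minimal sketch, not an existing e2e helper: the function name `read_sysctls` is made up for illustration, and `net.core.somaxconn` is added to the list as an assumption because it also caps the accept queue length.

```python
from pathlib import Path

# Tunables named in the quoted note, plus tcp_syncookies itself and
# somaxconn (an addition for illustration). Paths are relative to /proc/sys.
SYSCTLS = [
    "net/ipv4/tcp_syncookies",
    "net/ipv4/tcp_max_syn_backlog",
    "net/ipv4/tcp_synack_retries",
    "net/ipv4/tcp_abort_on_overflow",
    "net/core/somaxconn",
]


def read_sysctls(root="/proc/sys"):
    """Return {name: int or None}; None means the file was missing or unreadable."""
    values = {}
    for rel in SYSCTLS:
        try:
            values[rel] = int((Path(root) / rel).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[rel] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        print(f"{name.replace('/', '.')} = {value}")
```

Run on each node (or from a privileged pod that can see the host's /proc/sys), this gives a baseline to compare against when a node does or does not fall back to syncookies.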

Status
I ran our existing e2es (iperf, networkpolicies, and sig-network tests) at high concurrency and wasn't able to force a SYN flood/syncookie fallback at any point, but I know of clusters where kubelets have failed due to that path in the kernel.

Goal
It would be nice if we had a sig-network e2e that simulated this. Feel free to close if one of the existing tests already does this.

This may or may not be possible. I THINK the way to do this would be to fire off lots of TCP connections and somehow keep them open for a long time (e.g. maybe by serving very large payloads?).

What did you expect to happen?

Kubernetes sig-net or similar e2es would be able to simulate overloaded services/endpoints where non-normal TCP connection handling starts happening.

How can we reproduce it (as minimally and precisely as possible)?

Not sure, that's the purpose of this test :). But I suppose we might be able to reproduce SYN floods if there are too many TCP connection requests coming in at a given time, in parallel (see the RH article linked above).

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)


jayunit100 · 2023-10-03T11:03:51Z

/sig network

k8s-ci-robot · 2023-10-03T11:03:51Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jayunit100 · 2023-10-03T11:04:29Z

/good-first-issue

Since this is experimental, I'm thinking it would be a good exploratory issue, even if all we did was write a shell script or something that simulated this as a blog post, and didn't commit it to core k/k.

k8s-ci-robot · 2023-10-03T11:04:30Z

@jayunit100:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Since this is experimental, I'm thinking it would be a good exploratory issue, even if all we did was write a shell script or something that simulated this as a blog post, and didn't commit it to core k/k.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jayunit100 · 2023-10-03T11:04:41Z

/sig scalability

MeenuyD · 2023-10-03T12:26:47Z

Hello @jayunit100 I would like to work on this issue

jayunit100 · 2023-10-03T13:40:59Z

OK, I suppose this would be experimental for now. If you're able to build a cluster on Ubuntu kernel version 5.4.0-123, that might allow you to simulate this type of corner case.

We also need to hear from others whether they agree such a test would be worthwhile and, if so, whether k/k is the right place for it.

aojea · 2023-10-03T20:01:40Z

The problem is that the test will depend a lot on the environment, and you may hit other bottlenecks or impact other things ... this can only run in a very specific and controlled environment to be reliable.

jayunit100 · 2023-10-03T20:59:50Z

Ya, in my initial attempt I hit IP exhaustion before I could hit a stack trace :). I'm OK closing this issue if folks feel like it's not reproducible.

I felt like maybe there's a clever way to do this that I haven't thought of though?

uablrek · 2023-10-10T10:11:59Z

In Go, the backlog parameter to "listen(2)" is taken from /proc/sys/net/core/somaxconn. A way to provoke overflow in Go might be to set this value and tcp_max_syn_backlog to a low value and use a deliberately slow TCP server.

Still, as @aojea says, this depends a lot on the environment.
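
The "slow server with a tiny backlog" idea can be tried on loopback before involving a cluster. The sketch below is an illustration only: it is written in Python rather than Go, so the backlog is passed to `listen()` directly instead of being read from somaxconn (the kernel still caps it at somaxconn), and the function name `fill_accept_queue` and its defaults are invented for this example. The listener never calls `accept()`, which is the extreme case of a slow server.

```python
import socket


def fill_accept_queue(attempts=8, backlog=1, timeout=0.2):
    """Open `attempts` connections to a loopback listener that never accepts.

    Returns (completed, failed): handshakes that finished versus connects
    that timed out or errored once the queues were full.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(backlog)         # deliberately tiny accept queue
    addr = listener.getsockname()

    clients, completed, failed = [], 0, 0
    try:
        for _ in range(attempts):
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.settimeout(timeout)
            clients.append(c)        # keep it open so its queue slot stays occupied
            try:
                c.connect(addr)
                completed += 1
            except OSError:          # includes the timeout raised when a SYN is dropped
                failed += 1
    finally:
        for c in clients:
            c.close()
        listener.close()
    return completed, failed


if __name__ == "__main__":
    done, dropped = fill_accept_queue()
    print(f"completed={done} failed={dropped}")
```

How the attempts split between the two counts depends on the kernel version and on the sysctls mentioned above, which is exactly the environment dependence raised in this thread. A run like this only fills the accept queue of one socket; whether it also pushes the kernel onto the syncookie path is something to verify, not assume.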

jayunit100 · 2023-10-27T15:19:32Z

Yes. I don't think we need to go down to the level of reproducing it reliably. But it feels like it would be nice to have a network flood of some sort. Maybe there's a combination of e2es which can do that as is?

I'm OK to open or close this one. I think leaving it open until the bot closes it is OK, in case someone decides they have an idea for it.

anshikavashistha · 2024-01-01T09:15:31Z

@jayunit100 May I work on this issue?

jayunit100 · 2024-02-23T06:46:56Z

Sure, but see the earlier comments, as it will be hard to reproduce.

See if you can detect a SYN flood on your Ubuntu box by running something locally.

If you can, then the next step would be to containerize it and see if you can get it to trigger when hitting a pod instead of hitting localhost.

I think it would be a good experiment to explore either way and write up the results here.

Lars has an idea that sounds like a good first pass.
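
For the "detect it locally" step, one option is to snapshot the TcpExt counters in /proc/net/netstat before and after a load run and see whether the syncookie or listen-queue counters moved. This is a sketch under assumptions: the helper names (`parse_tcpext`, `watched_delta`, `snapshot`) are invented here, and the set of counters worth watching is a guess at what is relevant, not something confirmed in this issue.

```python
from pathlib import Path

# TcpExt counters that are expected to move when the listen path is under pressure.
WATCHED = ("SyncookiesSent", "SyncookiesRecv", "SyncookiesFailed",
           "ListenOverflows", "ListenDrops")


def parse_tcpext(text):
    """Parse /proc/net/netstat content into {counter_name: int} for the TcpExt rows."""
    lines = text.splitlines()
    # The file is laid out in pairs of lines: a row of names, then a row of values.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            nums = values.split()[1:]
            return {n: int(v) for n, v in zip(names, nums)}
    return {}


def watched_delta(before, after):
    """Return how much each watched counter grew between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in WATCHED}


def snapshot(path="/proc/net/netstat"):
    """Read and parse the live counters (Linux only)."""
    return parse_tcpext(Path(path).read_text())
```

Usage would be `before = snapshot()`, run the load, `after = snapshot()`, then inspect `watched_delta(before, after)`. A non-zero SyncookiesSent delta suggests the kernel took the syncookie path during the run. The counters are read from the network namespace of the calling process, so inside a pod they may not reflect the host.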

jayunit100 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 3, 2023

k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 3, 2023

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 3, 2023

k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Oct 3, 2023

k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Oct 3, 2023
