
Node lifecycle controller does not markPodsNotReady when the node Ready state changes from false to unknown #112733

Open
xenv opened this issue Sep 26, 2022 · 20 comments
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@xenv

xenv commented Sep 26, 2022

What happened?

When the kubelet loses its connection, the node goes into the Unknown state. The node lifecycle controller then marks the node's pods as not ready via the markPodsNotReady function, because the health status of the pods can no longer be obtained through the kubelet. However, this only happens when the node's Ready condition transitions from True to Unknown.

However, if the node is already in the False state (for example, because of a containerd failure), markPodsNotReady does not take effect when the node subsequently loses its connection.

case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
    // Report node event only once when status changed.
    controllerutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
    fallthrough
case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
    if err = controllerutil.MarkPodsNotReady(ctx, nc.kubeClient, nc.recorder, pods, node.Name); err != nil {

In this case, the pods may incorrectly remain Ready, which can cause network traffic to keep being forwarded to this node.
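The gap can be seen by modeling the switch above as a small decision function. This is a simplified sketch, not the controller's actual code: the local ConditionStatus type and the shouldMarkPodsNotReady function are illustrative stand-ins for the logic quoted from node_lifecycle_controller.go.

```go
package main

import "fmt"

// ConditionStatus stands in for v1.ConditionStatus from k8s.io/api/core/v1.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// shouldMarkPodsNotReady models the switch quoted above: pods are only
// marked not ready when the observed (previous) Ready condition was True.
// needsRetry is false on a fresh transition, so the first case is the only
// one that can fire then.
func shouldMarkPodsNotReady(observed, current ConditionStatus, needsRetry bool) bool {
	switch {
	case current != ConditionTrue && observed == ConditionTrue:
		// true -> unknown/false: pods get marked not ready.
		return true
	case needsRetry && observed != ConditionTrue:
		// Retry path only; does not fire on a fresh transition.
		return true
	}
	return false
}

func main() {
	// true -> unknown: works as intended.
	fmt.Println(shouldMarkPodsNotReady(ConditionTrue, ConditionUnknown, false)) // true
	// false -> unknown (the case reported in this issue): pods stay Ready.
	fmt.Println(shouldMarkPodsNotReady(ConditionFalse, ConditionUnknown, false)) // false
}
```

Running this shows that the false-to-unknown transition never reaches the MarkPodsNotReady branch.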

What did you expect to happen?

As long as the node has lost its connection for longer than the grace period, MarkPodsNotReady should always take effect.

How can we reproduce it (as minimally and precisely as possible)?

  1. Stop containerd and wait for the node's Ready condition to become False.
  2. Stop the kubelet (or shut down the node) and wait for the node's Ready condition to become Unknown.
  3. The pods on this node that have not been evicted remain Ready indefinitely.

Anything else we need to know?

In the node lifecycle controller logic, MarkPodsNotReady is only triggered when a node goes from the True state to the Unknown state. The correct behavior is to trigger it whenever the node enters the Unknown state, regardless of whether the node's previous state was True.
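One possible shape of such a fix, as a hedged sketch (this is not the change merged upstream; the function name and the added case are illustrative): also take the mark-pods-not-ready path when the node moves from False to Unknown, since once the kubelet stops reporting, pod readiness can no longer be trusted regardless of the node's previous state.

```go
package main

import "fmt"

// ConditionStatus stands in for v1.ConditionStatus from k8s.io/api/core/v1.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// shouldMarkPodsNotReadyFixed adds a case for the false -> unknown
// transition on top of the existing true -> not-true case.
func shouldMarkPodsNotReadyFixed(observed, current ConditionStatus, needsRetry bool) bool {
	switch {
	case current != ConditionTrue && observed == ConditionTrue:
		// Original case: true -> unknown/false.
		return true
	case current == ConditionUnknown && observed == ConditionFalse:
		// Proposed case: false -> unknown. The kubelet has stopped
		// reporting, so pod readiness is stale either way.
		return true
	case needsRetry && observed != ConditionTrue:
		// Retry path, unchanged.
		return true
	}
	return false
}

func main() {
	// false -> unknown now marks pods not ready.
	fmt.Println(shouldMarkPodsNotReadyFixed(ConditionFalse, ConditionUnknown, false)) // true
}
```

The original true-to-unknown behavior is preserved; only the previously missed transition gains coverage.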

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-22T06:03:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release

$ uname -a
5.4.119-1-tlinux4-0008 #1 SMP Fri Nov 26 11:17:45 CST 2021 x86_64 x86_64 x86_64 GNU/Linux


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@xenv xenv added the kind/bug Categorizes issue or PR as related to a bug. label Sep 26, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 26, 2022
@xenv xenv changed the title Node lifecycle controller does not markPodsNotReady when the node Ready state changes from fail to unknown Node lifecycle controller does not markPodsNotReady when the node Ready state changes from false to unknown Sep 26, 2022
@answer1991
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 26, 2022
@bobbypage
Member

bobbypage commented Sep 26, 2022

Thanks for the report and the simple repro steps. This sounds like a generalized duplicate of #109998

/triage accepted

@akankshakumari393
Member

/assign

@akankshakumari393
Member

akankshakumari393 commented Sep 28, 2022

Is this a good first issue? I would like to work on it.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Sep 28, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Sep 28, 2022
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 28, 2022
@SergeyKanzhelev
Member

Is this a good first issue? I would like to work on it.

Yes, I believe the change required is very localized.

@akankshakumari393
Member

/good-first-issue

@k8s-ci-robot
Contributor

@akankshakumari393:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Sep 29, 2022
@MustkimKhatik

Hey, I would like to work on this

@akankshakumari393
Member

@MustkimKhatik I am already working on it, and have raised a PR as well. 🙂

@MustkimKhatik

@MustkimKhatik I am already working on it, and have raised a PR as well. 🙂

Oh alright

@answer1991
Contributor

Is this a good first issue? I would like to work on it.

@akankshakumari393 Hi, I think the node controller has some complex logic, so it's not suitable for a new contributor. I have reviewed your PR and found some broken unit tests. Please close your PR; @xenv and I are trying to fix this issue.

/assign

@maan19

maan19 commented Nov 30, 2022

/assign

Not sure if this has been fixed already; it looks like it has not.

@aojea
Member

aojea commented Dec 2, 2022

There are 3 PRs targeting this bug; please coordinate with the reviewers to avoid duplicating efforts.

@ashutosh887

@aojea Does this need some work?

@lance5890

lance5890 commented May 6, 2023

I have found this problem in node_lifecycle_controller: when a node changes from Ready to NotReady, the DaemonSet pods remain in the Ready status and are not removed from the Endpoints.

@dsxing

dsxing commented May 24, 2023

/assign

@utkarsh-singh1

This issue is still marked as a good first issue, even though it was noted here that:

Hi, I think the node controller has some complex logic, so it's not suitable for a new contributor. I have reviewed your PR and found some broken unit tests. Please close your PR; @xenv and I are trying to fix this issue.

@yylt
Contributor

yylt commented May 10, 2024

/assign

@yylt
Contributor

yylt commented May 10, 2024

Hi all.

Although there have been many commits and I am aware of the earliest issue, it has not been fixed yet. I am truly sorry for the inconvenience.

Therefore, I have made a new commit in an attempt to fix the issue.
