
Node lifecycle controller does not markPodsNotReady when the node Ready state changes from false to unknown #112733

Open
xenv opened this issue Sep 26, 2022 · 20 comments
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@xenv

xenv commented Sep 26, 2022

What happened?

When the kubelet loses its connection, the node goes into the Unknown state. The node lifecycle controller then marks the node's pods as not ready via the markPodsNotReady function, because the health status of the pods can no longer be obtained through the kubelet. However, this only happens when the node's Ready condition transitions from True to Unknown.

However, if the node is already in the False state (for example, because of a containerd failure), markPodsNotReady does not take effect when the node subsequently loses its connection.

case currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue:
    // Report node event only once when status changed.
    controllerutil.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
    fallthrough
case needsRetry && observedReadyCondition.Status != v1.ConditionTrue:
    if err = controllerutil.MarkPodsNotReady(ctx, nc.kubeClient, nc.recorder, pods, node.Name); err != nil {

In this case, the pods may incorrectly remain Ready, which can cause network traffic to keep being forwarded to this node.
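The gap can be seen by modeling the switch above as a small decision function. This is a simplified sketch, not the controller's actual code: the local ConditionStatus type and the shouldMarkPodsNotReady function are illustrative stand-ins for the logic quoted from node_lifecycle_controller.go.

```go
package main

import "fmt"

// ConditionStatus stands in for v1.ConditionStatus from k8s.io/api/core/v1.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// shouldMarkPodsNotReady models the switch quoted above: pods are only
// marked not ready when the observed (previous) Ready condition was True.
// needsRetry is false on a fresh transition, so the first case is the only
// one that can fire then.
func shouldMarkPodsNotReady(observed, current ConditionStatus, needsRetry bool) bool {
	switch {
	case current != ConditionTrue && observed == ConditionTrue:
		// true -> unknown/false: pods get marked not ready.
		return true
	case needsRetry && observed != ConditionTrue:
		// Retry path only; does not fire on a fresh transition.
		return true
	}
	return false
}

func main() {
	// true -> unknown: works as intended.
	fmt.Println(shouldMarkPodsNotReady(ConditionTrue, ConditionUnknown, false)) // true
	// false -> unknown (the case reported in this issue): pods stay Ready.
	fmt.Println(shouldMarkPodsNotReady(ConditionFalse, ConditionUnknown, false)) // false
}
```

Running this shows that the false-to-unknown transition never reaches the MarkPodsNotReady branch.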

What did you expect to happen?

As long as the node has lost its connection for longer than the grace period, MarkPodsNotReady should always take effect.

How can we reproduce it (as minimally and precisely as possible)?

  1. Stop containerd and wait for the node's Ready condition to become False.
  2. Stop the kubelet (or shut down the node) and wait for the node's Ready condition to become Unknown.
  3. The pods on this node that have not been evicted remain Ready indefinitely.

Anything else we need to know?

In the node lifecycle controller logic, MarkPodsNotReady is only triggered when a node goes from the True state to the Unknown state. The correct behavior is to trigger it whenever the node enters the Unknown state, regardless of whether the node's previous state was True.
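One possible shape of such a fix, as a hedged sketch (this is not the change merged upstream; the function name and the added case are illustrative): also take the mark-pods-not-ready path when the node moves from False to Unknown, since once the kubelet stops reporting, pod readiness can no longer be trusted regardless of the node's previous state.

```go
package main

import "fmt"

// ConditionStatus stands in for v1.ConditionStatus from k8s.io/api/core/v1.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// shouldMarkPodsNotReadyFixed adds a case for the false -> unknown
// transition on top of the existing true -> not-true case.
func shouldMarkPodsNotReadyFixed(observed, current ConditionStatus, needsRetry bool) bool {
	switch {
	case current != ConditionTrue && observed == ConditionTrue:
		// Original case: true -> unknown/false.
		return true
	case current == ConditionUnknown && observed == ConditionFalse:
		// Proposed case: false -> unknown. The kubelet has stopped
		// reporting, so pod readiness is stale either way.
		return true
	case needsRetry && observed != ConditionTrue:
		// Retry path, unchanged.
		return true
	}
	return false
}

func main() {
	// false -> unknown now marks pods not ready.
	fmt.Println(shouldMarkPodsNotReadyFixed(ConditionFalse, ConditionUnknown, false)) // true
}
```

The original true-to-unknown behavior is preserved; only the previously missed transition gains coverage.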

Kubernetes version

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.15", GitCommit:"1d79bc3bcccfba7466c44cc2055d6e7442e140ea", GitTreeState:"clean", BuildDate:"2022-09-22T06:03:36Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release

$ uname -a
5.4.119-1-tlinux4-0008 #1 SMP Fri Nov 26 11:17:45 CST 2021 x86_64 x86_64 x86_64 GNU/Linux


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@xenv xenv added the kind/bug Categorizes issue or PR as related to a bug. label Sep 26, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 26, 2022
@xenv xenv changed the title Node lifecycle controller does not markPodsNotReady when the node Ready state changes from fail to unknown Node lifecycle controller does not markPodsNotReady when the node Ready state changes from false to unknown Sep 26, 2022
@answer1991
Contributor

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 26, 2022
@bobbypage
Member

bobbypage commented Sep 26, 2022

Thanks for the report and the simple repro steps. This sounds like a generalized duplicate of #109998

/triage accepted

@akankshakumari393
Member

/assign

@akankshakumari393
Member

akankshakumari393 commented Sep 28, 2022

Is this a good first issue? I would like to work on it.

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Sep 28, 2022
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Sep 28, 2022
@SergeyKanzhelev
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 28, 2022
@SergeyKanzhelev
Member

Is this a good first issue? I would like to work on it.

Yes, I believe the change required is very localized.

@akankshakumari393
Member

/good-first-issue

@k8s-ci-robot
Contributor

@akankshakumari393:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Sep 29, 2022
@MustkimKhatik

Hey, I would like to work on this

@akankshakumari393
Member

@MustkimKhatik I am already working on it, and have raised a PR as well. 🙂

@MustkimKhatik

@MustkimKhatik I am already working on it, and have raised a PR as well. 🙂

Oh alright

@answer1991
Contributor

Is this a good first issue? I would like to work on it.

@akankshakumari393 Hi, I think the node controller has some complex logic, so it's not suitable for a new contributor. I have reviewed your PR and found some broken unit tests. Please close your PR; @xenv and I are trying to fix this issue.

/assign

@maan19

maan19 commented Nov 30, 2022

/assign

Not sure if this has been fixed already; it looks like it has not.

@aojea
Member

aojea commented Dec 2, 2022

There are 3 PRs targeting this bug; please coordinate with the reviewers to avoid duplicating efforts.

@ashutosh887

@aojea Does this need some work?

@lance5890

lance5890 commented May 6, 2023

I have found this problem in node_lifecycle_controller: when a node changes from Ready to NotReady, the DaemonSet pods remain in the Ready status and are not removed from the Endpoints.

@dsxing

dsxing commented May 24, 2023

/assign

@utkarsh-singh1

This issue is still marked as a good first issue, even though it was noted here that:

Hi, I think the node controller has some complex logic, so it's not suitable for a new contributor. I have reviewed your PR and found some broken unit tests. Please close your PR; @xenv and I are trying to fix this issue.

@yylt
Contributor

yylt commented May 10, 2024

/assign

@yylt
Contributor

yylt commented May 10, 2024

Hi all.

Although there have been many commits and I am aware of the earliest issue, it has not been fixed yet. I am truly sorry for the inconvenience.

Therefore, I have made a new commit in an attempt to fix the issue.
