Give an indication in container events for probe failure as to whether the failure was ignored due to FailureThreshold #115823

intUnderflow · 2023-02-16T09:02:13Z

Probes of all kinds currently support FailureThreshold (and SuccessThreshold). These properties allow a user to specify that Kubernetes should not take action in response to a failed probe unless it fails a given number of times in succession.
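To make the threshold behaviour concrete, here is a minimal sketch of consecutive-result counting. It is not the kubelet's code: `thresholdTracker` and `observe` are hypothetical names, and the logic is simplified to a single counter that resets whenever the result flips.

```go
package main

import "fmt"

// thresholdTracker is a hypothetical helper (not a kubelet type) that counts
// consecutive identical probe results.
type thresholdTracker struct {
	failureThreshold int
	successThreshold int
	lastOK           bool
	run              int // length of the current run of identical results
}

// observe records one probe result and reports whether the current run has
// reached the threshold that applies to that kind of result.
func (t *thresholdTracker) observe(ok bool) bool {
	if t.run > 0 && t.lastOK == ok {
		t.run++
	} else {
		t.lastOK = ok
		t.run = 1
	}
	if ok {
		return t.run >= t.successThreshold
	}
	return t.run >= t.failureThreshold
}

func main() {
	t := &thresholdTracker{failureThreshold: 3, successThreshold: 1}
	// Two failures, a success that resets the run, then three failures.
	for i, ok := range []bool{false, false, true, false, false, false} {
		fmt.Printf("probe %d ok=%v thresholdReached=%v\n", i+1, ok, t.observe(ok))
	}
}
```

With a failure threshold of 3, only the last failure in the sequence reaches the threshold; the earlier failures are the ones a user would see events for even though nothing happened.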

This is useful for end-users as it allows them to mitigate the effects of any probes that "flake" by requiring several successive failures before any action is taken.

When a probe fails in Kubernetes, we emit a container event indicating this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/prober.go#L110 and end-users can consume these events via the API for their own purposes. This event is emitted regardless of whether the FailureThreshold has been reached or not.

Currently, when a user consumes a probe failure event, they have no way of knowing whether the event resulted in action on the control plane (because the failure can be ignored due to FailureThreshold, and information on this is not included in the event). This can lead to users assuming there is a problem and that a container/pod was restarted when nothing actually happened.

I think we should expose the keepGoing value from https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/worker.go#L203 in the emitted event somehow. My preferred solution is to emit the probe failure event in the worker rather than where it currently sits in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/prober.go#L110. There is also the option of passing some information down the stack into the prober from the worker (such as making the FailureThreshold/SuccessThreshold decision in the prober), but I'm worried about separation of concerns. Happy to hear what other folks think :)
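As a rough illustration of what the event could say once the worker knows the threshold state, here is a sketch of a message builder. Everything in it is an assumption for discussion: `failureEventMessage`, its parameters, and the message wording are hypothetical and not an existing kubelet API or event format.

```go
package main

import "fmt"

// failureEventMessage is a hypothetical helper that builds the text of a
// probe failure event, including whether FailureThreshold has been reached.
// The wording is illustrative only.
func failureEventMessage(probeType, output string, consecutiveFailures, failureThreshold int) string {
	if consecutiveFailures < failureThreshold {
		return fmt.Sprintf("%s probe failed (%d/%d consecutive failures, below FailureThreshold, no action taken): %s",
			probeType, consecutiveFailures, failureThreshold, output)
	}
	return fmt.Sprintf("%s probe failed (%d/%d consecutive failures, FailureThreshold reached): %s",
		probeType, consecutiveFailures, failureThreshold, output)
}

func main() {
	fmt.Println(failureEventMessage("Liveness", "HTTP 500", 1, 3))
	fmt.Println(failureEventMessage("Liveness", "HTTP 500", 3, 3))
}
```

Putting the counts in the message would let anyone consuming the event tell an ignored failure from one that triggered action, without needing to know the probe spec.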

Also of note is that FailureThreshold/SuccessThreshold is the only filter I can see where a probe result can be ignored after the probe has been run (and has therefore emitted a container event).

I’m happy to write this PR once we’re confident in our approach :)

intUnderflow · 2023-02-16T09:02:31Z

/sig node

intUnderflow · 2023-02-16T09:04:40Z

/cc @RobertKielty

SergeyKanzhelev · 2023-02-21T19:48:19Z

/triage accepted
/priority backlog

This would be an amazing improvement in user experience indeed. Thank you for providing details on how exactly this will be implemented.

Once implemented you can also contribute by updating the probes page https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ to mention this improvement. Perhaps start the troubleshooting probes section on that page.

/assign @intUnderflow

/good-first-issue
/help-wanted
/kind documentation
/kind cleanup

k8s-ci-robot · 2023-02-21T19:48:20Z

@SergeyKanzhelev:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/triage accepted
/priority backlog

This would be an amazing improvement in user experience indeed. Once implemented you can also contribute by updating the probes page https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ to mention this improvement. Perhaps start the troubleshooting probes section on that page.

/assign @intUnderflow

/good-first-issue
/help-wanted
/kind documentation
/kind cleanup

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ashutosh887 · 2023-03-15T03:29:04Z

let me work on this issue now please
/assign

Bharath-Ganesh · 2023-03-30T20:46:49Z

@ashutosh887 Is there any update on the issue? If not, can I take this up? Thanks
/assign

ashutosh887 · 2023-03-31T01:29:11Z

Yes I'm working on it 🙂

shubham-singh-748 · 2023-04-01T09:41:41Z

/assign

g4ze · 2023-06-28T02:53:06Z

/assign

g4ze · 2023-07-02T18:44:35Z

Hey there, I have been stuck on this issue for a while now and just wanted to check whether I'm headed in the right direction.
So apparently, all that is required of the code is to expose the keepGoing bool value returned from the doProbe function and make sure that the probe function now includes the keepGoing value in its return payload, i.e. results.Result.
If I'm correct up to this point, then I assume that, to accommodate the new keepGoing value, we would need to change this code:

kubernetes/pkg/kubelet/prober/prober.go

Lines 101 to 111 in 8ffbbe4

    if err != nil || (result != probe.Success && result != probe.Warning) {
    	// Probe failed in one way or another.
    	if err != nil {
    		klog.V(1).ErrorS(err, "Probe errored", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name)
    		pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe errored: %v", probeType, err)
    	} else { // result != probe.Success
    		klog.V(1).InfoS("Probe failed", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name, "probeResult", result, "output", output)
    		pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe failed: %s", probeType, output)
    	}
    	return results.Failure, err
    }

where we would need to mention the keepGoing value in the pb.recordContainerEvent function call.
Please let me know whether I have understood the issue and the solution well enough.
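One thing to keep in mind with this approach: as described at the top of this issue, the event is emitted in the prober, while the threshold decision is made afterwards in the worker. A sketch of the ordering the original proposal prefers (probe first, threshold decision second, event last) might look like the following. All names here (`probeOutcome`, `worker`, `handle`, the `record` callback) are hypothetical stand-ins, not the kubelet's actual types or signatures.

```go
package main

import "fmt"

// probeOutcome is a hypothetical value returned by the probe step. It carries
// only what the probe observed; it knows nothing about thresholds.
type probeOutcome struct {
	failed bool
	output string
}

// worker is a simplified, hypothetical stand-in for the probe worker. The
// record callback stands in for event recording.
type worker struct {
	failureThreshold int
	failures         int // consecutive failures seen so far
	record           func(message string)
}

// handle applies the threshold first and records the event afterwards, so the
// event can state whether the failure was acted on. It returns true when the
// failure threshold has been reached.
func (w *worker) handle(o probeOutcome) bool {
	if !o.failed {
		w.failures = 0
		return false
	}
	w.failures++
	reached := w.failures >= w.failureThreshold
	w.record(fmt.Sprintf("probe failed: %s (thresholdReached=%v)", o.output, reached))
	return reached
}

func main() {
	w := &worker{failureThreshold: 2, record: func(m string) { fmt.Println(m) }}
	w.handle(probeOutcome{failed: true, output: "timeout"})
	w.handle(probeOutcome{failed: true, output: "timeout"})
}
```

In this shape the probe step stays unaware of thresholds, which is the separation of concerns the issue description was worried about.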

k8s-ci-robot added the needs-sig and needs-triage labels Feb 16, 2023

k8s-ci-robot added the sig/node label and removed the needs-sig label Feb 16, 2023

intUnderflow changed the title ~~Give an indication in container events for probe failure as to whether the container was ignored due to FailureThreshold~~ Give an indication in container events for probe failure as to whether the failure was ignored due to FailureThreshold Feb 16, 2023

k8s-ci-robot assigned intUnderflow Feb 21, 2023

This was referenced Feb 22, 2023

Emit an event when the result of a probe for a container changes #115963

Closed

REQUEST: New membership for intUnderflow kubernetes/org#4043

Closed

k8s-ci-robot assigned ashutosh887 Mar 15, 2023

k8s-ci-robot assigned Bharath-Ganesh Mar 30, 2023

k8s-ci-robot assigned shubham-singh-748 Apr 1, 2023

Bharath-Ganesh removed their assignment Apr 1, 2023

k8s-ci-robot assigned g4ze Jun 28, 2023
