
Add --condition #276

Draft · felipecrs wants to merge 10 commits into master

Conversation

@felipecrs commented Aug 25, 2023

Usage:

stern . --condition=ready=false

TODOs:

  • Only start tailing logs for pods matching --condition
  • Stop tailing logs for pods that no longer match --condition

Fixes #213

@felipecrs (Author) commented Aug 25, 2023

I'm giving #213 a go because I really need it. But I'm struggling with the second TODO. Is there a chance someone can help me?

PS: code is not yet ready for final review, but it's working. Also, I have almost zero experience with Golang.

@felipecrs changed the title from "Add --condition condition-name[=condition-value]" to "Add support for --condition" on Aug 25, 2023
@superbrothers (Member)

I will look into it this weekend.

@felipecrs (Author)

Thanks a lot @superbrothers!

@superbrothers (Member)

It's not finished yet, but I have a working patch: 816c896. I won't have much time this weekend, but I will be able to work on it after that.

You can try it out as soon as you build it.

@felipecrs (Author) commented Sep 8, 2023

@superbrothers, thanks a lot, you are the best!

However, it's not working quite as I expected. Let me also provide a testing scenario:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  serviceName: test
  template:
    metadata:
      labels:
        app: test
    spec:
      terminationGracePeriodSeconds: 1
      containers:
      - name: test
        image: bash
        command:
          - bash
          - -c
          - |
            count=0
            while true; do
              echo "Hello World: $count"
              count=$((count+1))
              if [[ $count -eq 10 ]]; then
                touch /tmp/healthy
              fi
              sleep 1
            done
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 10
          periodSeconds: 1
EOF

This pod becomes ready in about 11 seconds.

Start the stern process with:

./dist/stern '.*' --condition ready=false --tail 0

Stern correctly filters out the pod if the pod is already ready, but Stern does not stop tailing the logs when the pod becomes ready:

(screen recording: WindowsTerminal_lMhjxZZGe5.mp4)

@felipecrs (Author)

Ok, I think I know what's going on: K8s does not seem to emit an event for when the pod becomes ready. I'm investigating what can be done about it.

@felipecrs (Author) commented Sep 8, 2023

Checking kubectl wait's source code, it looks like they invoke an additional watcher for each resource until the condition is met:

https://github.com/kubernetes/kubectl/blob/0266cec8bce880387a33c4d948673749defeb0bc/pkg/cmd/wait/wait.go#L530

I wonder if and how we could do something similar:

If the pod initially matches the condition and gets added, we invoke this additional watcher for that pod until the opposite condition is met. Once the opposite condition is met, we add the pod to deleted.

(I already tried to tinker with it myself without much success)
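
For illustration, here is a minimal sketch of that per-pod watcher idea, assuming a client-go clientset; the function name and wiring are hypothetical and not taken from Stern's or kubectl's code:

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForReadyStatus watches a single pod until its Ready condition reaches
// the wanted status, similar in spirit to kubectl wait's per-resource watcher.
func waitForReadyStatus(ctx context.Context, client kubernetes.Interface, ns, name string, want corev1.ConditionStatus) error {
	w, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == want {
				// The opposite condition is met; the caller could now treat
				// the pod as deleted and stop tailing its logs.
				return nil
			}
		}
	}
	return fmt.Errorf("watch closed before %s/%s reached Ready=%s", ns, name, want)
}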

@tksm (Contributor) commented Sep 9, 2023

Hi,

It appears that Stern has detected the events, but they could be filtered out by targetFilter.shouldAdd().

The following patch may resolve the issue, but I've noticed the behavior of --condition ready=false can be somewhat confusing. When we delete pods, the pod's Ready status reverts to False, causing the logs to restart from the beginning. Additionally, the Ready status can change due to other reasons.

diff --git a/stern/stern.go b/stern/stern.go
index c492519..dcece50 100644
--- a/stern/stern.go
+++ b/stern/stern.go
@@ -222,7 +222,7 @@ func Run(ctx context.Context, config *Config) error {
 					if !ok {
 						continue
 					}
-					cancel.(func())()
+					cancel.(context.CancelFunc)()
 				case <-nctx.Done():
 					return nil
 				}
diff --git a/stern/target.go b/stern/target.go
index 553e8c2..cbc6788 100644
--- a/stern/target.go
+++ b/stern/target.go
@@ -136,8 +136,14 @@ OUTER:
 			Container: c.Name,
 		}
 
+		if !conditionFound {
+			visitor(t, false)
+			f.forget(string(pod.UID))
+			continue
+		}
+
 		if f.shouldAdd(t, string(pod.UID), c) {
-			visitor(t, conditionFound)
+			visitor(t, true)
 		}
 	}
 }

The demo below illustrates that the logs restart when the pod is deleted (at 00:18).

(asciicast recording)

@felipecrs (Author)

Wow, it looks to be working indeed as per your demo. Can't wait to try your patch out.

BTW the behavior you described is indeed misleading.

That would not happen though when using --tail=0, which is my use case (only capture live logs, not past logs).

I will think about what can be done to circumvent this.

Maybe Stern could remember up to which point it has tailed logs for a given pod, and resume from that point when it starts tailing the pod again.
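
As a rough sketch of that idea (the tracker type and field names are hypothetical, not existing Stern code), something like this could record how far each pod has been tailed and build log options that resume from there:

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// resumeTracker remembers the timestamp of the last log line printed per pod,
// so a later tail can resume from that point instead of replaying history.
type resumeTracker struct {
	mu       sync.Mutex
	lastSeen map[string]metav1.Time // pod UID -> timestamp of the last printed line
}

func (r *resumeTracker) record(uid string, t metav1.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastSeen[uid] = t
}

// logOptions returns PodLogOptions that resume after the last recorded line.
func (r *resumeTracker) logOptions(uid string, base corev1.PodLogOptions) corev1.PodLogOptions {
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.lastSeen[uid]; ok {
		base.SinceTime = &t
		base.Timestamps = true // timestamps are needed to keep updating lastSeen
	}
	return base
}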

@tksm (Contributor) commented Sep 9, 2023

@felipecrs Oh, sorry. I missed the --tail=0 option. Thank you for pointing it out.

As you mentioned, Stern behaves as expected when using --tail=0.

I believe it would be better to automatically set --tail=0 when the --condition option is specified in order to avoid confusion.

(asciicast recording)

@felipecrs (Author)

@tksm really amazing, indeed it works like a charm. I added both @superbrothers' commit and your patch to this PR now.

Something I believe I need to do before merging is to fix the condition syntax for the config file. I left it as a string there, which makes it exactly the same as the CLI, but perhaps I should convert it to an object like:

- condition: ready=false
+ condition:
+   name: ready
+   value: false

What do you think? Personally, I don't mind. Even the first one looks good to me.

Another thing to think about is the concern @tksm raised above about not using --tail=0. I proposed one solution above, but I'm not sure it's worth implementing. Another option is to simply acknowledge this caveat somewhere, maybe in the README.

@tksm (Contributor) commented Sep 10, 2023

I'm glad to hear it worked as expected. 🎉

Config file

I believe that the first one (condition: ready=false) is better because it maintains consistency with other options. Additionally, we do not need to implement anything in that case.

--tail=0 option

For the initial implementation, I think that the caveat in README.md and the error message when --tail=0 is not set are sufficient. Maybe we should avoid automatically setting --tail=0 for future compatibility.

# Raise an error when `--tail=0` is not set
$ ./dist/stern '.*' --condition ready=false
Error: Currently, --condition and --no-follow=false only work with --tail=0
Usage:
  stern pod-query [flags]
diff --git a/cmd/cmd.go b/cmd/cmd.go
index 35e3060..2c70027 100644
--- a/cmd/cmd.go
+++ b/cmd/cmd.go
@@ -130,6 +130,9 @@ func (o *options) Validate() error {
 	if o.selector != "" && o.resource != "" {
 		return errors.New("--selector and the <resource>/<name> query can not be set at the same time")
 	}
+	if o.condition != "" && !o.noFollow && o.tail != 0 {
+		return errors.New("Currently, --condition and --no-follow=false only work with --tail=0")
+	}
 
 	return nil
 }

@felipecrs (Author)

I'm sorry @tksm, but why tie enforcing --tail=0 to --no-follow=false? Maybe you meant --no-follow=true, where it makes more sense: with --no-follow=true, no live logs would be shown, only past logs, so filtering only live logs with --tail=0 makes no sense.

I changed it in the commit.

@felipecrs (Author) commented Sep 10, 2023

I also just realized that --condition=ready=false will filter out pods which do not have a health check, like Jobs.

For example, with --condition=ready=false, logs for a pod like the one below will not be shown:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-two
spec:
  completions: 1
  template:
    metadata:
      labels:
        app: test-two
    spec:
      terminationGracePeriodSeconds: 1
      restartPolicy: OnFailure
      containers:
      - name: test-two
        image: bash
        command:
          - bash
          - -c
          - |
            count=0
            while true; do
              echo "World Hello: $count"
              count=$((count+1))
              if [[ $count -eq 10 ]]; then
                exit 1
              fi
              sleep 1
            done

Similarly, for a pod which does not have a readinessProbe:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-two
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-two
  serviceName: test-two
  template:
    metadata:
      labels:
        app: test-two
    spec:
      terminationGracePeriodSeconds: 1
      containers:
      - name: test-two
        image: bash
        command:
          - bash
          - -c
          - |
            count=0
            while true; do
              echo "World Hello: $count"
              count=$((count+1))
              if [[ $count -eq 10 ]]; then
                exit 0
              fi
              sleep 1
            done

My use case (#213) was to filter out only healthy pods, and simple conditions like that cannot fulfil my needs.

The only solution I have thought of to circumvent this is --only-condition-pods-with-readiness. But:

  1. The flag name is ugly as hell (don't you think? or do you have better suggestions?)
  2. I could just simplify this whole thing to --only-unhealthy-pods and drop --condition entirely.

Feedback is appreciated.
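
For illustration, a minimal sketch of the check such a flag could perform (the helper name is hypothetical and mirrors the proposed flag; it is not part of this PR's code):

import corev1 "k8s.io/api/core/v1"

// hasReadinessProbe reports whether any container in the pod defines a
// readinessProbe; a flag like --only-condition-pods-with-readiness could
// apply the ready=false condition only to pods where this returns true.
func hasReadinessProbe(pod *corev1.Pod) bool {
	for _, c := range pod.Spec.Containers {
		if c.ReadinessProbe != nil {
			return true
		}
	}
	return false
}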

@felipecrs (Author) commented Sep 10, 2023

Another thing, @tksm: I think allowing people to specify a --tail greater than 0 can be useful. Imagine a situation like:

I want to display logs only for pods which are not ready, but also show their last 3 lines from before they became unhealthy.

@felipecrs changed the title from "Add support for --condition" to "Add --condition" on Sep 10, 2023
@felipecrs (Author) commented Sep 10, 2023

Ok, now it fulfils my use case:

stern . --condition=ready=false --tail=0 --only-condition-pods-with-readiness

Please let me know what you think about it. And, as always, thanks for the help.

@felipecrs (Author)

OK, I have been soak-testing this internally throughout the day, and the results are really good for my use case.

I would like to ask the maintainers to give me the go ahead on the current solution (--only-condition-pods-with-readiness) before I start cleaning up the PR and working on fixing the tests.

@tksm (Contributor) commented Sep 13, 2023

@felipecrs Let me clarify the conditions under which you want to view pod logs in your use case.

I assume you do NOT want to view logs when:

  1. The pod has at least one readinessProbe OR ReadinessGate, AND
  2. the pod.status.conditions[] has Ready=True.

Please inform me if there are any additional conditions.

In that case, I believe specifying these conditions using flags might be overly complex.

I came up with another approach with --template.

It may be possible to address this use case with --template, by adding a reference to the latest pod as a template variable and a helper function like podIsReady, instead of adding new flags. What are your thoughts on this approach?

Although templates can be complex, this approach offers the flexibility to handle a wide range of use cases. You can use --template-file if you want to store the template in a separate file.

./dist/stern . --template '
{{- $pod := .Reference.GetPod -}}
{{- $hasReadinessGates := gt (len $pod.Spec.ReadinessGates) 0 -}}
{{- $hasReadinessProbe := false}}
{{- range $c := $pod.Spec.Containers -}}
  {{- if $c.ReadinessProbe -}}
    {{- $hasReadinessProbe = true -}}
  {{- end -}}
{{- end -}}
{{- if not (and (or $hasReadinessProbe $hasReadinessGates) (podIsReady $pod)) -}}
{{- .Message -}}{{"\n"}}
{{- end -}}
'

@felipecrs (Author)

Isn't that a very costly operation? Checking the pod per log line as opposed to only per event?

@felipecrs (Author)

But yes, this set of conditions satisfies my use case, which can be described at a high level as something like:

I want Stern to tail logs from all pods which are not healthy.

If the pod does not have a health checker (readiness probe), then I want to tail its logs since I can't tell whether it's healthy or not.

More precisely, I want to run Stern as a sidecar process to my helm upgrade command within my CI/CD pipeline, and these conditions allow tracking the helm installation as it happens much more sensibly than just showing all pod logs no matter what.

In my use case, I have around 50 pods to tail, but I install them incrementally (one helm chart at a time). Filtering out pods which are healthy ensures I'm not showing logs for pods which were installed by previous helm installations.

That said, I believe more people could leverage the same approach.

If you really believe we should not be too narrow in the implementation, maybe we can invent some kind of --condition-template, which would allow running a template that evaluates to "true" or "false" to decide whether the pod should be tailed. In that case, I believe not even the podIsReady helper function needs to exist, as it can easily be implemented in the template itself.
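
As a rough sketch of how such a --condition-template could be evaluated (the function name and data layout are hypothetical, not an agreed design): render the template against the pod and tail the pod only if the trimmed output is "true".

import (
	"bytes"
	"strings"
	"text/template"

	corev1 "k8s.io/api/core/v1"
)

// podMatchesConditionTemplate renders a hypothetical --condition-template
// with the pod as template data and interprets the trimmed output as a
// boolean that decides whether the pod's logs should be tailed.
func podMatchesConditionTemplate(tmpl *template.Template, pod *corev1.Pod) (bool, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, map[string]interface{}{"pod": pod}); err != nil {
		return false, err
	}
	return strings.TrimSpace(buf.String()) == "true", nil
}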

@tksm (Contributor) commented Sep 13, 2023

Thanks for your detailed explanation!

I understand the use case, and it sounds worth implementing. @superbrothers What do you think?

I prefer creating a dedicated flag for this use case to improve clarity, but I find the term unhealthy in --only-unhealthy-pods somewhat ambiguous. How about using a more specific name like --exclude-readiness-succeeded?

@felipecrs (Author)

I don't mind going the --condition-template approach BTW. :)

@tksm (Contributor) commented Sep 14, 2023

If you don't mind using a complex template, the --condition-template approach looks better since it can be used in a wide range of use cases.

For your reference, if you had a pod variable in a condition template, the template for your use case would look something like this:

{{- $podIsReady := false -}}
{{- range $c := $pod.Status.Conditions -}}
  {{- if and (eq $c.Type "Ready") (eq $c.Status "True") -}}
    {{- $podIsReady = true -}}
  {{- end -}}
{{- end -}}

{{- $hasReadinessProbe := false}}
{{- range $c := $pod.Spec.Containers -}}
  {{- if $c.ReadinessProbe -}}
    {{- $hasReadinessProbe = true -}}
  {{- end -}}
{{- end -}}

{{- $hasReadinessGates := gt (len $pod.Spec.ReadinessGates) 0 -}}

{{- not (and (or $hasReadinessProbe $hasReadinessGates) ($podIsReady))}}

It would be great if it had debug functionality, such as the ability to print a rendered template. For example, I debug my templates by printing variables like below when using --template:

{{- if $skip -}}
[skip] {{.Message}}
{{- else -}}
{{colorGreen .Message}}
{{- end }} / ready={{$podIsReady}}, probe={{$hasReadinessProbe}}, gates={{$hasReadinessGates}}
{{- "\n" -}}
(screenshot of the rendered output)

@guettli (Contributor) commented Apr 23, 2024

BTW, I wrote a small tool to check all conditions in a cluster: https://github.com/guettli/check-conditions. PRs are welcome.
