Improve K8sGPT Error Reporting for CrashLoopBackOff Pods #1059
Comments
We have an optional log analyzer, which is still experimental and risky since logs may contain sensitive data, and it sends those logs to your AI backend. We don't want to expand the Pod analyzer to fetch logs from the pod, because logs are arbitrary to each workload and that adds unnecessary complexity to this analyzer. The goal at some point is to have a way to compound the errors from analyzers and contextualize them, so there is cohesion between them rather than stretching individual analyzers.
@arbreezy Thanks for that. We are unable to use a free OpenAI account with k8sgpt. Can you provide some details about k8sgpt and the AI backend? We created a new OpenAI account and added the API key, and by running the command `k8sgpt analyze` …
Checklist
Affected Components
K8sGPT Version
v0.3.27
Kubernetes Version
v1.26.5
Host OS and its Version
Linux
Steps to reproduce
Run the following Go code as a container:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// arr is empty, so arr[0] panics with "index out of range"
	var arr []int
	fmt.Fprintln(os.Stdout, arr[0])
}
```
Run:

```sh
k8sgpt analyze --filter=Pod
```

You can see the error:

```
default/go-panic-pod(go-panic-pod) - Error: back-off 2m40s restarting failed container=go-panic-container pod=go-panic-pod_default(60286b08-47b6-4e7a-be19-576d3e9e6f5d) - Error: the last termination reason is Error container=go-panic-container pod=go-panic-pod
```
The logs of the pod show the real error:

```sh
kubectl logs go-panic-pod
```

```
panic: runtime error: index out of range [0] with length 0
```
Expected behaviour
When a pod is in CrashLoopBackOff, the Pod analyzer fetches the message

```
back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_
```

from the pod's status. If it instead fetched the error from the pod's logs, we would know exactly what the problem is. In my case I would get:

```
level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"
```

which tells me much more about the cause of the CrashLoopBackOff.

Actual behaviour

When the pod is in CrashLoopBackOff, the error message is fetched from the pod's status, so I cannot see the exact reason why the pod is crash-looping.
Additional Information
Below is a real case:

```sh
kubectl get pod -n tcl-monitoring
```

```
prometheus-prometheus-kube-prometheus-0   1/2   CrashLoopBackOff   704 (3m30s ago)   22d
```

When the pod is in CrashLoopBackOff, it fetches the message below as the error:
```yaml
state:
  waiting:
    message: back-off 5m0s restarting failed container=prometheus pod=prometheus-prometheus-kube-prometheus-0_tcl-monitoring(29368fc9-fa1d-4b3d-9333-241acf0fbece)
    reason: CrashLoopBackOff
```
```sh
k8sgpt analyze --filter=Pod --namespace=tcl-monitoring --explain
```

```
0 tcl-monitoring/prometheus-prometheus-kube-prometheus-0(prometheus-prometheus-kube-prometheus)
```
The pod's logs, however, show the actual failure:

```sh
kubectl logs -n tcl-monitoring prometheus-prometheus-kube-prometheus-0
```

```
ts=2024-04-10T06:59:00.178Z caller=main.go:1180 level=error err="opening storage failed: open /prometheus/wal/00000828: no space left on device"
```
So when the pod is in CrashLoopBackOff, we could get the error from the logs instead of from the status message.