Concourse web nodes restarting with exit code 137 #5505
-
Hello everyone, I have a Concourse instance deployed on Kubernetes using Helm.
The web nodes seem to be fine for a while and then go into a restart loop. When I looked at the pods, it is either the liveness probe or the readiness probe failing, causing restarts with exit code 137. The following are the probes defined for web:
Replies: 13 comments 4 replies
-
Hey @skreddy6673, could you share the logs from the … Thank you!
-
Hi @cirocosta
-
Redeployed with Helm chart 8.2.13 and increased the web node resources. Looks stable for now. I'll have to see how it does on Monday with more pipelines running.
-
Hi @vito @cirocosta One thing I couldn't understand here: once a web node goes into crash looping, there won't be any work allocated to it, right? (Correct me if I'm wrong.) So when it comes back, it should be able to start accepting work. Since there is no task assigned to the web node, it shouldn't fail with an OOM error.
This is what I see on a restarting pod:
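For reference on what the exit code means: 137 is 128 + 9, i.e. the process received SIGKILL. Both the kernel OOM killer and the kubelet (force-killing a container after a failed liveness probe and grace period) produce this same code, so a 137 on its own doesn't prove the pod was OOM-killed. A quick shell demonstration:

```shell
# Exit code 137 = 128 + 9 (SIGKILL): the same code whether the OOM killer
# fired or the kubelet force-killed the container.
sleep 30 &
pid=$!
kill -9 "$pid"        # simulate an OOM kill / forced liveness-probe kill
wait "$pid"
echo "exit code: $?"  # prints "exit code: 137"
```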
-
@cirocosta
-
Hey @skreddy6673, I was just browsing the discussion threads and came across this one. I'll try to answer your question when I get some time next week. Ciro is focusing on some non-Concourse stuff right now and is probably letting his Concourse notifications pile up. Feel free to ping me if I don't come back around here in a few days. I'm trying to get in the habit of checking discussions more but it hasn't sunk in for me yet 😅
-
Hi @taylorsilva My current configuration is:

Is there any way to find the pipelines that are utilizing the most resources?
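I'm not aware of a built-in "top pipelines" report, but one rough way to approximate it is to count active containers per pipeline from `fly containers` output. The column index and the simulated rows below are assumptions; adjust them to match your fly version's actual output before using this against a real target:

```shell
# Simulated `fly containers` rows (handle, worker, pipeline, job, ...).
# In practice, replace the printf with: fly -t <target> containers | tail -n +2
printf '%s\n' \
  'handle1 worker1 pipeline-a job1' \
  'handle2 worker1 pipeline-a job2' \
  'handle3 worker2 pipeline-b job1' |
awk '{ count[$3]++ } END { for (p in count) print count[p], p }' |
sort -rn
# prints:
# 2 pipeline-a
# 1 pipeline-b
```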
-
oh man, so sorry 😞 our plates were super full the last few weeks. Do you have any metrics for your web pods? It would be useful to know whether memory/CPU are spiking around the times you see the liveness probes failing. What version of the chart/Concourse are you using now?
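If pod-level metrics aren't already being collected, one option is Concourse's built-in Prometheus emitter, which the Helm chart can enable. This is a sketch; the exact keys (`concourse.web.prometheus.*` here) may differ across chart versions:

```yaml
# Illustrative values.yaml fragment: expose Concourse's own metrics so
# memory/CPU spikes can be correlated with probe failures.
concourse:
  web:
    prometheus:
      enabled: true
      bindIp: 0.0.0.0
      bindPort: 9391
```

A quicker first check, if metrics-server is installed in the cluster, is running `kubectl top pod` against the web pods around the time a probe fails.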
-
Here is my values file: https://github.com/skreddy6673/concourse-chart/blob/master/v5.7.1.yml
-
CPU utilization on one of the workers when the liveness probe failed:
-
@taylorsilva Events from describing the pod:
-
I think I found the solution.
Now I have another issue I need to work on: #5801
-
@skreddy6673 like @jamieklassen mentioned, increasing the liveness probe timings will probably help. k8s is killing the container because it thinks it's in a bad state when it's probably fine; it's just some workloads taking up a lot of resources.

The `137` exit code is an open issue on the chart. I don't see it cross-posted here, so I'm sharing it now in case it was never surfaced: concourse/concourse-chart#81
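To make the "increase the liveness probe" suggestion concrete, here is a rough sketch of what that could look like in the chart's values file. The key names and the probe path follow what the concourse-chart has exposed, but may differ by chart version, and the timing values are illustrative, not recommendations:

```yaml
# Illustrative values.yaml fragment: give a busy web node more headroom
# before the kubelet declares it unhealthy and SIGKILLs it (exit 137).
web:
  livenessProbe:
    httpGet:
      path: /api/v1/info
      port: atc
    initialDelaySeconds: 10
    periodSeconds: 15
    timeoutSeconds: 5     # allow slow responses under load
    failureThreshold: 5   # tolerate several consecutive failures before a restart
```

The trade-off is slower detection of a genuinely wedged web node, which is usually preferable to restarting a healthy one mid-workload.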