Concourse web nodes restarting with exit code 137 #5505
-
Hello everyone, I have a Concourse instance deployed on Kubernetes using Helm.
The web nodes seem to be fine for a while and then go into a restart loop. When I looked at the pods, it is either the liveness probe or the readiness probe failing, causing restarts with exit code 137. The following are the probes defined for web:
Replies: 13 comments 4 replies
-
Hey @skreddy6673, could you share the logs from the … Thank you!
-
Hi @cirocosta
-
Redeployed with Helm chart 8.2.13 and increased the web node resources. Looks stable for now. I'll have to see how it does on Monday with more pipelines running.
-
Hi @vito @cirocosta One thing I couldn't understand here: once a web node goes into crash looping, there won't be any work allocated to it, right? (Correct me if I'm wrong.) So when it comes back, it should be able to start accepting work. Since there is no task assigned to the web node, it shouldn't fail with an OOM error.
This is what I see on a restarting pod:
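For reference on what the exit code means: 137 is 128 + 9, i.e. the process received SIGKILL. Both the kernel OOM killer and the kubelet (force-killing a container after a failed liveness probe and grace period) produce this same code, so a 137 on its own doesn't prove the pod was OOM-killed. A quick shell demonstration:

```shell
# Exit code 137 = 128 + 9 (SIGKILL): the same code whether the OOM killer
# fired or the kubelet force-killed the container.
sleep 30 &
pid=$!
kill -9 "$pid"        # simulate an OOM kill / forced liveness-probe kill
wait "$pid"
echo "exit code: $?"  # prints "exit code: 137"
```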
-
@cirocosta
-
Hey @skreddy6673, I was just browsing the discussion threads and came across this one. I'll try to answer your question when I get some time next week. Ciro is focusing on some non-Concourse stuff right now and is probably letting his Concourse notifications pile up. Feel free to ping me if I don't come back around here in a few days. I'm trying to get in the habit of checking discussions more but it hasn't sunk in for me yet 😅
-
Hi @taylorsilva My current configuration is:

Is there any way to find the pipelines that are utilizing the most resources?
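I'm not aware of a built-in "top pipelines" report, but one rough way to approximate it is to count active containers per pipeline from `fly containers` output. The column index and the simulated rows below are assumptions; adjust them to match your fly version's actual output before using this against a real target:

```shell
# Simulated `fly containers` rows (handle, worker, pipeline, job, ...).
# In practice, replace the printf with: fly -t <target> containers | tail -n +2
printf '%s\n' \
  'handle1 worker1 pipeline-a job1' \
  'handle2 worker1 pipeline-a job2' \
  'handle3 worker2 pipeline-b job1' |
awk '{ count[$3]++ } END { for (p in count) print count[p], p }' |
sort -rn
# prints:
# 2 pipeline-a
# 1 pipeline-b
```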
-
oh man, so sorry 😞 our plates were super full the last few weeks. Do you have any metrics for your web pods? It would be useful to know whether memory/CPU are spiking around the times you see the liveness probes failing. What version of the chart/Concourse are you using now?
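If pod-level metrics aren't already being collected, one option is Concourse's built-in Prometheus emitter, which the Helm chart can enable. This is a sketch; the exact keys (`concourse.web.prometheus.*` here) may differ across chart versions:

```yaml
# Illustrative values.yaml fragment: expose Concourse's own metrics so
# memory/CPU spikes can be correlated with probe failures.
concourse:
  web:
    prometheus:
      enabled: true
      bindIp: 0.0.0.0
      bindPort: 9391
```

A quicker first check, if metrics-server is installed in the cluster, is running `kubectl top pod` against the web pods around the time a probe fails.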
-
Here is my values file: https://github.com/skreddy6673/concourse-chart/blob/master/v5.7.1.yml
-
CPU utilization on one of the workers when the liveness probe failed:
-
@taylorsilva Events from describing the pod:
-
I think I found the solution.
Now I have another issue I need to work on: #5801
-
@skreddy6673 like @jamieklassen mentioned, increasing the liveness probe timings will probably help. k8s is killing the container because it thinks it's in a bad state when it's probably fine; it's just some workloads taking up a lot of resources.

The `137` exit code is an open issue on the chart. I don't see it cross-posted here, so I'm sharing it now in case it was never surfaced: concourse/concourse-chart#81
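To make the "increase the liveness probe" suggestion concrete, here is a rough sketch of what that could look like in the chart's values file. The key names and the probe path follow what the concourse-chart has exposed, but may differ by chart version, and the timing values are illustrative, not recommendations:

```yaml
# Illustrative values.yaml fragment: give a busy web node more headroom
# before the kubelet declares it unhealthy and SIGKILLs it (exit 137).
web:
  livenessProbe:
    httpGet:
      path: /api/v1/info
      port: atc
    initialDelaySeconds: 10
    periodSeconds: 15
    timeoutSeconds: 5     # allow slow responses under load
    failureThreshold: 5   # tolerate several consecutive failures before a restart
```

The trade-off is slower detection of a genuinely wedged web node, which is usually preferable to restarting a healthy one mid-workload.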