Proposal - Add go-deadlock detection on places we use mutexes #2611

Open
rikatz opened this issue Jan 10, 2024 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@rikatz
Contributor

rikatz commented Jan 10, 2024

Recently, we implemented a feature to force logouts on vSphere sessions.

This was due to heavy usage of vSphere sessions, with no logout happening after session or reconciliation errors.

But a new problem arrived: govmomi has a deadlock risk in its keepalive handler that can trigger when Logout is called, blocking every reconciliation that relies on session.GetOrCreate.

Because we may now be calling Logout multiple times, and this can deadlock when keepalive is in use, we need a way to detect this situation and restart when it happens. (Actually we need to fix the deadlock itself, but we should also have a way to catch it in the future.)

Some other projects have been using https://github.com/sasha-s/go-deadlock as a drop-in replacement for the builtin sync package, so the proposal here is that, at least in this specific part of the code where the deadlock may be happening (down in govmomi), we detect it and restart CAPV.

Using this library seems low risk: it can easily be reverted back to the builtin sync package, and it is already used in some other big projects.

Based on how it is used elsewhere, we can gate it behind options such as:

  • Should deadlock detection be enabled?
  • What's a lock timeout before considering the process deadlocked?

We also need to decide whether we should panic on a detected deadlock, so that the pod/container restarts. What should the deadlock-handling callback do?
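
For illustration, since go-deadlock keeps the sync API, the swap itself is mechanical. Below is a minimal sketch; the Session and sessionCache types are hypothetical stand-ins for the real CAPV session code, not the actual types:

```go
package session

import (
	deadlock "github.com/sasha-s/go-deadlock" // drop-in for "sync"
)

// Session is a hypothetical placeholder for the vSphere session type;
// the real CAPV type lives elsewhere and looks different.
type Session struct{}

// sessionCache is likewise hypothetical: a structure guarding cached
// sessions behind a mutex, the way session.GetOrCreate does.
type sessionCache struct {
	mu       deadlock.Mutex // was sync.Mutex; the API is identical
	sessions map[string]*Session
}

// GetOrCreate returns the cached session for key, creating it if needed.
func (c *sessionCache) GetOrCreate(key string) *Session {
	// go-deadlock times this acquisition; if it blocks longer than
	// deadlock.Opts.DeadlockTimeout, the configured callback fires.
	c.mu.Lock()
	defer c.mu.Unlock()

	s, ok := c.sessions[key]
	if !ok {
		s = &Session{}
		c.sessions[key] = s
	}
	return s
}
```

The enable/disable question above maps to the library's deadlock.Opts.Disable switch, and the timeout question to deadlock.Opts.DeadlockTimeout.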

@berlin-ab tagging you as well, since we had some discussions about this.

@sbueringer
Member

sbueringer commented Jan 10, 2024

Overall sounds good to me

Should deadlock detection be enabled?
We also need to decide whether we should panic on a detected deadlock, so that the pod/container restarts. What should the deadlock-handling callback do?

I think overall we can enable it. The question is what we do if a deadlock occurs, as you said. We could always enable it and then make the behavior of the callback configurable via a command-line flag.

What's a lock timeout before considering the process deadlocked?

I guess we can pick a timeout that is higher than whatever is reasonable for the code we cover with the mutex, plus some buffer. If we pick something >= 1m we should be okay?
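
A sketch of what that wiring could look like. deadlock.Opts, with its DeadlockTimeout and OnPotentialDeadlock fields, is the library's real configuration surface, but the flag names here are hypothetical:

```go
package main

import (
	"flag"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func main() {
	// Hypothetical flag names for illustration; the real flags would be
	// registered next to CAPV's existing manager flags.
	panicOnDeadlock := flag.Bool("panic-on-deadlock", true,
		"panic when a potential deadlock is detected so the container restarts")
	deadlockTimeout := flag.Duration("deadlock-detection-timeout", time.Minute,
		"how long a lock acquisition may block before it is reported")
	flag.Parse()

	// deadlock.Opts is the library's global configuration.
	deadlock.Opts.DeadlockTimeout = *deadlockTimeout
	deadlock.Opts.OnPotentialDeadlock = func() {
		// go-deadlock has already written the stack traces of the
		// goroutines involved before invoking this callback.
		if *panicOnDeadlock {
			panic("potential deadlock detected, restarting")
		}
	}
}
```

With detection always on, the flag only decides what the callback does, which keeps the revert path (back to plain sync) trivial.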

@berlin-ab
Contributor

The question is what we do if a deadlock occurs, as you said

I think ideally the system heals itself rather than remaining in a deadlocked state. I guess your question is "is it always safe to panic?"

@sbueringer
Member

sbueringer commented Jan 10, 2024

Yeah, and also: if we panic, does this improve the situation, or do we just immediately panic again, and the only thing we achieve is ending up in CrashLoopBackOff?

But I think in this case it actually helps and improves uptime: the sessions are recreated from scratch on the client side, and if we're stuck in this mutex everything is blocked anyway, so the situation can't get any worse than that.

(It would be different if only a specific Machine or Cluster were stuck, but this is not the case here.)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2024