Proposal - Add go-deadlock detection on places we use mutexes #2611

Open
rikatz opened this issue Jan 10, 2024 · 5 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@rikatz
Contributor

rikatz commented Jan 10, 2024

Recently, we implemented a feature to force logouts on vSphere sessions.

This was due to heavy usage of vSphere sessions, with no logout happening after session or reconciliation errors.

But a new problem arrived: govmomi has a deadlock risk in its keepalive handler that can trigger when Logout is called, blocking every reconciliation that relies on session.GetOrCreate.

Because we may now be calling Logout multiple times, and this can deadlock when keepalive is in use, we need a way to detect this situation and restart when it happens. (Actually we need to fix the deadlock itself, but we should also have a way to catch it in the future.)

Some other projects have been using https://github.com/sasha-s/go-deadlock as a drop-in replacement for the builtin sync package, so the proposal here is that, at least in this specific part of the code where the deadlock may be happening (down in govmomi), we detect it and restart CAPV.

Using this library seems low risk: it can easily be reverted back to the builtin sync package, and it is already used in some other big projects.

Based on how it is used elsewhere, we can gate it behind options such as:

  • Should deadlock detection be enabled?
  • What's a lock timeout before considering the process deadlocked?

We also need to decide whether we should panic on a detected deadlock, so that the pod/container restarts. What should the deadlock-handling callback do?
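
For illustration, since go-deadlock keeps the sync API, the swap itself is mechanical. Below is a minimal sketch; the Session and sessionCache types are hypothetical stand-ins for the real CAPV session code, not the actual types:

```go
package session

import (
	deadlock "github.com/sasha-s/go-deadlock" // drop-in for "sync"
)

// Session is a hypothetical placeholder for the vSphere session type;
// the real CAPV type lives elsewhere and looks different.
type Session struct{}

// sessionCache is likewise hypothetical: a structure guarding cached
// sessions behind a mutex, the way session.GetOrCreate does.
type sessionCache struct {
	mu       deadlock.Mutex // was sync.Mutex; the API is identical
	sessions map[string]*Session
}

// GetOrCreate returns the cached session for key, creating it if needed.
func (c *sessionCache) GetOrCreate(key string) *Session {
	// go-deadlock times this acquisition; if it blocks longer than
	// deadlock.Opts.DeadlockTimeout, the configured callback fires.
	c.mu.Lock()
	defer c.mu.Unlock()

	s, ok := c.sessions[key]
	if !ok {
		s = &Session{}
		c.sessions[key] = s
	}
	return s
}
```

The enable/disable question above maps to the library's deadlock.Opts.Disable switch, and the timeout question to deadlock.Opts.DeadlockTimeout.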

@berlin-ab tagging you as well, since we had some discussions about this.

@sbueringer
Member

sbueringer commented Jan 10, 2024

Overall sounds good to me

Should deadlock detection be enabled?
We also need to decide whether we should panic on a detected deadlock, so that the pod/container restarts. What should the deadlock-handling callback do?

I think overall we can enable it. The question is what we do if a deadlock occurs, as you said. We could always enable it and then make the behavior of the callback configurable via a command-line flag.

What's a lock timeout before considering the process deadlocked?

I guess we can pick a timeout that is higher than whatever is reasonable for the code we cover with the mutex, plus some buffer. If we pick something >= 1m we should be okay?
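
A sketch of what that wiring could look like. deadlock.Opts, with its DeadlockTimeout and OnPotentialDeadlock fields, is the library's real configuration surface, but the flag names here are hypothetical:

```go
package main

import (
	"flag"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

func main() {
	// Hypothetical flag names for illustration; the real flags would be
	// registered next to CAPV's existing manager flags.
	panicOnDeadlock := flag.Bool("panic-on-deadlock", true,
		"panic when a potential deadlock is detected so the container restarts")
	deadlockTimeout := flag.Duration("deadlock-detection-timeout", time.Minute,
		"how long a lock acquisition may block before it is reported")
	flag.Parse()

	// deadlock.Opts is the library's global configuration.
	deadlock.Opts.DeadlockTimeout = *deadlockTimeout
	deadlock.Opts.OnPotentialDeadlock = func() {
		// go-deadlock has already written the stack traces of the
		// goroutines involved before invoking this callback.
		if *panicOnDeadlock {
			panic("potential deadlock detected, restarting")
		}
	}
}
```

With detection always on, the flag only decides what the callback does, which keeps the revert path (back to plain sync) trivial.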

@berlin-ab
Contributor

The question is what we do if a deadlock occurs, as you said

I think ideally the system heals itself rather than remaining in a deadlocked state. I guess your question is "is it always safe to panic?"

@sbueringer
Member

sbueringer commented Jan 10, 2024

Yeah, and also: if we panic, does this improve the situation, or do we just immediately panic again, and the only thing we achieve is ending up in CrashLoopBackOff?

But I think in this case it actually helps and improves uptime: the sessions are recreated from scratch on the client side, and if we're stuck in this mutex everything is blocked anyway, so the situation can't get any worse than that.

(It would be different if only a specific Machine or Cluster were stuck, but this is not the case here.)

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2024