Proposal - Add go-deadlock detection on places we use mutexes #2611
Comments
Overall sounds good to me
I think overall we can enable it. The question, as you said, is what we do if a deadlock occurs. We could always enable it and make the behavior of the callback configurable via a command-line flag.
I guess we can pick a timeout that is higher than whatever is reasonable for the code the mutex covers, plus some buffer. If we pick something >= 1m we should be okay?
I think ideally the system heals itself rather than remaining in a deadlocked state. I guess your question is "is it always safe to panic?"
Yeah, and also: if we panic, does that actually improve the situation, or do we just panic again right away and the only thing we achieve is ending up in CrashLoopBackOff? But I think in this case it does help and improves uptime: the sessions are recreated from scratch on the client side, and since everything is blocked anyway while we're stuck on this mutex, the situation can't get any worse. (It would be different if only a specific Machine or Cluster were stuck, but that is not the case here.)
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Recently, we implemented a feature to force logouts of vSphere sessions.
This was due to heavy usage of vSphere sessions with no logout after errors on the session or during reconciliation.
But a new problem arrived: govmomi has a deadlock risk in its keepalive handler that can occur when calling Logout, blocking all of the reconciliation processes that rely on session.GetOrCreate.
Because we may now be calling Logout multiple times, and this can deadlock when keepalive is in use, we need a way to detect this situation and restart when it happens. (Actually we need to fix the deadlock itself, but we should also have a way to catch this class of problem in the future.)
Some other projects have been using https://github.com/sasha-s/go-deadlock as an in-place replacement for the builtin sync package, so the proposal here is that, at least in this specific part of the code where the deadlock may be happening (down into govmomi), we detect it and restart CAPV.
Using this library seems low risk, as it can easily be reverted back to the builtin `sync` package, and it is already used in some other big projects. Based on its usage, we can either flag it with options like:
We also need to decide whether we should panic on deadlock, so that the pod/container restarts. What should the deadlock-handling callback do?
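The options referenced above were not captured here, but as an illustration, wiring up go-deadlock might look roughly like the following. This is a hedged sketch: the `configureDeadlockDetection` hook and the `disable` flag are hypothetical, while `deadlock.Opts.Disable`, `deadlock.Opts.DeadlockTimeout`, `deadlock.Opts.LogBuf`, and `deadlock.Opts.OnPotentialDeadlock` are knobs documented by the go-deadlock package.

```go
package session

import (
	"os"
	"time"

	deadlock "github.com/sasha-s/go-deadlock"
)

// configureDeadlockDetection is a hypothetical init hook, e.g. driven
// by a command-line flag as discussed above.
func configureDeadlockDetection(disable bool) {
	// Disable reverts to plain sync.Mutex behavior with no overhead.
	deadlock.Opts.Disable = disable
	// Pick a timeout well above any reasonable critical-section length
	// (">= 1m" per the discussion above).
	deadlock.Opts.DeadlockTimeout = time.Minute
	deadlock.Opts.LogBuf = os.Stderr
	deadlock.Opts.OnPotentialDeadlock = func() {
		// Panic so the container exits and Kubernetes restarts the
		// pod, recreating vSphere sessions from scratch.
		panic("potential deadlock detected in session handling")
	}
}

// A guarded struct would then swap sync.Mutex for deadlock.Mutex:
type sessionCache struct {
	mu deadlock.Mutex // drop-in replacement for sync.Mutex
	// ...
}
```

Since deadlock.Mutex satisfies the same Lock/Unlock surface as sync.Mutex, reverting is a one-line import change, which is what keeps the risk of adopting the library low.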
@berlin-ab tagging you as well, as we did have some discussions about this.