Node Termination handler may still be necessary #43
Comments
At the very least, it seems that the handler is useful through the 1.21 series of releases.
Hi @chrisroat! The GKE issue was closed recently. Are you still seeing problems with node shutdowns that would require the node termination handler?
I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances. @erichamc would be able to test, though I don't think checking would be high priority. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster might, if it had enough preemptions to trigger the issue, show failing workloads unable to find the NVIDIA libraries. @erichamc: dropping the termination handler would amount to dropping the
The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:
I have been using the Node Termination handler with GKE < 1.20, on preemptible nodes with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.
I have moved to GKE 1.21.1-gke.2200 and hit the same error I used to get on versions < 1.20 without the Node Termination handler. The error occurs only occasionally, so it seems like potentially the same race condition.
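For context, the affected workloads are of this general shape: GPU pods pinned to preemptible nodes. The manifest below is a hypothetical illustration (pod name, image, and command are mine), using the standard `nvidia.com/gpu` resource and the GKE `cloud.google.com/gke-preemptible` node label:

```yaml
# Illustrative GPU workload on a preemptible GKE node.
# After a preemption-driven restart, a pod like this could come up
# before the NVIDIA driver libraries were ready, which is the race
# condition described above.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job        # hypothetical name
spec:
  nodeSelector:
    cloud.google.com/gke-preemptible: "true"
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```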
I filed the following GKE issue.
https://issuetracker.google.com/issues/192809336
For the moment, I would ask that this repo not be deprecated.
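For reference, the Graceful Node Shutdown feature that the README points to is configured through the kubelet rather than a DaemonSet. A minimal sketch of the relevant `KubeletConfiguration` fields follows; the durations are illustrative, and on GKE the kubelet configuration is managed by the platform, so this applies directly only to self-managed kubelets:

```yaml
# Sketch of a KubeletConfiguration enabling Graceful Node Shutdown
# (beta in Kubernetes 1.21). Values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 30s             # total grace period on node shutdown
shutdownGracePeriodCriticalPods: 10s # portion reserved for critical pods
```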