This repository has been archived by the owner on Mar 14, 2023. It is now read-only.

Node Termination handler may still be necessary #43

Open
chrisroat opened this issue Jul 14, 2021 · 3 comments

Comments

@chrisroat

The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:

⚠️ Deprecation Notice
As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

I have been using the Node Termination handler with GKE < 1.20, on preemptible nodes with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.

I have moved to GKE 1.21.1-gke.2200 and hit the same error I used to get on versions < 1.20 without the Node Termination handler. The error occurs only occasionally, so it seems like potentially the same race condition.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
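
As a stopgap, a startup check along the lines of the sketch below (my own illustration, not code from this repo; names, paths, and retry counts are arbitrary) can at least surface the bad state early, so the pod exits and restarts instead of hitting the ImportError deep inside the workload:

```python
# Rough sketch of a startup check for the race condition described above.
# Everything here (function names, retry counts) is illustrative, not from
# the termination handler or GKE itself.
import ctypes
import sys
import time


def libcuda_available(retries: int = 12, delay_s: float = 5.0) -> bool:
    """Return True once libcuda.so.1 can be loaded, retrying for a while."""
    for _ in range(retries):
        try:
            # ctypes.CDLL raises OSError when the shared library is missing,
            # which is the same condition that surfaces as the ImportError.
            ctypes.CDLL("libcuda.so.1")
            return True
        except OSError:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    if not libcuda_available():
        print("libcuda.so.1 still missing; exiting so the pod restarts",
              file=sys.stderr)
        sys.exit(1)
```

Running something like this at the container entrypoint (or as an init step) only masks the symptom; the underlying race on node restart is what the termination handler was covering for.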

I filed the following GKE issue:
https://issuetracker.google.com/issues/192809336

For the moment, I would ask that this repo not be deprecated.

@chrisroat
Author

At the very least, it seems that the handler is useful through the 1.21 series of releases:
https://issuetracker.google.com/issues/204415098

@torbendury

Hi @chrisroat! The GKE issue was closed recently. Are you still seeing problems with node shutdowns that would still require the node termination handler?

@chrisroat
Author

I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances.

@erichamc would be able to test, though I don't think checking would be a high priority. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster that had enough preemptions to trigger the issue would show failing workloads unable to find the NVIDIA libraries. [@erichamc: dropping the termination handler would amount to dropping the null_resource stanzas in infrastructure/apps/k8s/kubectl.tf]
