This repository has been archived by the owner on Mar 14, 2023. It is now read-only.

Node Termination handler may still be necessary #43

Open
chrisroat opened this issue Jul 14, 2021 · 3 comments

Comments

@chrisroat

The current README states this handler is deprecated in favor of the new Graceful Node Shutdown:

⚠️ Deprecation Notice
As of Kubernetes 1.20, Graceful Node Shutdown replaces the need for GCP Node termination handler. GKE on versions 1.20+ enables Graceful Node Shutdown by default. Refer to the GKE documentation and Kubernetes documentation for more info about Graceful Node Shutdown (docs, blog post).

I have been using the Node Termination handler with GKE < 1.20, on preemptible nodes with GPUs. The handler was needed to avoid a race condition on node restart that sometimes caused pods not to correctly recognize the GPU.

I have moved to GKE 1.21.1-gke.2200 and hit the same error I used to get on versions < 1.20 without the Node Termination handler. The error occurs only occasionally, so it seems like potentially the same race condition.

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
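
As a stopgap, a startup check along the lines of the sketch below (my own illustration, not code from this repo; names, paths, and retry counts are arbitrary) can at least surface the bad state early, so the pod exits and restarts instead of hitting the ImportError deep inside the workload:

```python
# Rough sketch of a startup check for the race condition described above.
# Everything here (function names, retry counts) is illustrative, not from
# the termination handler or GKE itself.
import ctypes
import sys
import time


def libcuda_available(retries: int = 12, delay_s: float = 5.0) -> bool:
    """Return True once libcuda.so.1 can be loaded, retrying for a while."""
    for _ in range(retries):
        try:
            # ctypes.CDLL raises OSError when the shared library is missing,
            # which is the same condition that surfaces as the ImportError.
            ctypes.CDLL("libcuda.so.1")
            return True
        except OSError:
            time.sleep(delay_s)
    return False


if __name__ == "__main__":
    if not libcuda_available():
        print("libcuda.so.1 still missing; exiting so the pod restarts",
              file=sys.stderr)
        sys.exit(1)
```

Running something like this at the container entrypoint (or as an init step) only masks the symptom; the underlying race on node restart is what the termination handler was covering for.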

I filed the following GKE issue:
https://issuetracker.google.com/issues/192809336

For the moment, I would ask that this repo not be deprecated.

@chrisroat
Author

At the very least, it seems that the handler is useful through the 1.21 series of releases:
https://issuetracker.google.com/issues/204415098

@torbendury

Hi @chrisroat! The GKE issue was closed recently. Are you still seeing problems with node shutdowns that would still require the node termination handler?

@chrisroat
Author

I no longer maintain the (closed-source) project that was hitting the issue. We had forked this repo to add the ability to handle spot instances.

@erichamc would be able to test, though I don't think checking would be a high priority. For reference, the symptom was that the cluster's GPU workloads would not restart properly after node preemptions. Over time, a cluster that had enough preemptions to trigger the issue would show failing workloads unable to find the NVIDIA libraries. [@erichamc: dropping the termination handler would amount to dropping the null_resource stanzas in infrastructure/apps/k8s/kubectl.tf]
