node-agent is not waiting for Dataupload completion before eviction during node image upgrade #7759

Open
veerendra2 opened this issue Apr 30, 2024 · 2 comments

@veerendra2

During a node image upgrade (maintenance), in-progress DataUploads are canceled with the log message 'found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel'. I would expect the node-agent to keep running until the ongoing backup (DataUpload) completes, and only then be evicted.


What steps did you take and what happened:

  1. Enable CSI Snapshot Data Movement
  2. Trigger a new backup with --snapshot-move-data, for example with the command below:
    $ velero backup create backup --include-namespaces [NAMESPACE] --include-resources persistentvolumeclaims --snapshot-move-data
  3. Trigger a node image upgrade (maintenance)
  4. The node-agent pods restart and the DataUpload is canceled with the message 'found a dataupload with status "InProgress" during the node-agent starting, mark it as cancel'

What did you expect to happen:
I would expect the node-agent to complete the ongoing DataUpload before it is evicted. One option might be to use Container Lifecycle Hooks so that Kubernetes waits for DataUpload completion before evicting the node-agent pod.
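
To make the lifecycle-hook idea concrete, here is a rough sketch of the wait logic a preStop hook could run. It is not how Velero works today, only an illustration. It assumes each DataUpload object exposes its phase under status.phase and the node processing it under status.node; list_datauploads is a hypothetical callable that returns the DataUpload objects as dicts (for example, parsed from the Kubernetes API). Note that Kubernetes only waits for a preStop hook up to the pod's terminationGracePeriodSeconds, so the timeout would have to fit inside that window.

```python
import time
from typing import Callable, Iterable, List, Mapping

ACTIVE_PHASE = "InProgress"  # phase string quoted in the log message above


def active_uploads(datauploads: Iterable[Mapping], node_name: str) -> List[str]:
    """Return the names of DataUploads still InProgress on the given node."""
    names = []
    for du in datauploads:
        status = du.get("status") or {}
        # Assumption: status.phase and status.node carry the phase and node name.
        if status.get("phase") == ACTIVE_PHASE and status.get("node") == node_name:
            names.append((du.get("metadata") or {}).get("name", ""))
    return names


def wait_for_uploads(
    list_datauploads: Callable[[], Iterable[Mapping]],  # hypothetical fetcher
    node_name: str,
    timeout_s: float,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no upload is active on this node; False if the timeout hits first."""
    deadline = clock() + timeout_s
    while True:
        if not active_uploads(list_datauploads(), node_name):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A hook wrapper would call wait_for_uploads with the node's own name and exit once it returns, letting the pod terminate only after the upload finishes or the timeout expires.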

Environment:

$ velero version
Client:
	Version: v1.13.2
	Git commit: -
Server:
	Version: v1.13.2

$ k version
Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5
WARNING: version difference between client (1.30) and server (1.28) exceeds the supported minor version skew of +/-1

  • Kubernetes installer & version: Azure Kubernetes Service(AKS)
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): - RuntimeOS: ubuntu

@Lyndon-Li
Contributor

This can be addressed by #7198

@veerendra2
Author

In the meantime, I created a watchdog script that can run as a CronJob: https://github.com/veerendra2/velero-watchdog
