
Taking a StatefulSet backup on the ACTIVE cluster and restoring it to the STANDBY cluster B using Velero gives an error #7737

Open
kish5430 opened this issue Apr 25, 2024 · 7 comments

@kish5430

What steps did you take and what happened:
While working on an active EKS cluster, I deployed an application with three etcd pods and took a backup of those pods using Velero. Later, I switched to a standby cluster and attempted to restore the backup. Although the restore itself completed successfully, the pods were deployed but not running: attaching volumes to the etcd pods failed.

Command: velero backup create milvus-stg-east1-etcd-backup --selector 'app.kubernetes.io/name=etcd'
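For reference, a sketch of the corresponding restore step (the restore name is taken from the Velero logs further down; the --details check is an assumed verification step, not part of the original report):

$ velero backup describe milvus-stg-east1-etcd-backup --details   # confirm the backup captured volume snapshots
$ velero restore create milvus-stg-east1-etcd-restore --from-backup milvus-stg-east1-etcd-backup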

What did you expect to happen:
Volumes should attach and the etcd pods should run without any issue.

Etcd Pod logs:
Warning FailedAttachVolume 101s (x11 over 34m) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7" : rpc error: code = Internal desc = Could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": InvalidVolume.NotFound: The volume 'vol-00c1e0e23881130c9' does not exist.
status code: 400, request id: 4160e339-013b-4b3b-8f39-c3990cf66c2e

Here the volume 'vol-00c1e0e23881130c9' does not exist among the volumes in AWS.

Please find the attached Velero restore logs.
velero_restore.txt

@allenxu404
Contributor

What Velero version are you using? Can you provide more debug info by using the command from this doc?
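(Presumably this refers to Velero's debug bundle; a sketch, assuming a Velero version that ships the debug command, with the backup and restore names taken from this thread:

$ velero debug --backup milvus-stg-east1-etcd-backup --restore milvus-stg-east1-etcd-restore

This collects logs and resource state into a tarball that can be attached to the issue.)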

@kish5430

@allenxu404 Please let me know if any additional information is required.

@allenxu404
Contributor

The log given above looks normal. The PV was successfully restored from the snapshot, as the log messages below show:

time="2024-04-25T05:46:33Z" level=info msg="Restoring persistent volume from snapshot." logSource="pkg/restore/restore.go:2453" restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-25T05:46:34Z" level=info msg="successfully restored persistent volume from snapshot" logSource="pkg/restore/pv_restorer.go:91" persistentVolume=pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7 providerSnapshotID=snap-0d4da2d4c9d3f2c0d restore=velero/milvus-stg-east1-etcd-restore

It seems that the VolumeId was not available to cluster B for some reason. I think you can troubleshoot further by restoring the PV on the ACTIVE cluster instead of the STANDBY cluster B. I assume the restore will work in that case.
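One way to narrow this down from the AWS side is to check whether the snapshot and the restored volume actually exist in the region the standby cluster runs in (IDs taken from the logs above; the region value is an assumption, adjust to your setup):

$ aws ec2 describe-snapshots --snapshot-ids snap-0d4da2d4c9d3f2c0d --region us-east-1
$ aws ec2 describe-volumes --volume-ids vol-00c1e0e23881130c9 --region us-east-1

If the volume only exists in another region or availability zone, the attach call from cluster B's nodes would fail with exactly this InvalidVolume.NotFound error.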

@kish5430
Author

Hi @allenxu404,
It's not working on the Active cluster either. I ran a Velero restore on the Active cluster and got the same issue.
Thanks

@blackpiglet
Contributor

time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot-related CRDs are missing from the cluster.
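A quick way to verify, and a sketch of installing the CRDs from the external-snapshotter project if they are absent (the release branch here is an assumption; pick one matching your snapshot-controller version):

$ kubectl get crd | grep snapshot.storage.k8s.io
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml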

@kish5430
Author

kish5430 commented Apr 29, 2024

time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot related CRDs are missed from the cluster.

Hi @blackpiglet,

I have already installed the volume snapshot CRDs:

$ kubectl api-resources | grep -i 'volume'
persistentvolumeclaims    pvc                 v1                                  true    PersistentVolumeClaim
persistentvolumes         pv                  v1                                  false   PersistentVolume
k8spspvolumetypes                             constraints.gatekeeper.sh/v1beta1   false   K8sPSPVolumeTypes
volumesnapshotclasses     vsclass,vsclasses   snapshot.storage.k8s.io/v1          false   VolumeSnapshotClass
volumesnapshotcontents    vsc,vscs            snapshot.storage.k8s.io/v1          false   VolumeSnapshotContent
volumesnapshots           vs                  snapshot.storage.k8s.io/v1          true    VolumeSnapshot
volumeattachments                             storage.k8s.io/v1                   false   VolumeAttachment
podvolumebackups                              velero.io/v1                        true    PodVolumeBackup
podvolumerestores                             velero.io/v1                        true    PodVolumeRestore
volumesnapshotlocations   vsl                 velero.io/v1                        true    VolumeSnapshotLocation

Thanks

@allenxu404
Contributor

@kish5430 Can you help verify the status of the associated PV and PVC to confirm they are functional? Additionally, can you access the AWS console to validate that the volume was created and is properly configured in the backend?
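A sketch of those checks, with the PV name taken from the logs above (the etcd label selector is reused from the backup command; run the PVC check in whatever namespace etcd lives in):

$ kubectl get pvc -l app.kubernetes.io/name=etcd
$ kubectl describe pv pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7
$ aws ec2 describe-volumes --volume-ids vol-00c1e0e23881130c9

The describe output shows which EBS VolumeId the PV points at; if the AWS CLI cannot find that ID in the cluster's region, the PV was restored pointing at a volume that was never created there.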

@reasonerjt reasonerjt self-assigned this May 10, 2024
@reasonerjt reasonerjt added the Needs info Waiting for information label May 10, 2024