
Taking a StatefulSet backup on the ACTIVE cluster and restoring it to the STANDBY cluster B using Velero gives an error #7737

Open
kish5430 opened this issue Apr 25, 2024 · 7 comments

@kish5430

What steps did you take and what happened:
While working on an active EKS cluster, I deployed an application with three etcd pods and took a backup of those pods using Velero. Later, I switched to a standby cluster and attempted to restore the backup. Although the restore itself completed successfully, the pods were deployed but not running: attaching volumes to the etcd pods failed.

Command: velero backup create milvus-stg-east1-etcd-backup --selector 'app.kubernetes.io/name=etcd'
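For reference, a sketch of the corresponding restore step (the restore name is taken from the Velero logs further down; the --details check is an assumed verification step, not part of the original report):

$ velero backup describe milvus-stg-east1-etcd-backup --details   # confirm the backup captured volume snapshots
$ velero restore create milvus-stg-east1-etcd-restore --from-backup milvus-stg-east1-etcd-backup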

What did you expect to happen:
Volumes should attach and the etcd pods should run without any issue.

Etcd Pod logs:
Warning FailedAttachVolume 101s (x11 over 34m) attachdetach-controller (combined from similar events): AttachVolume.Attach failed for volume "pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7" : rpc error: code = Internal desc = Could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": could not attach volume "vol-00c1e0e23881130c9" to node "i-03a2b2d33c76ccef2": InvalidVolume.NotFound: The volume 'vol-00c1e0e23881130c9' does not exist.
status code: 400, request id: 4160e339-013b-4b3b-8f39-c3990cf66c2e

Here the volume 'vol-00c1e0e23881130c9' does not exist among the volumes in AWS.

Please find the attached Velero restore logs.
velero_restore.txt

@allenxu404
Contributor

What Velero version are you using? Can you provide more debug info by using the command from this doc?
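(Presumably this refers to Velero's debug bundle; a sketch, assuming a Velero version that ships the debug command, with the backup and restore names taken from this thread:

$ velero debug --backup milvus-stg-east1-etcd-backup --restore milvus-stg-east1-etcd-restore

This collects logs and resource state into a tarball that can be attached to the issue.)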

@kish5430

@allenxu404 Please let me know if any additional information is required.

@allenxu404
Contributor

The log given above looks normal. The PV was successfully restored from the snapshot, as the log messages below show:

time="2024-04-25T05:46:33Z" level=info msg="Restoring persistent volume from snapshot." logSource="pkg/restore/restore.go:2453" restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-25T05:46:34Z" level=info msg="successfully restored persistent volume from snapshot" logSource="pkg/restore/pv_restorer.go:91" persistentVolume=pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7 providerSnapshotID=snap-0d4da2d4c9d3f2c0d restore=velero/milvus-stg-east1-etcd-restore

It seems that the VolumeId was not available to cluster B for some reason. I think you can troubleshoot further by restoring the PV on the ACTIVE cluster instead of the STANDBY cluster B. I assume the restore will work in that case.
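One way to narrow this down from the AWS side is to check whether the snapshot and the restored volume actually exist in the region the standby cluster runs in (IDs taken from the logs above; the region value is an assumption, adjust to your setup):

$ aws ec2 describe-snapshots --snapshot-ids snap-0d4da2d4c9d3f2c0d --region us-east-1
$ aws ec2 describe-volumes --volume-ids vol-00c1e0e23881130c9 --region us-east-1

If the volume only exists in another region or availability zone, the attach call from cluster B's nodes would fail with exactly this InvalidVolume.NotFound error.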

@kish5430
Author

Hi @allenxu404,
It's not working on the Active cluster either. I ran a Velero restore on the Active cluster and got the same issue.
Thanks

@blackpiglet
Contributor

time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot-related CRDs are missing from the cluster.
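A quick way to verify, and a sketch of installing the CRDs from the external-snapshotter project if they are absent (the release branch here is an assumption; pick one matching your snapshot-controller version):

$ kubectl get crd | grep snapshot.storage.k8s.io
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/release-6.3/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml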

@kish5430
Author

kish5430 commented Apr 29, 2024

time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotclass.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshotcontents.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore
time="2024-04-24T18:42:02Z" level=info msg="Skipping restore of resource because it cannot be resolved via discovery" logSource="pkg/restore/restore.go:2185" resource=volumesnapshots.snapshot.storage.k8s.io restore=velero/milvus-stg-east1-etcd-restore

It seems the CSI snapshot related CRDs are missed from the cluster.

Hi @blackpiglet,

I have already installed the volume snapshot CRDs:

$ kubectl api-resources | grep -i 'volume'
persistentvolumeclaims    pvc                 v1                                  true    PersistentVolumeClaim
persistentvolumes         pv                  v1                                  false   PersistentVolume
k8spspvolumetypes                             constraints.gatekeeper.sh/v1beta1   false   K8sPSPVolumeTypes
volumesnapshotclasses     vsclass,vsclasses   snapshot.storage.k8s.io/v1          false   VolumeSnapshotClass
volumesnapshotcontents    vsc,vscs            snapshot.storage.k8s.io/v1          false   VolumeSnapshotContent
volumesnapshots           vs                  snapshot.storage.k8s.io/v1          true    VolumeSnapshot
volumeattachments                             storage.k8s.io/v1                   false   VolumeAttachment
podvolumebackups                              velero.io/v1                        true    PodVolumeBackup
podvolumerestores                             velero.io/v1                        true    PodVolumeRestore
volumesnapshotlocations   vsl                 velero.io/v1                        true    VolumeSnapshotLocation

Thanks

@allenxu404
Contributor

@kish5430 Can you help verify the status of the associated PV and PVC to confirm they are functional? Additionally, can you access the AWS console to validate that the volume was created and is properly configured in the backend?
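A sketch of those checks, with the PV name taken from the logs above (the etcd label selector is reused from the backup command; run the PVC check in whatever namespace etcd lives in):

$ kubectl get pvc -l app.kubernetes.io/name=etcd
$ kubectl describe pv pvc-ed7a6088-9f9e-46fc-88ab-bbe8364a28f7
$ aws ec2 describe-volumes --volume-ids vol-00c1e0e23881130c9

The describe output shows which EBS VolumeId the PV points at; if the AWS CLI cannot find that ID in the cluster's region, the PV was restored pointing at a volume that was never created there.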

@reasonerjt reasonerjt self-assigned this May 10, 2024
@reasonerjt reasonerjt added the Needs info Waiting for information label May 10, 2024