BR backup could raise error when PD leader changed during BR initialization #5630

matchge-ca · 2024-04-21T03:41:43Z

Bug Report

What version of Kubernetes are you using?

What version of TiDB Operator are you using?

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

What's the status of the TiDB cluster pods?

What did you do?

Follow any official document to back up a cluster using a CR (for example, https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237).
During BR initialization, switch the PD leader to a different pod or take the PD leader offline.
The BR job raises the following error:
error=\"pd address not available, ..., dial tcp: lookup <pd addr>: no such host, please check network
This is most likely because, when BR is executed through the operator, only the PD leader address is used to discover the PD cluster member list. BR run through TiUP accepts multiple PD addresses on the command line, so that a single PD failure does not break discovery; maybe the operator should consider doing the same. Code ref: https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237
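A minimal sketch of the suggestion above, not the operator's actual code: it builds a single `--pd` argument that lists every PD pod instead of one address. The helper names `pdEndpoints` and `pdFlag` are hypothetical, the per-pod DNS pattern `<cluster>-pd-<i>.<cluster>-pd-peer.<namespace>` is an assumption about how the PD pods are named, and passing a comma-separated list to `--pd` is assumed to be accepted by the BR version in use.

```go
package main

import (
	"fmt"
	"strings"
)

// pdEndpoints returns one address per PD replica.
// ASSUMPTION: pods are reachable as "<cluster>-pd-<i>.<cluster>-pd-peer.<namespace>";
// this naming is illustrative and not taken from the issue.
func pdEndpoints(cluster, namespace string, replicas, port int) []string {
	addrs := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		addrs = append(addrs, fmt.Sprintf("%s-pd-%d.%s-pd-peer.%s:%d",
			cluster, i, cluster, namespace, port))
	}
	return addrs
}

// pdFlag joins the addresses into one flag value.
// ASSUMPTION: BR accepts a comma-separated list for --pd.
func pdFlag(addrs []string) string {
	return "--pd=" + strings.Join(addrs, ",")
}

func main() {
	fmt.Println(pdFlag(pdEndpoints("cluster", "namespace", 3, 2379)))
}
```

With several addresses in the flag, discovery would not depend on any single PD pod being resolvable at the moment BR starts.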

What did you expect to see?
BR is able to run when the PD leader goes offline during discovery.

What did you see instead?
BR failed and raised an error


csuzhangxc · 2024-04-22T07:15:10Z

The address built by fmt.Sprintf("--pd=%s-pd.%s:%d", backup.Spec.BR.Cluster, clusterNamespace, v1alpha1.DefaultPDClientPort) is a K8s Service with all PD members as its backends.

Different DNS lookup calls should resolve it to other PD members.
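To make the address concrete, here is a small sketch that evaluates the same format string with placeholder values. The cluster name `cluster` and namespace `namespace` are illustrative, `pdServiceFlag` is a hypothetical helper, and 2379 is assumed as the value of `v1alpha1.DefaultPDClientPort`.

```go
package main

import "fmt"

// defaultPDClientPort stands in for v1alpha1.DefaultPDClientPort.
// ASSUMPTION: the value is 2379.
const defaultPDClientPort = 2379

// pdServiceFlag formats the --pd argument the same way as the snippet quoted above.
func pdServiceFlag(cluster, namespace string) string {
	return fmt.Sprintf("--pd=%s-pd.%s:%d", cluster, namespace, defaultPDClientPort)
}

func main() {
	fmt.Println(pdServiceFlag("cluster", "namespace")) // --pd=cluster-pd.namespace:2379
}
```

The host part, `cluster-pd.namespace`, is the Service name rather than the name of any individual PD pod.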

kennytm · 2024-04-24T19:12:27Z

@csuzhangxc what is actually seen from the log is that we received a DNS lookup error from CDC:

pd address (cluster-pd.namespace:2379) not available, error is
Get "https://cluster-pd.namespace:2379/pd/api/v1/config/cluster-version":
dial tcp: lookup cluster-pd.namespace on 100.64.0.10:53: no such host,
please check network: [BR:PD:ErrPDUpdateFailed]failed to update PD

is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?

csuzhangxc · 2024-04-25T02:36:43Z

@kennytm

is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?

No. A DNS resolution failure is usually caused by the PD pod being down (or by KubeDNS having problems).
