BR backup could raise error when PD leader changed during BR initialization #5630

Open
matchge-ca opened this issue Apr 21, 2024 · 3 comments

Comments

@matchge-ca

Bug Report

What version of Kubernetes are you using?

What version of TiDB Operator are you using?

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

What's the status of the TiDB cluster pods?

What did you do?

  1. Follow any official document to back up a cluster using a CR (for example, https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237)
  2. During BR initialization, switch the PD leader to a different pod or take the PD leader offline
  3. The BR job raises the following error:
    error=\"pd address not available, ..., dial tcp: lookup <pd addr>: no such host, please check network
  4. This is most likely because, when the operator executes BR, only the PD leader address is used to discover the PD cluster member list. TiUP BR allows multiple PD addresses on the command line, so a single PD failure during discovery is tolerated; maybe the operator should consider this as well. Code ref: https://github.com/pingcap/tidb-operator/blob/master/cmd/backup-manager/app/backup/backup.go#L237
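
The suggestion in step 4 could be sketched roughly as follows. This is only an illustration, not the operator's code: the helper names `pdMemberAddrs` and `pdFlag` are hypothetical, the per-pod DNS pattern `<cluster>-pd-<i>.<cluster>-pd-peer.<namespace>` is an assumption about how PD pods and their peer service are named, and it assumes BR accepts a comma-separated list in `--pd`, as the report implies.

```go
package main

import (
	"fmt"
	"strings"
)

// pdMemberAddrs lists one address per PD pod.
// ASSUMPTION: pods are reachable as
// <cluster>-pd-<i>.<cluster>-pd-peer.<namespace>:<port>; verify this
// against the actual cluster before relying on it.
func pdMemberAddrs(cluster, namespace string, replicas, port int) []string {
	addrs := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		addrs = append(addrs, fmt.Sprintf("%s-pd-%d.%s-pd-peer.%s:%d",
			cluster, i, cluster, namespace, port))
	}
	return addrs
}

// pdFlag joins the addresses into a single --pd argument.
// ASSUMPTION: BR accepts several PD addresses separated by commas.
func pdFlag(addrs []string) string {
	return "--pd=" + strings.Join(addrs, ",")
}

func main() {
	fmt.Println(pdFlag(pdMemberAddrs("cluster", "namespace", 3, 2379)))
}
```

With several addresses in the flag, discovery would no longer depend on a single name resolving at the moment BR starts.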

What did you expect to see?
BR is able to run when PD leader is offline during discovery

What did you see instead?
BR failed and raised an error

@csuzhangxc
Member

The address built by fmt.Sprintf("--pd=%s-pd.%s:%d", backup.Spec.BR.Cluster, clusterNamespace, v1alpha1.DefaultPDClientPort) points to a K8s Service with all PD members as its backend.

It should resolve to other PD members on different DNS lookup calls.
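
To check that behaviour from inside the cluster, a small probe along these lines might help. It is a sketch under assumptions: `pdServiceHost`, `lookupFunc` and `resolveN` are hypothetical names, and the cluster and namespace values are the placeholders that appear in the log in this thread. The lookup function is injectable so the logic can be exercised without a real DNS server.

```go
package main

import (
	"fmt"
	"net"
)

// pdServiceHost mirrors the host part of the --pd argument quoted above.
func pdServiceHost(cluster, namespace string) string {
	return fmt.Sprintf("%s-pd.%s", cluster, namespace)
}

// lookupFunc has the same shape as net.LookupHost so a fake can be swapped in.
type lookupFunc func(host string) ([]string, error)

// resolveN performs n independent lookups and records each result,
// or the error text, as a string.
func resolveN(lookup lookupFunc, host string, n int) []string {
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		addrs, err := lookup(host)
		if err != nil {
			out = append(out, "error: "+err.Error())
			continue
		}
		out = append(out, fmt.Sprint(addrs))
	}
	return out
}

func main() {
	host := pdServiceHost("cluster", "namespace")
	for i, r := range resolveN(net.LookupHost, host, 3) {
		fmt.Printf("lookup %d of %s: %s\n", i+1, host, r)
	}
}
```

Running it while the PD leader is being switched would show whether any individual lookup fails.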

@kennytm

kennytm commented Apr 24, 2024

@csuzhangxc what is actually seen from the log is that we received a DNS lookup error from CDC:

    pd address (cluster-pd.namespace:2379) not available, error is
    Get "https://cluster-pd.namespace:2379/pd/api/v1/config/cluster-version":
    dial tcp: lookup cluster-pd.namespace on 100.64.0.10:53: no such host,
    please check network: [BR:PD:ErrPDUpdateFailed]failed to update PD

is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?
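
For anyone trying to reproduce this, a lookup error can be classified in Go roughly as below. This is a sketch, not BR's code: `classifyLookupErr` and `sampleNotFound` are hypothetical helpers, and the sample error is built by hand from the values in the log above. As far as I know, Go's resolver reports both NXDOMAIN and an empty answer as "no such host", so this check alone would not tell the two cases apart.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// classifyLookupErr maps a lookup error to a coarse category using
// the exported fields of *net.DNSError.
func classifyLookupErr(err error) string {
	if err == nil {
		return "resolved"
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		switch {
		case dnsErr.IsNotFound:
			return "not found"
		case dnsErr.IsTimeout:
			return "timeout"
		case dnsErr.IsTemporary:
			return "temporary"
		}
		return "dns error"
	}
	return "other"
}

// sampleNotFound builds an error shaped like the one in the log
// (hand-made for illustration, not captured from a real lookup).
func sampleNotFound() error {
	return &net.DNSError{
		Err:        "no such host",
		Name:       "cluster-pd.namespace",
		Server:     "100.64.0.10:53",
		IsNotFound: true,
	}
}

func main() {
	err := sampleNotFound()
	fmt.Println(err)
	fmt.Println(classifyLookupErr(err))
}
```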

@csuzhangxc
Member

@kennytm

> is there any chance that switching PD leader will cause the DNS to report NXDOMAIN or return with zero A/AAAA records in the ANSWER section?

No. A DNS resolution failure is usually caused by the PD pod being down (or KubeDNS having problems).
