MetalLB ConfigMap not updated when node IP changes #412

Open · TimJones opened this issue May 16, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@TimJones (Contributor)

We recently ran into an issue using Equinix CCM v3.5.0 with MetalLB v0.12.1: the ConfigMap did not have the correct IP address for one of the nodes in the cluster and was never updated to correct the misconfiguration.

The node in question had an internal address of 10.68.104.15:

❯ kubectl --context adaptmx-DC get node omni-c3-medium-x86-2 -o wide
NAME                   STATUS   ROLES          AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
omni-c3-medium-x86-2   Ready    loadbalancer   3d22h   v1.26.1   10.68.104.15   147.75.51.245   Talos (v1.3.6)   5.15.102-talos   containerd://1.6.18

But for some reason the peer entries for that node in the MetalLB ConfigMap used a different source address:

apiVersion: v1
kind: ConfigMap
metadata:
  name: equinix-metallb
  namespace: sidero-ingress
data:
  config: |
    peers:
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.1
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.2
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""

This was causing the MetalLB speaker pod on that node to fail to establish its BGP sessions, so the node was not receiving traffic for the BGP LoadBalancer addresses:

❯ kubectl --context adaptmx-DC -n sidero-ingress logs dc-metallb-speaker-hbf5d
{"caller":"level.go:63","error":"dial \"169.254.255.2:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.2:179","peerASN":65530,"ts":"2023-05-16T11:45:42.880991863Z"}
{"caller":"level.go:63","error":"dial \"169.254.255.1:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.1:179","peerASN":65530,"ts":"2023-05-16T11:45:42.881017161Z"}

When I manually deleted the peer entries for the host from the ConfigMap and restarted the CCM, it regenerated the peers with the correct configuration:

❯ kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg
I0516 13:09:46.183404       1 serving.go:348] Generated self-signed cert in-memory
W0516 13:09:46.574009       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0516 13:09:46.574466       1 config.go:201] authToken: '<masked>'
I0516 13:09:46.574474       1 config.go:201] projectID: '83005521-48d5-4eae-bf4c-25c0f7d0fe97'
I0516 13:09:46.574477       1 config.go:201] load balancer config: 'metallb:///sidero-ingress/equinix-metallb'
I0516 13:09:46.574480       1 config.go:201] metro: ''
I0516 13:09:46.574484       1 config.go:201] facility: 'dc13'
I0516 13:09:46.574487       1 config.go:201] local ASN: '65000'
I0516 13:09:46.574490       1 config.go:201] Elastic IP Tag: ''
I0516 13:09:46.574493       1 config.go:201] API Server Port: '0'
I0516 13:09:46.574496       1 config.go:201] BGP Node Selector: ''
I0516 13:09:46.574535       1 controllermanager.go:145] Version: v3.5.0
I0516 13:09:46.575652       1 secure_serving.go:210] Serving securely on [::]:10258
I0516 13:09:46.575739       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0516 13:09:46.575864       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0516 13:10:03.423819       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0516 13:10:03.423965       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="omni-m3-small-x86-0_71489971-0cd0-474e-8316-e08565c637cd became leader"
I0516 13:10:03.831437       1 eip_controlplane_reconciliation.go:71] EIP Tag is not configured skipping control plane endpoint management.
I0516 13:10:04.182808       1 loadbalancers.go:86] loadbalancer implementation enabled: metallb
I0516 13:10:04.182841       1 cloud.go:98] Initialize of cloud provider complete
I0516 13:10:04.183323       1 controllermanager.go:301] Started "cloud-node"
I0516 13:10:04.183404       1 node_controller.go:157] Sending events to api server.
I0516 13:10:04.183562       1 node_controller.go:166] Waiting for informer caches to sync
I0516 13:10:04.183583       1 controllermanager.go:301] Started "cloud-node-lifecycle"
I0516 13:10:04.183714       1 node_lifecycle_controller.go:113] Sending events to api server
I0516 13:10:04.184011       1 controllermanager.go:301] Started "service"
I0516 13:10:04.184290       1 controller.go:241] Starting service controller
I0516 13:10:04.184345       1 shared_informer.go:255] Waiting for caches to sync for service
I0516 13:10:04.285005       1 shared_informer.go:262] Caches are synced for service
I0516 13:10:04.285429       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })
I0516 13:10:27.052320       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
I0516 13:10:32.721346       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"

We confirmed that the peer entries in the MetalLB ConfigMap were then correct and that the node was able to handle traffic again, but I would expect the CCM to detect this kind of configuration drift and update the ConfigMap on the fly.
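
For anyone else hitting this, the manual workaround above amounted to roughly the following (a sketch rather than exact commands: the edit step removes the stale peers entries for the affected node, and deleting the CCM pod lets its controller recreate it, which restarts the CCM):

❯ kubectl --context adaptmx-DC -n sidero-ingress edit configmap equinix-metallb
❯ kubectl --context adaptmx-DC -n kube-system delete pod cloud-provider-equinix-metal-vvztg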

@cprivitere (Member)

@ctreatma @displague Did we get the fix for this for free with the changes we made in 3.6.1 to properly support generating the peers for MetalLB <= 0.12.1?

@displague (Member)

displague commented May 17, 2023

Coming from 3.5.0, I'm not sure that the 3.6.0->3.6.1 fix would be a factor.

This EM API 500 error line (seen in the working state, after manually updating the peers) is suspicious. I wonder if it could have been preventing the config from being updated automatically:

E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })

@TimJones was this present in the logs before your manual update?

@displague added the kind/bug label May 17, 2023
@displague (Member)

displague commented May 17, 2023

#198 (comment) could be related too ("Other than at startup,...")

@TimJones (Contributor, Author)

@TimJones was this present in the logs before your manual update?

@displague Not as far as I saw. That error was logged only once, after I manually modified the ConfigMap and restarted the CCM. I've rechecked the logs and it hasn't appeared again, though the CCM hasn't logged anything at all since then either.

@cprivitere (Member)

Current thinking:

  • This could be the same issue as BGP peer password not being updated after initial startup #198, just another aspect impacted by it.
  • This could simply be fixed by an upgrade to 3.6.2 and/or MetalLB 0.13.x's CRD-style configuration (see the sketch at the end of this comment).
  • I'm not 100% sure how this is supposed to work: whether we're supposed to update these entries, or whether MetalLB is supposed to recreate them when the source node changes.

My main confusion is HOW this happened. If this is just the result of a service moving to another node, which is something that can simply happen in k8s (resources can move between nodes), then why isn't this happening to folks all the time?
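
For reference, under MetalLB 0.13.x's CRD-style configuration the first peer from the ConfigMap above would look roughly like the BGPPeer resource below (a sketch only: field names follow the metallb.io/v1beta2 API, while the resource name and the namespace are assumptions, not taken from this cluster):

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: omni-c3-medium-x86-2-peer-1   # hypothetical name
  namespace: sidero-ingress           # assumes MetalLB is installed in this namespace
spec:
  myASN: 65000
  peerASN: 65530
  peerAddress: 169.254.255.1
  sourceAddress: 10.68.104.15         # the node's actual InternalIP
  nodeSelectors:
  - matchLabels:
      kubernetes.io/hostname: omni-c3-medium-x86-2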

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned Mar 20, 2024
@cprivitere (Member)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Mar 20, 2024
@cprivitere (Member)

/reopen

@k8s-ci-robot reopened this Mar 20, 2024
@k8s-ci-robot (Contributor)

@cprivitere: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
