MetalLB ConfigMap not updated when node IP changes #412

Open · TimJones opened this issue May 16, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@TimJones (Contributor)

We recently ran into an issue using Equinix CCM v3.5.0 with MetalLB v0.12.1: the ConfigMap did not have the correct IP address for one of the nodes in the cluster and was never updated to correct the misconfiguration.

The node in question had an internal address of 10.68.104.15:

❯ kubectl --context adaptmx-DC get node omni-c3-medium-x86-2 -o wide
NAME                   STATUS   ROLES          AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
omni-c3-medium-x86-2   Ready    loadbalancer   3d22h   v1.26.1   10.68.104.15   147.75.51.245   Talos (v1.3.6)   5.15.102-talos   containerd://1.6.18

But for some reason the peer entries for that node in the MetalLB ConfigMap used a different source address:

apiVersion: v1
kind: ConfigMap
metadata:
  name: equinix-metallb
  namespace: sidero-ingress
data:
  config: |
    peers:
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.1
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.2
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""

This was causing the MetalLB speaker pod on that node to fail to establish its BGP sessions, so the node was not receiving traffic for the BGP LoadBalancer addresses:

❯ kubectl --context adaptmx-DC -n sidero-ingress logs dc-metallb-speaker-hbf5d
{"caller":"level.go:63","error":"dial \"169.254.255.2:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.2:179","peerASN":65530,"ts":"2023-05-16T11:45:42.880991863Z"}
{"caller":"level.go:63","error":"dial \"169.254.255.1:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.1:179","peerASN":65530,"ts":"2023-05-16T11:45:42.881017161Z"}

When I manually deleted the peer entries for the host from the ConfigMap and restarted the CCM, it regenerated the peers with the correct configuration:

❯ kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg
I0516 13:09:46.183404       1 serving.go:348] Generated self-signed cert in-memory
W0516 13:09:46.574009       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0516 13:09:46.574466       1 config.go:201] authToken: '<masked>'
I0516 13:09:46.574474       1 config.go:201] projectID: '83005521-48d5-4eae-bf4c-25c0f7d0fe97'
I0516 13:09:46.574477       1 config.go:201] load balancer config: 'metallb:///sidero-ingress/equinix-metallb'
I0516 13:09:46.574480       1 config.go:201] metro: ''
I0516 13:09:46.574484       1 config.go:201] facility: 'dc13'
I0516 13:09:46.574487       1 config.go:201] local ASN: '65000'
I0516 13:09:46.574490       1 config.go:201] Elastic IP Tag: ''
I0516 13:09:46.574493       1 config.go:201] API Server Port: '0'
I0516 13:09:46.574496       1 config.go:201] BGP Node Selector: ''
I0516 13:09:46.574535       1 controllermanager.go:145] Version: v3.5.0
I0516 13:09:46.575652       1 secure_serving.go:210] Serving securely on [::]:10258
I0516 13:09:46.575739       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0516 13:09:46.575864       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0516 13:10:03.423819       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0516 13:10:03.423965       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="omni-m3-small-x86-0_71489971-0cd0-474e-8316-e08565c637cd became leader"
I0516 13:10:03.831437       1 eip_controlplane_reconciliation.go:71] EIP Tag is not configured skipping control plane endpoint management.
I0516 13:10:04.182808       1 loadbalancers.go:86] loadbalancer implementation enabled: metallb
I0516 13:10:04.182841       1 cloud.go:98] Initialize of cloud provider complete
I0516 13:10:04.183323       1 controllermanager.go:301] Started "cloud-node"
I0516 13:10:04.183404       1 node_controller.go:157] Sending events to api server.
I0516 13:10:04.183562       1 node_controller.go:166] Waiting for informer caches to sync
I0516 13:10:04.183583       1 controllermanager.go:301] Started "cloud-node-lifecycle"
I0516 13:10:04.183714       1 node_lifecycle_controller.go:113] Sending events to api server
I0516 13:10:04.184011       1 controllermanager.go:301] Started "service"
I0516 13:10:04.184290       1 controller.go:241] Starting service controller
I0516 13:10:04.184345       1 shared_informer.go:255] Waiting for caches to sync for service
I0516 13:10:04.285005       1 shared_informer.go:262] Caches are synced for service
I0516 13:10:04.285429       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })
I0516 13:10:27.052320       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
I0516 13:10:32.721346       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"

We confirmed that the peer entries in the MetalLB ConfigMap were then correct and that the node was able to handle traffic again, but I would expect the CCM to detect this kind of configuration drift and update the ConfigMap on the fly.
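
For anyone else hitting this, the manual workaround above amounted to roughly the following (a sketch rather than exact commands: the edit step removes the stale peers entries for the affected node, and deleting the CCM pod lets its controller recreate it, which restarts the CCM):

❯ kubectl --context adaptmx-DC -n sidero-ingress edit configmap equinix-metallb
❯ kubectl --context adaptmx-DC -n kube-system delete pod cloud-provider-equinix-metal-vvztg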

@cprivitere (Member)

@ctreatma @displague Did we get the fix for this for free with the changes we made in 3.6.1 to properly support generating the peers for MetalLB <= 0.12.1?

@displague (Member)

displague commented May 17, 2023

Coming from 3.5.0, I'm not sure that the 3.6.0->3.6.1 fix would be a factor.

This EM API 500 error line (seen in the working state, after manually updating the peers) is suspicious. I wonder if it could have been preventing the config from being updated automatically:

E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })

@TimJones was this present in the logs before your manual update?

@displague added the kind/bug label May 17, 2023
@displague (Member)

displague commented May 17, 2023

#198 (comment) could be related too ("Other than at startup,...")

@TimJones (Contributor, Author)

@TimJones was this present in the logs before your manual update?

@displague Not as far as I saw. That error was logged only once, after I manually modified the ConfigMap and restarted the CCM. I've rechecked the logs and it hasn't appeared again, though the CCM hasn't logged anything at all since then either.

@cprivitere (Member)

Current thinking:

  • This could be the same issue as BGP peer password not being updated after initial startup #198, just another aspect impacted by it.
  • This could simply be fixed by an upgrade to 3.6.2 and/or MetalLB 0.13.x's CRD-style configuration (see the sketch at the end of this comment).
  • I'm not 100% sure how this is supposed to work: whether we're supposed to update these entries, or whether MetalLB is supposed to recreate them when the source node changes.

My main confusion is HOW this happened. If this is just the result of a service moving to another node, which is something that can simply happen in k8s (resources can move between nodes), then why isn't this happening to folks all the time?
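
For reference, under MetalLB 0.13.x's CRD-style configuration the first peer from the ConfigMap above would look roughly like the BGPPeer resource below (a sketch only: field names follow the metallb.io/v1beta2 API, while the resource name and the namespace are assumptions, not taken from this cluster):

apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: omni-c3-medium-x86-2-peer-1   # hypothetical name
  namespace: sidero-ingress           # assumes MetalLB is installed in this namespace
spec:
  myASN: 65000
  peerASN: 65530
  peerAddress: 169.254.255.1
  sourceAddress: 10.68.104.15         # the node's actual InternalIP
  nodeSelectors:
  - matchLabels:
      kubernetes.io/hostname: omni-c3-medium-x86-2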

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned Mar 20, 2024
@cprivitere (Member)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Mar 20, 2024
@cprivitere (Member)

/reopen

@k8s-ci-robot reopened this Mar 20, 2024
@k8s-ci-robot (Contributor)

@cprivitere: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
