
How to ignore MetalLB trying to provision CPEM LoadBalancer? #389

Open
Lirt opened this issue Mar 8, 2023 · 12 comments
Labels: lifecycle/rotten, triage/accepted

Lirt commented Mar 8, 2023

Hello,

This is a rather complicated issue, but I'll try to explain it as simply as I can.

I have the standard LoadBalancer service provisioned by CPEM:

$ k get svc
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer

I use MetalLB to provision additional LoadBalancer services; currently just one, ingress-nginx-caas-controller, as a test case.

The issue is that MetalLB watches the cloud-provider-equinix-metal-kubernetes-external service by default and fights with CPEM over updates to it. This is easy to see: as soon as I start the MetalLB controller, the cloud-provider-equinix-metal-kubernetes-external service changes to this (note the <pending> external IP):

$ k get svc
NAME                                                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>     443:32557/TCP            49d

This is the service description, including the latest events, which shows that MetalLB is actually making changes to this service:

Name:                     cloud-provider-equinix-metal-kubernetes-external
Namespace:                kube-system
Labels:                   <none>
Annotations:              metallb.universe.tf/address-pool: disabled-metallb-do-not-use-any-address-pool
Selector:                 <none>
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.26.85.165
IPs:                      172.26.85.165
IP:                       <REDACTED>
Port:                     https  443/TCP
TargetPort:               6443/TCP
NodePort:                 https  32557/TCP
Endpoints:                10.68.53.131:6443,10.68.53.137:6443,10.68.53.139:6443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                Age                From                Message
  ----     ------                ----               ----                -------
  Normal   EnsuringLoadBalancer  44m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   44m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  35m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   35m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  17m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   17m                service-controller  Ensured load balancer
  Warning  AllocationFailed      84s (x3 over 84s)  metallb-controller  Failed to allocate IP for "kube-system/cloud-provider-equinix-metal-kubernetes-external": ["<REDACTED>"] is not allowed in config

Equinix Metal support told us we make 15k IP assignments per day. It's most likely caused by the situation described above.

So I wanted to use a new MetalLB (0.13) feature that sets the loadBalancerClass MetalLB will watch - https://github.com/metallb/metallb/blob/77923bc823294f2f31e68193901efa3b30faea59/controller/main.go. Simply define --lb-class my-lb-class.
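
For illustration, here is a minimal sketch of how that flag could be passed to the MetalLB controller. The Deployment name, namespace, image tag, and the other args are assumptions based on a stock MetalLB install, not taken from our cluster:

# Hypothetical excerpt of the MetalLB controller Deployment, patched so the
# controller only reconciles Services whose spec.loadBalancerClass matches
# the configured class.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller           # assumed default name from the MetalLB manifests
  namespace: metallb-system  # assumed default namespace
spec:
  template:
    spec:
      containers:
      - name: controller
        image: quay.io/metallb/controller:v0.13.7   # assumed image/tag
        args:
        - --port=7472                # assumed default metrics port flag
        - --lb-class=my-lb-class     # MetalLB ignores Services without this loadBalancerClass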

MetalLB stops updating cloud-provider-equinix-metal-kubernetes-external as expected. This is good.

But then CPEM doesn't see events for services with a loadBalancerClass. Meaning that when I create or delete a service that sets loadBalancerClass, nothing happens in CPEM.

After long troubleshooting I found out that this behavior is defined in the ServiceController that CPEM uses and is expected to happen - please see this code.

Now 😄, seeing that those two controllers don't work well together, my question is: do you have a recommended way to make this setup work correctly without DoS-ing your API? Or please point me to where I'm making a mistake, if I am.

I understand that this part of the code is very unlikely to change. If MetalLB had decided to just use an annotation to ignore a service, everything would be fine 😃, but they actually used an attribute that is ignored by the cloud-provider library.
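
For comparison, the annotation-based exclusion already present on the CPEM service (visible in the describe output above) would have been applied roughly like this; as the AllocationFailed events show, it only makes MetalLB's IP allocation fail, it does not stop MetalLB from watching and updating the service:

$ kubectl -n kube-system annotate service cloud-provider-equinix-metal-kubernetes-external \
    metallb.universe.tf/address-pool=disabled-metallb-do-not-use-any-address-pool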

The issue is easy to replicate - here is an example of the service I create (this service goes unnoticed by CPEM):

---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-caas-controller
  namespace: kube-system
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: true
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerClass: my-lb-class
  ports:
  - appProtocol: http
    name: http
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx

Note: tested with the latest main (#386). I think this issue was also present before and is not related to the recent changes.

cprivitere (Member) commented Mar 8, 2023

@Lirt I thought the 15k IP assignments per day were due to bug #380. Are you still doing that many assignments after the fix for #380 was installed?

Lirt (Author) commented Mar 8, 2023

Hmmm, it's hard to tell which one caused the IP assignment DoS. But in our case the reason the service stays in pending forever is this one (it disappears after I stop the metallb-controller). I don't think I have a way to see how many requests are being made right now...

You could check the counters again in a day (or check what the rate is right now, if that helps).

cprivitere (Member) commented:

Thanks @Lirt. We've done some checking and validated that the actual cause of the error was on our API's side. No fixes to CPEM resolved it, and you're not causing any additional assignments right now.

I appreciate that you're trying to leverage LoadBalancerClass to avoid ever accidentally triggering this again, but this particular issue can't actually be prevented that way. It was truly on the Equinix Metal API side of things.

What we CAN do is implement better rate limiting and error handling, and that's something we've targeted for CPEM, but I don't have a timeframe for when it will be done.

If you're still interested in using LoadBalancerClass, we can continue to look at how to make CPEM interact with it better and not run into this issue.

ctreatma added this to the v3.7 milestone Mar 8, 2023
Lirt (Author) commented Mar 9, 2023

Thank you for the help.

This is not that important for us as long as it's not causing you internal trouble. My impression was that this was causing a high number of IP assignment requests, but if not, then that's good.

So right now the only thing that is "off" is a cosmetic issue - the external IP of the Service stuck in the <pending> state.

cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>       443:32557/TCP                49d

cprivitere (Member) commented:

Understood. Even if it's just a cosmetic issue, knowing that you're going to continue using LoadBalancerClass helps us prioritize this versus other issues when we consider what to fix next. Thank you.

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 19, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 18, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot closed this as not planned Mar 19, 2024
k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cprivitere (Member) commented:

/reopen

k8s-ci-robot reopened this May 14, 2024
k8s-ci-robot (Contributor) commented:

@cprivitere: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cprivitere (Member) commented:

/triage accepted

k8s-ci-robot added the triage/accepted label May 14, 2024