
How to ignore MetalLB trying to provision CPEM LoadBalancer? #389

Open
Lirt opened this issue Mar 8, 2023 · 12 comments
Labels: lifecycle/rotten, triage/accepted

Lirt commented Mar 8, 2023

Hello,

This is a rather complicated issue, but I'll try to explain it as simply as I can.

I have the standard LoadBalancer service provisioned by CPEM:

$ k get svc
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer

I use MetalLB to provision additional LoadBalancer services; currently just one, ingress-nginx-caas-controller, as a test case.

The issue is that MetalLB watches the cloud-provider-equinix-metal-kubernetes-external service by default and fights with CPEM over updates to it. This is easy to see: as soon as I start the MetalLB controller, the cloud-provider-equinix-metal-kubernetes-external service changes to this (note the <pending> external IP):

$ k get svc
NAME                                                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>     443:32557/TCP            49d

This is the service description, including the latest events, which shows that MetalLB is actually making changes to this service:

Name:                     cloud-provider-equinix-metal-kubernetes-external
Namespace:                kube-system
Labels:                   <none>
Annotations:              metallb.universe.tf/address-pool: disabled-metallb-do-not-use-any-address-pool
Selector:                 <none>
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.26.85.165
IPs:                      172.26.85.165
IP:                       <REDACTED>
Port:                     https  443/TCP
TargetPort:               6443/TCP
NodePort:                 https  32557/TCP
Endpoints:                10.68.53.131:6443,10.68.53.137:6443,10.68.53.139:6443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                Age                From                Message
  ----     ------                ----               ----                -------
  Normal   EnsuringLoadBalancer  44m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   44m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  35m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   35m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  17m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   17m                service-controller  Ensured load balancer
  Warning  AllocationFailed      84s (x3 over 84s)  metallb-controller  Failed to allocate IP for "kube-system/cloud-provider-equinix-metal-kubernetes-external": ["<REDACTED>"] is not allowed in config

Equinix Metal support told us we make 15k IP assignments per day. It's most likely caused by the situation described above.

So I wanted to use a new MetalLB (0.13) feature that sets the loadBalancerClass MetalLB will watch - https://github.com/metallb/metallb/blob/77923bc823294f2f31e68193901efa3b30faea59/controller/main.go. Simply define --lb-class my-lb-class.
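
For illustration, here is a minimal sketch of how that flag could be passed to the MetalLB controller. The Deployment name, namespace, image tag, and the other args are assumptions based on a stock MetalLB install, not taken from our cluster:

# Hypothetical excerpt of the MetalLB controller Deployment, patched so the
# controller only reconciles Services whose spec.loadBalancerClass matches
# the configured class.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller           # assumed default name from the MetalLB manifests
  namespace: metallb-system  # assumed default namespace
spec:
  template:
    spec:
      containers:
      - name: controller
        image: quay.io/metallb/controller:v0.13.7   # assumed image/tag
        args:
        - --port=7472                # assumed default metrics port flag
        - --lb-class=my-lb-class     # MetalLB ignores Services without this loadBalancerClass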

MetalLB stops updating cloud-provider-equinix-metal-kubernetes-external as expected. This is good.

But then CPEM doesn't see events for services with a loadBalancerClass. Meaning that when I create or delete a service that sets loadBalancerClass, nothing happens in CPEM.

After long troubleshooting I found out that this behavior is defined in the ServiceController that CPEM uses and is expected to happen - please see this code.

Now 😄, seeing that those two controllers don't work well together, my question is: do you have a recommended way to make this setup work correctly without DoS-ing your API? Or please point me to where I'm making a mistake, if I am.

I understand that this part of the code is very unlikely to change. If MetalLB had decided to just use an annotation to ignore a service, everything would be fine 😃, but they actually used an attribute that is ignored by the cloud-provider library.
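
For comparison, the annotation-based exclusion already present on the CPEM service (visible in the describe output above) would have been applied roughly like this; as the AllocationFailed events show, it only makes MetalLB's IP allocation fail, it does not stop MetalLB from watching and updating the service:

$ kubectl -n kube-system annotate service cloud-provider-equinix-metal-kubernetes-external \
    metallb.universe.tf/address-pool=disabled-metallb-do-not-use-any-address-pool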

The issue is easy to replicate - here is an example of the service I create (this service goes unnoticed by CPEM):

---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-caas-controller
  namespace: kube-system
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: true
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerClass: my-lb-class
  ports:
  - appProtocol: http
    name: http
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx

Note: tested with the latest main (#386). I think this issue was also present before and is not related to the recent changes.

cprivitere (Member) commented Mar 8, 2023

@Lirt I thought the 15k IP assignments per day were due to bug #380. Are you still doing that many assignments after the fix for #380 was installed?

Lirt (Author) commented Mar 8, 2023

Hmmm, it's hard to tell which one caused the IP assignment DoS. But in our case the reason the service stays in pending forever is this one (it disappears after I stop the metallb-controller). I don't think I have a way to see how many requests are being made right now...

You could check the counters again in a day (or check what the rate is right now, if that helps).

cprivitere (Member) commented:

Thanks @Lirt. We've done some checking and validated that the actual cause of the error was on our API's side. No fixes to CPEM resolved it, and you're not causing any additional assignments right now.

I appreciate that you're trying to leverage LoadBalancerClass to avoid ever accidentally triggering this again, but this particular issue can't actually be prevented that way. It was truly on the Equinix Metal API side of things.

What we CAN do is implement better rate limiting and error handling, and that's something we've targeted for CPEM, but I don't have a timeframe for when it will be done.

If you're still interested in using LoadBalancerClass, we can continue to look at how to make CPEM interact with it better and not run into this issue.

ctreatma added this to the v3.7 milestone Mar 8, 2023
Lirt (Author) commented Mar 9, 2023

Thank you for the help.

This is not that important for us as long as it's not causing you internal trouble. My impression was that this was causing a high number of IP assignment requests, but if not, then that's good.

So right now the only thing that is "off" is a cosmetic issue - the external IP of the Service stuck in the <pending> state.

cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>       443:32557/TCP                49d

cprivitere (Member) commented:

Understood. Even if it's just a cosmetic issue, knowing that you're going to continue using LoadBalancerClass helps us prioritize this versus other issues when we consider what to fix next. Thank you.

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 19, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 18, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot closed this as not planned Mar 19, 2024
k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cprivitere (Member) commented:

/reopen

k8s-ci-robot reopened this May 14, 2024
k8s-ci-robot (Contributor) commented:

@cprivitere: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

cprivitere (Member) commented:

/triage accepted

k8s-ci-robot added the triage/accepted label May 14, 2024