Guidance for rate limiting configuration for large cluster #5904

Open
desek opened this issue Apr 11, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

desek commented Apr 11, 2024

What happened:

We're running a large cluster of standard VMs that auto-scales between 50 and 650 nodes. At peak this results in:

  • ~600 services of type LoadBalancer, each with a public IP (the services are TCP-based, so Ingress is not an option)
  • ~10000 Pods in the cluster

When running the cloud-controller-manager without any rate limiter configuration, it eventually reconciles after a couple of hours. During those hours we get a lot of 429s, which basically makes the subscription unusable: more or less every operation gets throttled by Azure (disk attach/detach, VM scaling/reconfiguration, etc.).

When running the cloud-controller-manager with the rate limiter enabled, I've tried many QPS/bucket combinations. They always end up with client throttling, in a state where the services never reconcile (that is, they never get a public IP).
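
To illustrate how a QPS/bucket pair bounds a burst of calls, here is a minimal sketch of a generic token bucket in Python. It assumes the textbook model (the bucket starts full, refills at QPS tokens per second, and rejects a call instead of queueing it when empty). The class and function names are hypothetical and are not taken from cloud-controller-manager, whose limiter may behave differently.

```python
class TokenBucket:
    """Generic token bucket (assumed model, not the cloud-controller-manager code)."""

    def __init__(self, qps, bucket):
        self.qps = float(qps)          # refill rate in tokens per second
        self.capacity = float(bucket)  # maximum number of stored tokens
        self.tokens = float(bucket)    # the bucket starts full
        self.last = 0.0                # time of the previous refill, in seconds

    def try_accept(self, now):
        """Admit one call at time `now` if a token is available, else reject."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def burst_accepted(qps, bucket, requests):
    """Count how many of `requests` simultaneous calls are admitted at t=0."""
    limiter = TokenBucket(qps, bucket)
    return sum(1 for _ in range(requests) if limiter.try_accept(0.0))


def seconds_to_admit(qps, bucket, requests):
    """Lower bound on the time needed to admit `requests` calls in total."""
    return max(0, requests - bucket) / float(qps)
```

Under this model, with QPS 10 and bucket 100, a simultaneous burst of 600 calls has 100 admitted immediately, and the remaining 500 need at least 50 seconds of refill before they can all pass.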

What you expected to happen:

Services reconcile "as fast as possible".

How to reproduce it (as minimally and precisely as possible):

azure.json

{
    "aadClientId": "redacted",
    "aadClientSecret": "redacted",
    "cloud": "AzurePublicCloud",
    "cloudProviderBackoff": true,
    "cloudProviderBackoffDuration": 5,
    "cloudProviderBackoffRetries": 1,
    "cloudProviderRateLimit": true,
    "cloudProviderRateLimitBucket": 100,
    "cloudProviderRateLimitBucketWrite": 100,
    "cloudProviderRateLimitQPS": 10,
    "cloudProviderRateLimitQPSWrite": 10,
    "disableAvailabilitySetNodes": false,
    "enableVmssFlexNodes": true,
    "loadBalancerBackendPoolConfigurationType": "nodeIP",
    "loadBalancerName": "",
    "loadBalancerRateLimit": {
        "cloudProviderRateLimit": true,
        "cloudProviderRateLimitBucket": 100,
        "cloudProviderRateLimitBucketWrite": 100,
        "cloudProviderRateLimitQPS": 10,
        "cloudProviderRateLimitQPSWrite": 10
    },
    "loadBalancerSku": "Standard",
    "location": "northeurope",
    "maximumLoadBalancerRuleCount": 500,
    "multipleStandardLoadBalancerConfigurations": [
        {
            "name": "lin",
            "nodeSelector": {
                "matchLabels": {
                    "agentpool": "lin"
                }
            },
            "primaryVMSet": "lin"
        },
        {
            "name": "win",
            "nodeSelector": {
                "matchLabels": {
                    "agentpool": "win"
                }
            },
            "primaryVMSet": "win"
        }
    ],
    "publicIPAddressRateLimit": {
        "cloudProviderRateLimit": true,
        "cloudProviderRateLimitBucket": 100,
        "cloudProviderRateLimitBucketWrite": 100,
        "cloudProviderRateLimitQPS": 10,
        "cloudProviderRateLimitQPSWrite": 10
    },
    "resourceGroup": "redacted",
    "routeTableName": "redacted",
    "securityGroupName": "redacted",
    "securityGroupResourceGroup": "redacted",
    "subnetName": "redacted",
    "subscriptionId": "redacted",
    "tenantId": "redacted",
    "useInstanceMetadata": true,
    "useManagedIdentityExtension": false,
    "virtualMachineRateLimit": {
        "cloudProviderRateLimit": true,
        "cloudProviderRateLimitBucket": 100,
        "cloudProviderRateLimitBucketWrite": 100,
        "cloudProviderRateLimitQPS": 10,
        "cloudProviderRateLimitQPSWrite": 10
    },
    "vmType": "standard",
    "vnetName": "redacted",
    "vnetResourceGroup": "redacted"
}

Anything else we need to know?:

  • Tested with cloud-controller-manager:v1.29.3

Sample error:

I0411 08:32:54.660563       1 event.go:376] "Event occurred" object="3d4e3d44-98c4-4265-8a13-03be2d0bd1cd/media" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: Retriable: true, RetryAfter: 11s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMList with reason \"client throttled\""
I0411 08:34:49.199640       1 controller.go:398] Ensuring load balancer for service 5bba9f76-80ba-4151-8358-fbcacf137cb2/media
I0411 08:34:49.199697       1 event.go:376] "Event occurred" object="44e8c3c7-347c-4dff-9d77-8b16c8affeff/media" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
	Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 16s, HTTPStatusCode: 429, RawError: {
	  "error": {
	    "details": [
	      {
	        "code": "TooManyRequests",
	        "message": "{\"operationGroup\":\"HighCostGetSubscriptionMaximum\",\"startTime\":\"2024-04-11T08:34:49.1938207+00:00\",\"endTime\":\"2024-04-11T08:35:05.4063841+00:00\",\"allowedRequestCount\":300,\"measuredRequestCount\":301}",
	        "target": "HighCostGetSubscriptionMaximum"
	      }
	    ],
	    "innererror": {
	      "internalErrorCode": "TooManyRequestsReceived"
	    },
	    "code": "OperationNotAllowed",
	    "message": "The server rejected the request because too many requests have been received for this subscription."
	  }
	}
 >
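
The 429 body above carries the throttling window in its nested `message` field. As a rough aid for reading it, the sketch below parses that string (copied verbatim from the log) and derives the window length and the implied allowed request rate. The helper names are hypothetical, and treating `allowedRequestCount` divided by the window length as a sustained rate is an assumption about how to interpret the fields, not documented Azure behavior.

```python
import json
import re
from datetime import datetime

# Copied from the "message" field of the TooManyRequests detail in the log above.
THROTTLE_MESSAGE = (
    '{"operationGroup":"HighCostGetSubscriptionMaximum",'
    '"startTime":"2024-04-11T08:34:49.1938207+00:00",'
    '"endTime":"2024-04-11T08:35:05.4063841+00:00",'
    '"allowedRequestCount":300,"measuredRequestCount":301}'
)


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp, truncating fractions to microseconds."""
    # The log uses 7 fractional digits; datetime only stores 6.
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", value)
    return datetime.fromisoformat(trimmed)


def summarize_throttle(message):
    """Return the throttling window and request counts from a 429 message."""
    info = json.loads(message)
    start = parse_timestamp(info["startTime"])
    end = parse_timestamp(info["endTime"])
    window = (end - start).total_seconds()
    return {
        "group": info["operationGroup"],
        "window_seconds": window,
        "allowed": info["allowedRequestCount"],
        "measured": info["measuredRequestCount"],
        # Assumption: the allowance is spread evenly over the window.
        "allowed_per_second": info["allowedRequestCount"] / window,
    }


summary = summarize_throttle(THROTTLE_MESSAGE)
```

For this message the window is roughly 16.2 seconds, which lines up with the `RetryAfter: 16s` in the same log entry, and 300 allowed requests over that window works out to about 18.5 requests per second for the `HighCostGetSubscriptionMaximum` group.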

Environment:

  • Kubernetes version (use kubectl version): 1.29.1
  • Cloud provider or hardware configuration: Azure standard VMs
  • OS (e.g: cat /etc/os-release): Ubuntu 22 and Windows
  • Kernel (e.g. uname -a):
  • Install tools: CAPI with CAPZ
  • Network plugin and version (if this is a network-related bug): Calico 3.26.1
  • Others:
desek added the kind/bug label Apr 11, 2024