Need help in determining the best approach to use multiple ArgoCD instances to handle 10k clusters and 400k applications #15345

jchikatur · 2023-09-04T17:39:32Z

jchikatur
Sep 4, 2023

Context

In our org, we are in the process of enabling ArgoCD to manage a very large number of k8s clusters (projected at around 10k). Each cluster will have around 40 cluster add-ons enabled through ApplicationSets, so we expect at least 400k applications even before other user applications are deployed and managed. The ArgoCD instances will be managed centrally and provided as a service for other teams in our org to deploy and manage their applications on k8s clusters. Teams will be given appropriate access through RBAC.

To better manage these applications and clusters and to reduce the blast radius, we decided to go with multiple ArgoCD instances, each managing a subset of these clusters (we haven't finalized the number of instances yet). Each instance will shard its application controller appropriately to handle a large number of clusters and applications.
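As a rough back-of-the-envelope sketch of the sizing question, the arithmetic can be written out as below. The 10k clusters and 40 add-ons come from the post; the per-instance ceiling of 50,000 applications is purely a hypothetical placeholder, not a tested Argo CD limit, and should be replaced with a number from your own load tests.

```python
import math


def plan_instances(clusters: int, addons_per_cluster: int, max_apps_per_instance: int) -> dict:
    """Estimate how many instances are needed for a given per-instance ceiling."""
    total_apps = clusters * addons_per_cluster
    instances = math.ceil(total_apps / max_apps_per_instance)
    clusters_per_instance = math.ceil(clusters / instances)
    return {
        "total_apps": total_apps,
        "instances": instances,
        "clusters_per_instance": clusters_per_instance,
    }


# 50_000 is an assumed ceiling for illustration only.
plan = plan_instances(clusters=10_000, addons_per_cluster=40, max_apps_per_instance=50_000)
print(plan)
```

User applications come on top of the add-ons, so the real instance count would be higher than this estimate.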

Approaches

The problem we are running into is deciding which strategy to use to divide the clusters among multiple instances.

First Approach - Region based ArgoCD instances

Until now we were leaning towards a region-based division, where we set up one or more ArgoCD instances per cloud region (we have a combination of public and private cloud in each region) to manage the clusters in that region. But there seem to be a few issues with that approach.

ApplicationSets: a team in our org might own a small subset of clusters spread across regions and, with this approach, spread across ArgoCD instances. If they want ApplicationSets for those clusters, how would we go about it? It seems to me that we would effectively lose this functionality.
Deployment and the decision to roll back: a team might have a multi-cluster application whose clusters are managed by different ArgoCD instances. We would need an external system that maintains an inventory of cluster-to-ArgoCD-instance mappings (which we are building anyway for a different purpose), and an orchestrator that correlates the inventory data with the deployment requirements and creates applications on those ArgoCD instances. We would also need a system for deciding to roll back if there is an issue on any of the clusters where the application is deployed. Is there an Argo-native approach (for example, a solution involving Rollouts or Workflows) we can explore?
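To make the external inventory and orchestrator idea concrete, here is a minimal sketch. Everything in it is hypothetical: the cluster and instance names, the in-memory inventory, and the "roll back if any cluster is Degraded" policy are assumptions for illustration, and a real orchestrator would call each instance's API instead of returning a plan.

```python
from collections import defaultdict

# Hypothetical inventory: cluster name -> ArgoCD instance that manages it.
INVENTORY = {
    "cluster-eu-1": "argocd-eu",
    "cluster-eu-2": "argocd-eu",
    "cluster-us-1": "argocd-us",
}


def plan_deployment(target_clusters, inventory):
    """Group an application's target clusters by the instance that owns them."""
    plan = defaultdict(list)
    for cluster in target_clusters:
        instance = inventory[cluster]  # raises KeyError for an unknown cluster
        plan[instance].append(cluster)
    return dict(plan)


def should_roll_back(health_by_cluster):
    """Assumed policy: roll back everywhere if any cluster reports Degraded."""
    return any(status == "Degraded" for status in health_by_cluster.values())


deployment_plan = plan_deployment(["cluster-eu-1", "cluster-us-1", "cluster-eu-2"], INVENTORY)
print(deployment_plan)
print(should_roll_back({"cluster-eu-1": "Healthy", "cluster-us-1": "Degraded"}))
```

The rollback policy is the part most worth debating; a threshold or a per-region decision may suit some applications better than all-or-nothing.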

Has this approach been explored before?
Can the issues described above be solved in any way?
Are there any potential issues you see with this approach?

Second Approach - Usecase based ArgoCD instances

The second approach we are considering is to have ArgoCD instance(s) for each type of cluster. We have different types, such as dedicated and shared. Under shared, there are multiple use cases and environments. This way we can divide the workloads among multiple ArgoCD instances. Do you see any merits to this approach?
We could have multiple ArgoCD instances manage a particular use case if the number of clusters and applications becomes too high for a single ArgoCD instance to handle. But this could lead to the same problems with ApplicationSets mentioned earlier.

Are there any other approaches you would recommend that we could explore for our scenario?

Any input would be really helpful for our ArgoCD scaling effort and for choosing the best possible approach.

FernandoMiguel · 2023-09-05T11:56:43Z

FernandoMiguel
Sep 5, 2023

We use an approach where workloads carry specific key/value pairs.
Each cluster has some of those keys too.
Each ApplicationSet matrix matches them against each other; when they match, the workload is deployed.
So for your approach, since each Argo CD instance would try to match workloads against the target clusters it manages, a workload would only be deployed to the clusters that instance is responsible for.
So a workload can state multiple destinations, and each Argo CD instance deploys it to the targets it manages.
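The matching idea can be sketched roughly as follows. This is not an ApplicationSet definition, only an illustration of the logic; the label keys, cluster names, and instance groupings are all made up.

```python
def matches(selector: dict, labels: dict) -> bool:
    """A cluster matches when it carries every key/value the workload asks for."""
    return all(labels.get(key) == value for key, value in selector.items())


def targets_for_instance(workload_selector: dict, managed_clusters: dict) -> list:
    """Return only the clusters this instance manages that match the workload."""
    return sorted(
        name for name, labels in managed_clusters.items()
        if matches(workload_selector, labels)
    )


# Hypothetical workload and two hypothetical instances, each with its own clusters.
workload = {"team": "payments", "env": "prod"}
instance_eu = {
    "eu-1": {"team": "payments", "env": "prod"},
    "eu-2": {"team": "search", "env": "prod"},
}
instance_us = {
    "us-1": {"team": "payments", "env": "prod"},
    "us-2": {"team": "payments", "env": "dev"},
}

print(targets_for_instance(workload, instance_eu))
print(targets_for_instance(workload, instance_us))
```

The same workload definition is handed to every instance, and each one deploys only where its own clusters match, so no instance needs to know about the others.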


todaywasawesome · 2024-03-12T16:30:46Z

todaywasawesome
Mar 12, 2024
Collaborator

@jchikatur I know your post is a bit old, but I wanted to share some recommendations. I keep a blog updated with a comprehensive overview of Argo architectures, along with linked resources on scaling Argo CD effectively and what to look out for.

In general, I would consider:

How will this architecture support my deployment strategy?
What's the developer/operator experience?
Managing the blast radius of a failed Argo instance
Managing permissions and access
Controlling for special networking considerations (are the clusters on a VPC, at edge etc?)

On the last point, how related are these clusters? It sounds like you would want a strategy to roll out a new version of an application to all of these different endpoints and maintain visibility into that rollout, along with decisions about rollbacks, etc. If these are edge clusters, I typically recommend using an Argo CD instance per cluster. In either case, this is best managed with external systems like scripted CI/CD, Codefresh, etc. Progressive sync is a new experimental feature that could also be useful.

On the scale side, it sounds like you have a good handle on the levers you'll have to pull to shard clusters, etc. That said, I think this is an area of Argo CD with less testing and fewer people validating it, so you'll want to test and make sure shard performance is where you want it.
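One cheap check before a real load test is to simulate how evenly clusters would spread across controller shards. The sketch below uses a plain CRC32 hash modulo the shard count, which is an assumption made for illustration and not Argo CD's own assignment algorithm; the cluster names and the counts (1,250 clusters, 10 shards) are hypothetical as well.

```python
import zlib
from collections import Counter


def shard_for(cluster_name: str, shard_count: int) -> int:
    """Deterministically map a cluster name to a shard index (illustrative scheme)."""
    return zlib.crc32(cluster_name.encode("utf-8")) % shard_count


def shard_load(cluster_names, shard_count: int) -> Counter:
    """Count how many clusters land on each shard."""
    return Counter(shard_for(name, shard_count) for name in cluster_names)


# Hypothetical fleet for a single instance.
clusters = [f"cluster-{i:05d}" for i in range(1250)]
load = shard_load(clusters, shard_count=10)

for shard, count in sorted(load.items()):
    print(f"shard {shard}: {count} clusters")
```

An even cluster count does not guarantee even load, since clusters differ in application and resource counts, which is why measuring actual shard performance still matters.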

On the point about blast radius, Argo CD is very reliable but depending on the scale I would think differently about noisy neighbors and the general impact of potential failures.

The repo strategy is going to be critical here. If you go with a monorepo, you'll need to think about CODEOWNERS files (which you should think about anyway), as well as the size of the repo and the performance impact that will have, since Argo CD does not yet support sparse checkout, etc. For all the other items, your repo strategy is probably at least as important as your instances.


ronklein · 2024-05-02T08:56:19Z

ronklein
May 2, 2024

@jchikatur, which approach did you eventually take?
We are also looking for a solution to deploy about 500k applications (spread over 10k clusters).

