Replies: 3 comments
-
We use an approach where workloads have specific key values.
-
@jchikatur I know your post is a bit old, but I wanted to share some recommendations. I keep a blog updated with a comprehensive overview of Argo architectures, along with some linked resources on scaling Argo CD effectively and what to look out for. In general, I would consider:
On the last point, how related are these clusters? It sounds like you want a strategy to roll out a new version of an application to all of these different endpoints while maintaining visibility into that rollout, along with decision mapping, rollbacks, etc. If these are edge clusters, I typically recommend an Argo CD instance per cluster. In either case, this is best managed with external systems such as scripted CI/CD or Codefresh. Progressive sync is a new experimental feature that could also be useful.
On the scale side, it sounds like you have a good handle on the levers you'll have to pull to shard clusters. That said, I think this is an area of Argo CD with less testing and fewer people validating it, so you'll want to test and make sure shard performance is where you want it.
On blast radius, Argo CD is very reliable, but depending on the scale I would think differently about noisy neighbors and the general impact of potential failures.
The repo strategy is going to be critical here. If you go with a monorepo, you'll need to think about CODEOWNERS files (which you should think about anyway), as well as the size of the repo and its performance impact, since Argo CD does not yet support sparse checkout. For all the other items, your repo strategy is probably at least as important as your instances.
-
@jchikatur, which approach did you eventually take?
-
Context
In our org, we are in the process of enabling ArgoCD to manage a very large number of k8s clusters (projected at around 10k). Each cluster will have around 40 cluster add-ons enabled through ApplicationSets, so around 400k applications at a minimum, even before other user applications are deployed and managed. The ArgoCD instances will be managed centrally and provided as a service for other teams in our org to deploy and manage their applications on k8s clusters. Teams will be given appropriate access through RBAC.
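To make the numbers above concrete, here is a rough back-of-the-envelope sketch. The cluster and add-on counts come from the projection above; the instance counts in the loop are purely hypothetical, since the number of instances has not been decided, and user applications are not included.

```python
import math


def total_addon_apps(clusters: int, addons_per_cluster: int) -> int:
    """Applications generated by add-ons alone (one app per add-on per cluster)."""
    return clusters * addons_per_cluster


def apps_per_instance(total_apps: int, instances: int) -> int:
    """Average applications per instance, rounded up, assuming an even split."""
    return math.ceil(total_apps / instances)


CLUSTERS = 10_000          # projected cluster count from the post
ADDONS_PER_CLUSTER = 40    # projected add-ons per cluster from the post

total = total_addon_apps(CLUSTERS, ADDONS_PER_CLUSTER)
print(f"add-on applications: {total}")

# Hypothetical instance counts, only to show how the per-instance load moves.
for n in (1, 5, 10, 20):
    print(f"{n:>2} instances: about {apps_per_instance(total, n)} apps each")
```

An even split is an assumption; any real division (by region or by use case) will be lumpier than this.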
To better manage these applications and clusters and reduce the blast radius, we decided to go with multiple ArgoCD instances, each managing a subset of these clusters (we still haven't finalized the number of instances). Each instance will shard its application controller appropriately to handle a large number of clusters and applications.
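As an illustration of what dividing clusters across controller shards means, here is a minimal sketch of a generic hash-and-modulo assignment. This is not ArgoCD's own shard assignment code; the hash function (SHA-256), the cluster names, and the shard count are all assumptions chosen only for the example.

```python
import hashlib
from collections import Counter


def shard_for(cluster_id: str, shards: int) -> int:
    """Map a cluster ID to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def shard_load(cluster_ids, shards: int):
    """Count how many clusters land on each shard."""
    counts = Counter(shard_for(c, shards) for c in cluster_ids)
    return [counts.get(i, 0) for i in range(shards)]


# Hypothetical cluster names and shard count.
clusters = [f"cluster-{i:05d}" for i in range(1000)]
load = shard_load(clusters, 8)
print(load)
```

A scheme like this balances the number of clusters per shard, not the number of applications or the amount of activity per cluster, so it is worth measuring the actual per-shard load rather than relying on the cluster count.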
Approaches
The problem we are running into is deciding what strategy to use to divide the clusters among multiple instances.
First Approach - Region-based ArgoCD instances
Until now we were leaning towards a region-based division, where we set up one or more ArgoCD instances per cloud region (we have a combination of public and private cloud in each region) to manage the clusters in that region. But there seem to be a few issues with that approach.
Has this approach been explored before?
Can the issues described above be solved in any way?
Are there any potential issues you see with this approach?
Second Approach - Use-case-based ArgoCD instances
The second approach we are considering is to have ArgoCD instance(s) for each type of cluster. We have different types, such as dedicated and shared. Under shared, there are multiple use cases and environments. This way we can divide the workloads among multiple ArgoCD instances. Do you see any merits to this approach?
We could have multiple ArgoCD instances to manage a particular use case if the number of clusters and applications becomes too high for a single ArgoCD instance to handle. But this could lead to the same problems with the ApplicationSets mentioned earlier.
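A small sketch of how the use-case-based division with overflow could be planned. Everything here is hypothetical: the `Cluster` type, the kind labels, the instance naming scheme, and the per-instance cluster limit are made up for illustration and would need to come from real capacity testing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    kind: str  # hypothetical label, e.g. "dedicated" or "shared-dev"


def plan_instances(clusters, max_clusters_per_instance: int):
    """Group clusters by kind, then split each group into fixed-size chunks.

    Returns a dict mapping a hypothetical instance name to its cluster names.
    """
    by_kind = defaultdict(list)
    for c in clusters:
        by_kind[c.kind].append(c.name)

    plan = {}
    for kind, names in sorted(by_kind.items()):
        names.sort()  # stable ordering so the plan is reproducible
        for start in range(0, len(names), max_clusters_per_instance):
            index = start // max_clusters_per_instance
            chunk = names[start:start + max_clusters_per_instance]
            plan[f"argocd-{kind}-{index}"] = chunk
    return plan


# Hypothetical inventory: 5 dedicated clusters and 2 shared dev clusters.
inventory = [Cluster(f"ded-{i}", "dedicated") for i in range(5)]
inventory += [Cluster(f"dev-{i}", "shared-dev") for i in range(2)]

for instance, members in plan_instances(inventory, 2).items():
    print(instance, members)
```

Any kind that spills into more than one instance (here, "dedicated") is exactly the case where the ApplicationSet concern above comes back, so a plan like this makes it easy to see how many use cases are affected.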
Are there any other approaches you would recommend for our scenario that we can explore?
Any input would be really helpful for us in making a decision on our ArgoCD scaling effort and choosing the best possible approach.