# Scaling ArgoCD to 3000+ Applications: HA and Multi-Cluster Sharding
"A deep dive into scaling ArgoCD for enterprise workloads, focusing on High Availability, Controller Sharding, and Multi-Cluster management."
As organizations grow, so does their GitOps footprint. Managing a couple dozen applications with ArgoCD is a walk in the park — you install it, point it at your repo, and everything just works. It’s beautiful. You start bragging about GitOps on Reddit. Life is good.
Then you hit 3,000+ applications across dozens of Kubernetes clusters.
Suddenly, reconciliation loops take forever. The UI feels like it’s running on a potato. The application-controller starts getting OOM-killed at 3 AM, and your on-call engineer is questioning their career choices. Your Slack channel fills up with “is ArgoCD down again?” messages, and you start wondering if maybe spreadsheets were fine all along.
Spoiler alert: spreadsheets were never fine. ArgoCD can absolutely handle this scale — you just need to know the right levers to pull. In this post, we’ll go deep into how to properly scale ArgoCD in High Availability (HA) mode, implement Controller Sharding, and tune every knob that matters, all based on the official ArgoCD HA documentation, real-world war stories, and community wisdom.
## 1. Understanding ArgoCD’s Architecture (Before We Break It)
Before we start cranking settings to 11, let’s understand what we’re actually scaling. ArgoCD is composed of several key components, and each one has different scaling characteristics:
| Component | What It Does | Stateful? | Scaling Strategy |
|---|---|---|---|
| argocd-server | API server + Web UI | No | Horizontal (Deployment replicas) |
| argocd-repo-server | Clones repos, renders manifests (Helm/Kustomize) | No | Horizontal (Deployment replicas) |
| argocd-application-controller | The “brain” — reconciles desired vs. actual state | Somewhat | Sharding (StatefulSet replicas) |
| argocd-redis | In-memory cache for manifests and app state | Yes (but throw-away) | HA Sentinel (3 nodes) |
| argocd-dex-server | OIDC/SSO authentication | Yes (in-memory DB) | Cannot scale horizontally |
Here’s the critical thing the official docs emphasize: Argo CD is largely stateless. All data is persisted as Kubernetes objects in etcd. Redis is used purely as a throw-away cache — if Redis dies, it will be rebuilt automatically without loss of service. This is great news for HA because it means we don’t have to deal with complicated state replication.
Important Note: The HA installation will require at least three different nodes due to pod anti-affinity rules in the manifests. Plan your node pool accordingly. Also, IPv6-only clusters are currently not supported for HA.
## 2. Enabling High Availability (HA)
The first step is ditching the default “yolo single-instance” installation and switching to the HA manifests. This is the non-negotiable foundation.
### Installing with HA Manifests
```shell
# The HA manifests — your new best friend
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
```

This gives you:

- Redis HA with a 3-node Sentinel setup
- Multiple replicas of `argocd-server` and `argocd-repo-server`
- Pod anti-affinity rules to spread components across nodes
- Proper resource requests and limits
### Redis HA: The Unsung Hero
Redis is ArgoCD’s caching layer and it’s more important than people think. Every manifest that argocd-repo-server generates gets cached in Redis (for 24 hours by default). Without a healthy Redis, ArgoCD will regenerate manifests from scratch on every reconciliation cycle — and with 3,000 apps, that’s a lot of helm template calls.
The HA setup deploys Redis with a 3-node Sentinel configuration. The Sentinel nodes handle automatic failover if the primary Redis instance dies. A few things to keep in mind:
- Redis is pre-configured to expect exactly three servers/sentinels — don’t try to scale it beyond that without modifying the configuration.
- Redis data is a throw-away cache. If it’s lost, ArgoCD rebuilds it. This might cause a brief spike in repo-server CPU while manifests are regenerated, but no data is permanently lost.
- For extreme scale (5,000+ apps), consider tuning `ARGOCD_APPLICATION_TREE_SHARD_SIZE` (default `0`) to `100`. This splits the application resource tree across multiple Redis keys, reducing the traffic between the controller and Redis.
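Wiring that in might look like the following patch on the controller StatefulSet; the value 100 is just the starting point suggested above, not a universal recommendation:

```yaml
# Sketch: enable application-tree sharding across Redis keys.
# Apply as a strategic-merge patch to the argocd-application-controller StatefulSet.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_APPLICATION_TREE_SHARD_SIZE
              value: "100"  # 0 (the default) keeps the whole tree in one Redis key
```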
### Scaling `argocd-server`
The API server is stateless and is probably the least likely to cause issues. However, to ensure zero downtime during upgrades and to handle concurrent UI/API requests from your team, scale it to 3+ replicas:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-server
          env:
            - name: ARGOCD_API_SERVER_REPLICAS
              value: "3"
```

The `ARGOCD_API_SERVER_REPLICAS` environment variable is important — it’s used internally to divide the concurrent login request limit across replicas.
Pro tip for 3000+ apps: Set `ARGOCD_GRPC_MAX_SIZE_MB` higher than the default 200. When the UI tries to load the full application list, the gRPC response can exceed 200MB. Bumping this to 400 or 500 prevents cryptic “response too large” errors.
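One way to apply that is an env override on the `argocd-server` Deployment. The value 500 here is an example, not an official recommendation:

```yaml
# Sketch: raise the gRPC message size cap on argocd-server.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  template:
    spec:
      containers:
        - name: argocd-server
          env:
            - name: ARGOCD_GRPC_MAX_SIZE_MB
              value: "500"  # default is 200
```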
### Scaling `argocd-repo-server`
This is where a lot of the heavy lifting happens. The repo-server is responsible for:
- Cloning Git repositories and keeping them up to date
- Rendering manifests using Helm, Kustomize, or custom plugins
- Caching the generated manifests in Redis
Each of these can become a bottleneck at scale. Here’s what to tune:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 5 # Start here for 3000+ apps
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            # Allow more time for complex Helm charts
            - name: ARGOCD_EXEC_TIMEOUT
              value: "180s" # Default is 90s
            # Retry transient git failures
            - name: ARGOCD_GIT_ATTEMPTS_COUNT
              value: "3"
```

Key `argocd-repo-server` settings:
| Setting | Default | Recommended (3000+ apps) | Why |
|---|---|---|---|
| `--parallelismlimit` | No limit | 10-20 | Controls concurrent manifest generations. Prevents OOM kills. |
| `ARGOCD_EXEC_TIMEOUT` | 90s | 180s | Complex Helm charts or Kustomize overlays may need more time. |
| `ARGOCD_GIT_ATTEMPTS_COUNT` | 1 | 3 | `git ls-remote` is used frequently; retries prevent transient failures. |
| `--repo-cache-expiration` | 24h | 1h (if needed) | Reduce if charts change without version bumps. Careful: hurts caching. |
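If you’d rather manage the parallelism limit through configuration than container args, the flag has a counterpart key in `argocd-cmd-params-cm`; verify the exact key name against your ArgoCD version’s docs before relying on this sketch:

```yaml
# Sketch: cap concurrent manifest generation via argocd-cmd-params-cm.
# reposerver.parallelism.limit corresponds to the --parallelismlimit flag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  reposerver.parallelism.limit: "15"  # somewhere in the 10-20 range from the table
```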
Disk space warning: The repo-server clones repositories into `/tmp`. If you have many large repositories, the pod will run out of disk space. Mount a PersistentVolume at the repo-server’s `/tmp` directory to avoid this:
```yaml
volumes:
  - name: repo-tmp
    persistentVolumeClaim:
      claimName: argocd-repo-server-tmp
volumeMounts:
  - name: repo-tmp
    mountPath: /tmp
```

## 3. The Core Challenge: Controller Sharding
Alright, here’s where things get real. The argocd-application-controller is the brain of the operation. It’s the component that:
- Watches all your application resources in the cluster
- Fetches rendered manifests from the repo-server
- Compares desired state vs. actual state (the reconciliation loop)
- Triggers syncs when drift is detected
By default, a single controller pod handles all applications across all clusters. For a few dozen apps, this is fine. For 3,000+? That single pod becomes a massive bottleneck. It will eat CPU like candy, require gigabytes of RAM for its in-memory cluster cache, and eventually get OOM-killed. You’ll see Context deadline exceeded errors in the logs as the reconciliation queue overflows.
### How Sharding Works
ArgoCD shards at the cluster level, not the application level. This is an important distinction. If you have 30 clusters and configure 3 shards, each controller replica (shard) handles ~10 clusters — and all the applications on those clusters.
To enable sharding, you increase the number of replicas in the argocd-application-controller StatefulSet and set the ARGOCD_CONTROLLER_REPLICAS environment variable to match:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
```

### Sharding Algorithms: Choosing Your Distribution Strategy
ArgoCD offers three sharding algorithms, configured via `controller.sharding.algorithm` in `argocd-cmd-params-cm` (or the `--sharding-method` flag or the `ARGOCD_CONTROLLER_SHARDING_ALGORITHM` env var):
#### Legacy (Default)
```yaml
data:
  controller.sharding.algorithm: "legacy"
```

Uses a hash function based on the cluster’s UUID. This is the OG algorithm and it sucks at distribution. Seriously. Due to hash collisions, you can end up with one shard managing 60% of your clusters while another sits around twiddling its thumbs. It’s like a load balancer that plays favorites.
Use case: Only if you’re running an ancient ArgoCD version and can’t upgrade.
#### Round-Robin ⭐
```yaml
data:
  controller.sharding.algorithm: "round-robin"
```

Sorts clusters by UUID and assigns them sequentially to shards. Much more uniform distribution. If you have 12 clusters and 3 shards, you get exactly 4 clusters per shard. Beautiful. Elegant. Chef’s kiss.
Caveat (and it’s a real one): If the cluster at rank-0 is removed, it causes a reshuffle of all clusters across shards. This can temporarily impact performance as controllers rebuild their in-memory caches. Not catastrophic, but worth knowing about during cluster decommissioning.
Use case: The best choice for most production environments in 2025/2026.
#### Consistent-Hashing
```yaml
data:
  controller.sharding.algorithm: "consistent-hashing"
```

Uses a consistent hashing with bounded loads algorithm. This provides a good balance between even distribution and minimal reshuffling when shards or clusters are added/removed. The CNOE blog has extensive benchmarks showing encouraging results.
Use case: Ideal if you frequently add/remove clusters and want to minimize disruption.
Note: Both `round-robin` and `consistent-hashing` were introduced as experimental features. They’re now well-tested by the community, but the official docs still carry an “alpha” warning. Don’t let that scare you — plenty of production environments run them successfully.
### Setting Up Sharding (Step by Step)
Here’s the full config you need in your argocd-cmd-params-cm ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: "round-robin"
```

Then scale the StatefulSet:

```shell
kubectl scale statefulset argocd-application-controller --replicas=3 -n argocd
```

And don’t forget to set the replica count in the environment variable (or the controller won’t know about the other shards):

```shell
kubectl set env statefulset/argocd-application-controller \
  ARGOCD_CONTROLLER_REPLICAS=3 -n argocd
```

How many shards do you need? A rough rule of thumb from community experience:
| Applications | Clusters | Recommended Shards |
|---|---|---|
| < 500 | < 10 | 1 (no sharding needed) |
| 500-1500 | 10-30 | 2-3 |
| 1500-3000 | 30-50 | 3-5 |
| 3000+ | 50+ | 5-10 |
These numbers assume roughly equal distribution of applications across clusters. If you have one mega-cluster with 2,000 apps and 30 small ones, see the next section.
## 4. Advanced Cluster Assignment: When Auto-Sharding Isn’t Enough
Automatic sharding distributes clusters evenly, but clusters aren’t equal. If you have a production cluster with 1,500 applications and a dev cluster with 20, round-robin will happily assign them to the same shard. That shard is now doing 75x more work than the one handling the dev cluster. RIP.
### Manual Shard Assignment
You can explicitly pin a cluster to a specific shard by setting the shard field in the cluster’s Secret:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: production-megacluster
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  shard: "1" # Pin to shard 1
  name: production-megacluster.example.com
  server: https://production-megacluster.example.com
  config: |
    {
      "bearerToken": "<authentication token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64 encoded certificate>"
      }
    }
```

Now shard 1 is dedicated to the big production cluster, and the other shards handle the remaining, lighter clusters.
### The “Fake Clusters” Trick for Single-Cluster Sharding
Here’s a trick from the community that’s clever as hell. What if you only have one Kubernetes cluster, but it has thousands of applications? Sharding won’t help because there’s only one cluster to shard.
The workaround? Create multiple Kubernetes ExternalName services that point back to the same API server:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-shard-1
  namespace: default
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-shard-2
  namespace: default
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
```

Then register each ExternalName service as a separate “cluster” in ArgoCD. Boom — you’ve tricked ArgoCD into thinking it has multiple clusters, and sharding distributes applications across controller replicas.
Is this a hack? Absolutely. Does it work? Also absolutely. The community recommends this for single-cluster setups with thousands of apps. Use RBAC to restrict each “fake cluster” to specific namespaces to get a clean separation.
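To finish the trick, each ExternalName Service needs its own cluster Secret. Here’s a sketch for shard 1; the Secret name, namespace list, token, and `serverName` value are illustrative placeholders, not values from the original setup:

```yaml
# Sketch: register the kubernetes-shard-1 ExternalName service as its own "cluster".
apiVersion: v1
kind: Secret
metadata:
  name: cluster-kubernetes-shard-1  # hypothetical name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: kubernetes-shard-1
  server: https://kubernetes-shard-1.default.svc
  # Optional: restrict this "cluster" to a namespace subset for clean separation
  namespaces: "team-a,team-b"
  config: |
    {
      "bearerToken": "<service account token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64 encoded CA certificate>",
        "serverName": "kubernetes.default.svc"
      }
    }
```

The `serverName` override keeps TLS verification happy even though the URL points at the shard alias rather than the API server’s real hostname.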
## 5. Dynamic Cluster Distribution (ArgoCD 2.9+)
One of the most annoying things about the traditional sharding model is that the controller runs as a StatefulSet. Adding or removing shards requires manually setting ARGOCD_CONTROLLER_REPLICAS and restarting all controller pods. That’s downtime. That’s scary. That’s “I hope nothing breaks at 2 AM” territory.
Starting with ArgoCD 2.9, there’s an alpha feature called Dynamic Cluster Distribution that changes the game:
- The application controller can run as a Deployment instead of a StatefulSet
- Clusters are dynamically rebalanced across shards when replicas are scaled up or down
- No more manual `ARGOCD_CONTROLLER_REPLICAS` environment variable juggling
- A ConfigMap called `argocd-app-controller-shard-cm` tracks the mapping between controller pods and shard numbers, including a heartbeat mechanism
### Enabling Dynamic Cluster Distribution
To enable this feature, you need to set the ARGOCD_ENABLE_DYNAMIC_CLUSTER_DISTRIBUTION environment variable to true and convert the controller from a StatefulSet to a Deployment (Kustomize overlays are available in the ArgoCD repo for this).
```yaml
env:
  - name: ARGOCD_ENABLE_DYNAMIC_CLUSTER_DISTRIBUTION
    value: "true"
```

Warning: This is still an alpha feature. It works well in many production environments, but test thoroughly in staging first. The ArgoCD team is actively collecting community feedback to move it to production-ready status.
## 6. Performance Tuning for 3000+ Applications
Sharding gets you most of the way there, but squeezing out peak performance requires tuning the internal processing pipelines.
### Reconciliation Processors
Each controller replica uses two separate queues:
- Status processors — Handle application reconciliation (checking desired vs. actual state). Default: 20
- Operation processors — Handle sync operations. Default: 10
For 3,000+ apps, you need to crank these up significantly. The official docs suggest 50 status processors and 25 operation processors for 1,000 applications. Extrapolating:
| Applications (per shard) | `--status-processors` | `--operation-processors` |
|---|---|---|
| < 500 | 20 (default) | 10 (default) |
| 500-1000 | 50 | 25 |
| 1000-2000 | 80 | 40 |
| 2000+ | 100+ | 50+ |
Configure these in `argocd-cmd-params-cm`:
```yaml
data:
  controller.status.processors: "50"
  controller.operation.processors: "25"
```

Heads up: More processors = more CPU and memory usage. Monitor your controller pods with Prometheus and adjust accordingly. There’s a sweet spot — going too high just wastes resources without improving throughput.
### Reconciliation Timeout and Jitter
The controller polls Git every 3 minutes by default. You can change this via `timeout.reconciliation` in `argocd-cm`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  timeout.reconciliation: "300s" # 5 minutes
  timeout.reconciliation.jitter: "60s" # up to 60s of random jitter
```

Why jitter matters: Without jitter, all 3,000 applications hit the reconciliation timeout at roughly the same time, creating a thundering herd problem. The repo-server gets slammed with 3,000 manifest generation requests simultaneously. Jitter spreads these out over a window (timeout + 0 to jitter seconds), smoothing the load curve.
The `ARGOCD_RECONCILIATION_JITTER` environment variable controls this too (it defaults to 60 seconds). For large installations, bump it to 120 or 180 to spread things out even more.
### Kubernetes API Server Interactions
The controller maintains a lightweight in-memory cache of Kubernetes resources using the Watch API. This avoids querying the K8s API during reconciliation and is a huge performance win. However, there are some gotchas:
- `ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT`: The controller updates cluster info every 10 seconds. If your cluster network has latency issues, increase this timeout (value in seconds).
- `ARGOCD_CLUSTER_CACHE_LIST_PAGE_BUFFER_SIZE`: For clusters with a massive number of resources (think: Azure clusters with thousands of CRDs from `azure-service-operator`), the initial cache sync might time out. Increase this buffer size so the controller can pre-fetch more pages from the K8s API before the etcd compaction interval expires.
- `ARGOCD_CLUSTER_CACHE_BATCH_EVENTS_PROCESSING`: Enabled by default (`true`). Collects Kubernetes watch events and processes them in batches rather than one-by-one. Leave this on.
- `ARGOCD_CLUSTER_CACHE_EVENTS_PROCESSING_INTERVAL`: Controls the batch interval (default `100ms`). If the controller is overwhelmed by events from a very active cluster, increase this to `200ms` or `500ms`.
- The controller caches only the preferred versions of resources. If your Git manifests use a non-preferred API version, the controller falls back to direct K8s API queries (slow!). Always use the preferred API version in your manifests to avoid this.
### The GOMEMLIMIT Trick
This one comes straight from real-world battle scars. ArgoCD is written in Go, and Go’s garbage collector can be… overly relaxed. During controller restarts or full reconciliations, memory usage spikes dramatically as thousands of objects are allocated. If the spike exceeds the container’s memory limit, the OOM killer strikes.
The fix? Set the GOMEMLIMIT environment variable to ~90% of your container’s memory limit:
```yaml
env:
  - name: GOMEMLIMIT
    value: "1800MiB" # If your limit is 2Gi
```

This tells Go’s runtime to be more aggressive about garbage collection as memory approaches the limit. You’ll trade a tiny bit of CPU for dramatically fewer OOM kills. It’s one of those “why isn’t this the default” settings.
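Since `GOMEMLIMIT` has to track the container’s memory limit, it’s worth keeping both values in the same patch so they can’t drift apart. A sketch, using the 2Gi/1800MiB example numbers from above:

```yaml
# Sketch: keep GOMEMLIMIT pinned at ~90% of the container memory limit.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            limits:
              memory: 2Gi
          env:
            - name: GOMEMLIMIT
              value: "1800MiB"  # ~90% of the 2Gi limit above
```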
### Repo Server Timeout
When the application controller asks the repo-server to generate manifests, there’s a timeout (--repo-server-timeout-seconds, default 60s). If your Helm charts are complex (looking at you, kube-prometheus-stack), manifest generation can exceed this.
If you see Context deadline exceeded errors in the controller logs, bump this up:
```yaml
data:
  controller.repo.server.timeout.seconds: "180"
```

## 7. Monorepo Scaling Considerations
Ah, the monorepo. Everyone loves it until ArgoCD has to deal with it.
The repo-server maintains one clone of each repository. If manifest generation modifies files in the local clone (some tools do this), only one manifest generation can run at a time for that repo. With 50+ applications in a monorepo, this serialization kills performance.
### Enable Concurrent Processing
The repo-server checks whether manifest generation has side effects and parallelizes when safe. Known scenarios and workarounds:
- Multiple Helm apps in the same directory: Starting with ArgoCD v3.0, Helm manifest generation is parallel by default. If you’re on an older version, add a `.argocd-allow-concurrency` file to the chart directory.
- Custom plugin apps: Avoid writing temporary files during generation. Add `.argocd-allow-concurrency` to the app directory, or use the sidecar plugin option which processes each application with a temporary copy of the repository.
- Kustomize with parameter overrides: Sorry, no workaround for this one. If you’re using Kustomize parameter overrides on multiple apps in the same repo, they’ll be serialized. Consider restructuring to avoid overrides.
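Opting a directory in (pre-v3.0) is just an empty marker file committed to the repo. A quick sketch, where `charts/my-app` is a placeholder path:

```shell
# Opt a chart directory in to concurrent manifest generation (pre-v3.0 ArgoCD).
# "charts/my-app" is a placeholder - use your actual chart/app directory.
mkdir -p charts/my-app
touch charts/my-app/.argocd-allow-concurrency

# Remember to commit the marker so the repo-server actually sees it:
#   git add charts/my-app/.argocd-allow-concurrency
#   git commit -m "Allow concurrent manifest generation for my-app"
```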
### Manifest Paths Annotation (This Is Huge!)
By default, ArgoCD uses the repository commit SHA as the cache key. A new commit to any file in the repo invalidates the cache for all applications in that repo. If your monorepo has 200 apps and someone changes a README, all 200 apps regenerate their manifests. Wasteful.
The argocd.argoproj.io/manifest-generate-paths annotation tells ArgoCD which paths actually matter for each application:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  annotations:
    # Only regenerate if files in these paths changed
    argocd.argoproj.io/manifest-generate-paths: ".;../shared"
spec:
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: HEAD
    path: services/payments
```

Now, if someone commits a change to `services/auth/`, the payments-service application won’t regenerate its manifests. This is a massive performance win for monorepos.
You can use:

- Relative paths (`.` = app’s own directory, `../shared` = sibling directory)
- Absolute paths (`/shared/base-configs`)
- Glob patterns (`/shared/*-secret.yaml`)
- Multiple paths separated by semicolons (`;`)
Note: Since ArgoCD v2.11, this annotation works without configuring webhooks. You can rely on it standalone for all manifest generation optimization.
### Shallow Cloning
For large repositories with extensive Git history, cloning can be painfully slow. Enable shallow cloning with depth: "1" in the repository secret:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-monorepo
  labels:
    argocd.argoproj.io/secret-type: repository
type: Opaque
stringData:
  depth: "1"
  type: "git"
  url: "https://github.com/myorg/k8s-manifests.git"
```

Or via CLI:
```shell
argocd repo add https://github.com/myorg/k8s-manifests.git --depth 1
```

This clones only the most recent commit instead of the full history. For a monorepo with 10,000+ commits, this can reduce clone time from minutes to seconds.
## 8. Rate Limiting and Workqueue Tuning
When things go wrong (and they will), applications can get into sync loops. A single misbehaving app can cause the reconciliation queue to spiral, impacting all applications. ArgoCD provides rate limiting controls to prevent this.
### Global Rate Limits
A simple bucket-based rate limiter that controls how many items can be queued per second:
```yaml
env:
  - name: WORKQUEUE_BUCKET_SIZE
    value: "500" # Max items in a single burst (default: 500)
  - name: WORKQUEUE_BUCKET_QPS
    value: "50" # Items queued per second (default: unlimited!)
```

The default `WORKQUEUE_BUCKET_QPS` is MaxFloat64 — which means it’s effectively disabled. For large installations, set this to a sane value (e.g., 50-100) to prevent a cascade of reconciliations from overwhelming the controller.
### Per-Item Rate Limits
This limits how many times a specific application can be re-queued in a short period. It uses exponential backoff:
```yaml
env:
  # Enable exponential backoff (disabled by default)
  - name: WORKQUEUE_FAILURE_COOLDOWN_NS
    value: "10000000000" # 10 seconds in nanoseconds
  - name: WORKQUEUE_BASE_DELAY_NS
    value: "1000000" # 1ms initial backoff
  - name: WORKQUEUE_MAX_DELAY_NS
    value: "10000000000" # 10s max backoff
  - name: WORKQUEUE_BACKOFF_FACTOR
    value: "2.0" # Double the backoff each retry
```

The formula: if an application keeps getting re-queued before the cooldown period expires, the backoff increases exponentially. If the cooldown period passes without the item being re-queued, the backoff resets. This is incredibly useful for preventing a broken application from monopolizing controller resources.
## 9. Monitoring, Profiling, and Debugging
You can’t optimize what you can’t measure. ArgoCD exposes Prometheus metrics on each component’s metrics port.
### Key Metrics to Watch
| Metric | What to Look For |
|---|---|
| `argocd_app_reconcile_duration_seconds` | If this keeps climbing, your controller is struggling |
| `argocd_app_sync_total` | Rate of sync operations |
| `argocd_cluster_api_resource_objects` | Number of tracked Kubernetes objects per cluster |
| `argocd_redis_request_duration` | Redis latency — spikes indicate cache pressure |
| `argocd_git_request_duration_seconds` | Git operation latency |
| `argocd_repo_pending_request_total` | Queued manifest generation requests |
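As a starting point, the table above can be turned into alerting rules. The thresholds and alert names below are illustrative; double-check the metric names against the `/metrics` output of your ArgoCD version before deploying:

```yaml
# Sketch: illustrative Prometheus alerting rules built from the metrics above.
groups:
  - name: argocd-scale
    rules:
      - alert: ArgoCDReconciliationSlow
        expr: |
          histogram_quantile(0.95,
            sum(rate(argocd_app_reconcile_duration_seconds_bucket[5m])) by (le)
          ) > 60
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p95 app reconciliation time has been above 60s for 15m"
      - alert: ArgoCDRepoServerBacklog
        expr: sum(argocd_repo_pending_request_total) > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Manifest generation requests are queueing up on the repo-server"
```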
### CPU/Memory Profiling
ArgoCD optionally exposes Go’s pprof profiling endpoint. Enable it in argocd-cmd-params-cm:
```yaml
data:
  controller.profile.enabled: "true"
```

Then profile with:
```shell
kubectl port-forward svc/argocd-metrics 8082:8082
go tool pprof http://localhost:8082/debug/pprof/heap     # Memory profile
go tool pprof http://localhost:8082/debug/pprof/profile  # CPU profile (30s)
```

This is invaluable when hunting down why a specific controller shard is using 4GB of RAM. The heap profile will show you exactly which objects are eating memory.
### gRPC Performance Metrics
For deep troubleshooting, enable gRPC time histograms:
```yaml
env:
  - name: ARGOCD_ENABLE_GRPC_TIME_HISTOGRAM
    value: "true"
```

Warning: These metrics are expensive to both collect and store. Only enable them when actively debugging performance issues, then disable when done.
### The Nuclear Option: Rolling Restarts
Sometimes, after a particularly bad incident (network outage, API server brownout, etc.), the reconciliation queue gets hopelessly backed up. The fastest fix? Roll the controller pods:
```shell
kubectl rollout restart statefulset argocd-application-controller -n argocd
```

This clears the queue and rebuilds the in-memory cache from scratch. It’s not elegant, but it works. Think of it as the “turn it off and on again” of GitOps.
## 10. Summary: The Scaling Checklist
Scaling ArgoCD is a journey of methodically removing bottlenecks. Here’s the complete playbook for a 3,000+ application environment:
### Foundation
- Deploy using HA manifests (3+ nodes, Redis Sentinel)
- Scale `argocd-server` to 3+ replicas with `ARGOCD_API_SERVER_REPLICAS`
- Scale `argocd-repo-server` to 5+ replicas with `--parallelismlimit` 10-20
- Increase `ARGOCD_GRPC_MAX_SIZE_MB` to 400+
### Sharding
- Enable round-robin or consistent-hashing sharding
- Scale `argocd-application-controller` to the appropriate replica count
- Use manual shard assignment for outlier (large) clusters
- Consider dynamic cluster distribution (ArgoCD 2.9+) for easier scaling
### Performance Tuning
- Increase `status-processors` and `operation-processors`
- Enable reconciliation jitter to prevent thundering herd
- Set `GOMEMLIMIT` to 90% of the container memory limit
- Increase `repo-server-timeout-seconds` for complex charts
### Monorepo Optimization
- Add `manifest-generate-paths` annotations to all applications
- Enable shallow cloning for large repos
- Add `.argocd-allow-concurrency` files where applicable
### Monitoring
- Set up Prometheus + Grafana dashboards for ArgoCD metrics
- Enable CPU/memory profiling for troubleshooting sessions
- Configure rate limiting to prevent cascading failures
By following this playbook, your ArgoCD installation will handle 3,000+ applications like a champ. Your on-call engineers will sleep better, your developers will get faster feedback loops, and you can go back to bragging about GitOps on Reddit — this time with the receipts to back it up.
Have questions or war stories of your own? Find me on LinkedIn or reach out — I love talking about this stuff.