Tags: DevOps, Kubernetes, GitOps

Scaling ArgoCD to 3000+ Applications: HA and Multi-Cluster Sharding

"A deep dive into scaling ArgoCD for enterprise workloads, focusing on High Availability, Controller Sharding, and Multi-Cluster management."

As organizations grow, so does their GitOps footprint. Managing a couple dozen applications with ArgoCD is a walk in the park — you install it, point it at your repo, and everything just works. It’s beautiful. You start bragging about GitOps on Reddit. Life is good.

Then you hit 3,000+ applications across dozens of Kubernetes clusters.

Suddenly, reconciliation loops take forever. The UI feels like it’s running on a potato. The application-controller starts getting OOM-killed at 3 AM, and your on-call engineer is questioning their career choices. Your Slack channel fills up with “is ArgoCD down again?” messages, and you start wondering if maybe spreadsheets were fine all along.

Spoiler alert: spreadsheets were never fine. ArgoCD can absolutely handle this scale — you just need to know the right levers to pull. In this post, we’ll go deep into how to properly scale ArgoCD in High Availability (HA) mode, implement Controller Sharding, and tune every knob that matters, all based on the official ArgoCD HA documentation, real-world war stories, and community wisdom.


1. Understanding ArgoCD’s Architecture (Before We Break It)

Before we start cranking settings to 11, let’s understand what we’re actually scaling. ArgoCD is composed of several key components, and each one has different scaling characteristics:

| Component | What It Does | Stateful? | Scaling Strategy |
|---|---|---|---|
| argocd-server | API server + Web UI | No | Horizontal (Deployment replicas) |
| argocd-repo-server | Clones repos, renders manifests (Helm/Kustomize) | No | Horizontal (Deployment replicas) |
| argocd-application-controller | The “brain” — reconciles desired vs. actual state | Somewhat | Sharding (StatefulSet replicas) |
| argocd-redis | In-memory cache for manifests and app state | Yes (but throw-away) | HA Sentinel (3 nodes) |
| argocd-dex-server | OIDC/SSO authentication | Yes (in-memory DB) | Cannot scale horizontally |

Here’s the critical thing the official docs emphasize: Argo CD is largely stateless. All data is persisted as Kubernetes objects in etcd. Redis is used purely as a throw-away cache — if Redis dies, it will be rebuilt automatically without loss of service. This is great news for HA because it means we don’t have to deal with complicated state replication.

Important Note: The HA installation will require at least three different nodes due to pod anti-affinity rules in the manifests. Plan your node pool accordingly. Also, IPv6-only clusters are currently not supported for HA.


2. Enabling High Availability (HA)

The first step is ditching the default “yolo single-instance” installation and switching to the HA manifests. This is the non-negotiable foundation.

Installing with HA Manifests

```bash
# The HA manifests — your new best friend
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
```

This gives you:

  • Redis HA with 3-node Sentinel setup
  • Multiple replicas of argocd-server and argocd-repo-server
  • Pod anti-affinity rules to spread components across nodes
  • Proper resource requests and limits

Redis HA: The Unsung Hero

Redis is ArgoCD’s caching layer and it’s more important than people think. Every manifest that argocd-repo-server generates gets cached in Redis (for 24 hours by default). Without a healthy Redis, ArgoCD will regenerate manifests from scratch on every reconciliation cycle — and with 3,000 apps, that’s a lot of helm template calls.

The HA setup deploys Redis with a 3-node Sentinel configuration. The Sentinel nodes handle automatic failover if the primary Redis instance dies. A few things to keep in mind:

  • Redis is pre-configured to expect exactly three servers/sentinels — don’t try to scale it beyond that without modifying the configuration.
  • Redis data is a throw-away cache. If it’s lost, ArgoCD rebuilds it. This might cause a brief spike in repo-server CPU while manifests are regenerated, but no data is permanently lost.
  • For extreme scale (5,000+ apps), consider tuning ARGOCD_APPLICATION_TREE_SHARD_SIZE (default 0) to 100. This splits the application resource tree across multiple Redis keys, reducing the traffic between the controller and Redis.
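As a minimal sketch, that tree-shard tuning is an environment variable on the application controller (the variable name comes from the paragraph above; apply it to the argocd-application-controller StatefulSet):

```yaml
# Split each application's resource tree across multiple Redis keys
# (100 resources per key) to shrink controller <-> Redis payloads.
env:
  - name: ARGOCD_APPLICATION_TREE_SHARD_SIZE
    value: "100"
```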

Scaling argocd-server

The API server is stateless and is probably the least likely to cause issues. However, to ensure zero downtime during upgrades and to handle concurrent UI/API requests from your team, scale it to 3+ replicas:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-server
          env:
            - name: ARGOCD_API_SERVER_REPLICAS
              value: "3"
```

The ARGOCD_API_SERVER_REPLICAS environment variable is important — it’s used internally to divide the concurrent login request limit across replicas.

Pro tip for 3000+ apps: Set ARGOCD_GRPC_MAX_SIZE_MB higher than the default 200. When the UI tries to load the full application list, the gRPC response can exceed 200MB. Bumping this to 400 or 500 prevents cryptic “response too large” errors.
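Patched via the Deployment env, that might look like this (a sketch on argocd-server; the variable name is the one from the tip above):

```yaml
# Raise the gRPC message ceiling so the full application
# list can load in a single response.
env:
  - name: ARGOCD_GRPC_MAX_SIZE_MB
    value: "500"
```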

Scaling argocd-repo-server

This is where a lot of the heavy lifting happens. The repo-server is responsible for:

  1. Cloning Git repositories and keeping them up to date
  2. Rendering manifests using Helm, Kustomize, or custom plugins
  3. Caching the generated manifests in Redis

Each of these can become a bottleneck at scale. Here’s what to tune:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 5  # Start here for 3000+ apps
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            # Allow more time for complex Helm charts
            - name: ARGOCD_EXEC_TIMEOUT
              value: "180s"  # Default is 90s
            # Retry transient git failures
            - name: ARGOCD_GIT_ATTEMPTS_COUNT
              value: "3"
```

Key argocd-repo-server settings:

| Setting | Default | Recommended (3000+ apps) | Why |
|---|---|---|---|
| --parallelismlimit | No limit | 10-20 | Controls concurrent manifest generations. Prevents OOM kills. |
| ARGOCD_EXEC_TIMEOUT | 90s | 180s | Complex Helm charts or Kustomize overlays may need more time. |
| ARGOCD_GIT_ATTEMPTS_COUNT | 1 | 3 | git ls-remote is used frequently; retries prevent transient failures. |
| --repo-cache-expiration | 24h | 1h (if needed) | Reduce if charts change without version bumps. Careful: hurts caching. |
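If you manage flags through the argocd-cmd-params-cm ConfigMap rather than container args, the parallelism limit can be set there. A sketch (the key name follows the upstream ConfigMap reference; verify it against your ArgoCD version):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Maps to the repo-server's --parallelismlimit flag
  reposerver.parallelism.limit: "20"
```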

Disk space warning: The repo-server clones repositories into /tmp. If you have many large repositories, the pod will run out of disk space. Mount a PersistentVolume to the repo-server’s /tmp directory to avoid this:

```yaml
volumes:
  - name: repo-tmp
    persistentVolumeClaim:
      claimName: argocd-repo-server-tmp
volumeMounts:
  - name: repo-tmp
    mountPath: /tmp
```
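The snippet above assumes a claim named argocd-repo-server-tmp already exists. A minimal PVC to back it might look like this (size and storage class are placeholders to tune for your repos — and note that a ReadWriteOnce volume can only serve one repo-server replica, so with multiple replicas you would want one volume per pod or ReadWriteMany storage):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: argocd-repo-server-tmp
  namespace: argocd
spec:
  accessModes:
    - ReadWriteOnce  # one replica only; use RWX storage for many replicas
  resources:
    requests:
      storage: 20Gi  # size for your largest repos plus headroom
```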

3. The Core Challenge: Controller Sharding

Alright, here’s where things get real. The argocd-application-controller is the brain of the operation. It’s the component that:

  • Watches all your application resources in the cluster
  • Fetches rendered manifests from the repo-server
  • Compares desired state vs. actual state (the reconciliation loop)
  • Triggers syncs when drift is detected

By default, a single controller pod handles all applications across all clusters. For a few dozen apps, this is fine. For 3,000+? That single pod becomes a massive bottleneck. It will eat CPU like candy, require gigabytes of RAM for its in-memory cluster cache, and eventually get OOM-killed. You’ll see Context deadline exceeded errors in the logs as the reconciliation queue overflows.
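Before reaching for sharding, it's worth watching the controller's actual resource usage over time (this assumes the standard app.kubernetes.io/name label from the upstream manifests and a metrics-server in the cluster):

```bash
# Sustained memory growth toward the container limit usually
# precedes the 3 AM OOM kill described above.
kubectl top pods -n argocd \
  -l app.kubernetes.io/name=argocd-application-controller
```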

How Sharding Works

ArgoCD shards at the cluster level, not the application level. This is an important distinction. If you have 30 clusters and configure 3 shards, each controller replica (shard) handles ~10 clusters — and all the applications on those clusters.

To enable sharding, you increase the number of replicas in the argocd-application-controller StatefulSet and set the ARGOCD_CONTROLLER_REPLICAS environment variable to match:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"
```

Sharding Algorithms: Choosing Your Distribution Strategy

ArgoCD offers three sharding algorithms, configured via controller.sharding.algorithm in argocd-cmd-params-cm (or the --sharding-method flag or ARGOCD_CONTROLLER_SHARDING_ALGORITHM env var):

Legacy (Default)

```yaml
data:
  controller.sharding.algorithm: "legacy"
```

Uses a hash function based on the cluster’s UUID. This is the OG algorithm and it sucks at distribution. Seriously. Due to hash collisions, you can end up with one shard managing 60% of your clusters while another sits around twiddling its thumbs. It’s like a load balancer that plays favorites.

Use case: Only if you’re running an ancient ArgoCD version and can’t upgrade.

Round-Robin ⭐

```yaml
data:
  controller.sharding.algorithm: "round-robin"
```

Sorts clusters by UUID and assigns them sequentially to shards. Much more uniform distribution. If you have 12 clusters and 3 shards, you get exactly 4 clusters per shard. Beautiful. Elegant. Chef’s kiss.

Caveat (and it’s a real one): If the cluster at rank-0 is removed, it causes a reshuffle of all clusters across shards. This can temporarily impact performance as controllers rebuild their in-memory caches. Not catastrophic, but worth knowing about during cluster decommissioning.

Use case: The best choice for most production environments in 2025/2026.

Consistent-Hashing

```yaml
data:
  controller.sharding.algorithm: "consistent-hashing"
```

Uses a “consistent hashing with bounded loads” algorithm. This provides a good balance between even distribution and minimal reshuffling when shards or clusters are added or removed. The CNOE blog has extensive benchmarks showing encouraging results.

Use case: Ideal if you frequently add/remove clusters and want to minimize disruption.

Note: Both round-robin and consistent-hashing were introduced as experimental features. They’re now well-tested by the community, but the official docs still carry an “alpha” warning. Don’t let that scare you — plenty of production environments run them successfully.

Setting Up Sharding (Step by Step)

Here’s the full config you need in your argocd-cmd-params-cm ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  controller.sharding.algorithm: "round-robin"
```

Then scale the StatefulSet:

```bash
kubectl scale statefulset argocd-application-controller --replicas=3 -n argocd
```

And don’t forget to set the replica count in the environment variable (or the controller won’t know about the other shards):

```bash
kubectl set env statefulset/argocd-application-controller \
  ARGOCD_CONTROLLER_REPLICAS=3 -n argocd
```
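To sanity-check the resulting distribution, recent argocd CLI versions ship an admin subcommand for exactly this (flags may differ across versions — treat this as a sketch and check your CLI's help output):

```bash
# Show which shard manages each registered cluster
argocd admin cluster shards --replicas 3
```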

How many shards do you need? A rough rule of thumb from community experience:

| Applications | Clusters | Recommended Shards |
|---|---|---|
| < 500 | < 10 | 1 (no sharding needed) |
| 500-1500 | 10-30 | 2-3 |
| 1500-3000 | 30-50 | 3-5 |
| 3000+ | 50+ | 5-10 |

These numbers assume roughly equal distribution of applications across clusters. If you have one mega-cluster with 2,000 apps and 30 small ones, see the next section.


4. Advanced Cluster Assignment: When Auto-Sharding Isn’t Enough

Automatic sharding distributes clusters evenly, but clusters aren’t equal. If you have a production cluster with 1,500 applications and a dev cluster with 20, round-robin will happily assign them to the same shard. That shard is now doing 75x more work than the one handling the dev cluster. RIP.

Manual Shard Assignment

You can explicitly pin a cluster to a specific shard by setting the shard field in the cluster’s Secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: production-megacluster
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  shard: "1"  # Pin to shard 1
  name: production-megacluster.example.com
  server: https://production-megacluster.example.com
  config: |
    {
      "bearerToken": "<authentication token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64 encoded certificate>"
      }
    }
```

Now shard 1 is dedicated to the big production cluster, and the other shards handle the remaining, lighter clusters.

The “Fake Clusters” Trick for Single-Cluster Sharding

Here’s a trick from the community that’s clever as hell. What if you only have one Kubernetes cluster, but it has thousands of applications? Sharding won’t help because there’s only one cluster to shard.

The workaround? Create multiple Kubernetes ExternalName services that point back to the same API server:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-shard-1
  namespace: default
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-shard-2
  namespace: default
spec:
  type: ExternalName
  externalName: kubernetes.default.svc.cluster.local
```

Then register each ExternalName service as a separate “cluster” in ArgoCD. Boom — you’ve tricked ArgoCD into thinking it has multiple clusters, and sharding distributes applications across controller replicas.

Is this a hack? Absolutely. Does it work? Also absolutely. The community recommends this for single-cluster setups with thousands of apps. Use RBAC to restrict each “fake cluster” to specific namespaces to get a clean separation.
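Registering one of those "fake clusters" might look roughly like this (a sketch — the names and namespace list are hypothetical, and you'd supply real credentials instead of the insecure TLS shortcut; the namespaces field is how declarative cluster secrets scope a cluster to specific namespaces):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: kubernetes-shard-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: in-cluster-shard-1
  # Points at the ExternalName service, which resolves back to the
  # same API server — ArgoCD treats it as a distinct cluster.
  server: https://kubernetes-shard-1.default.svc
  # Restrict this "cluster" to a namespace slice for clean separation
  namespaces: team-a,team-b
  config: |
    {
      "tlsClientConfig": {
        "insecure": true
      }
    }
```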


5. Dynamic Cluster Distribution (ArgoCD 2.9+)

One of the most annoying things about the traditional sharding model is that the controller runs as a StatefulSet. Adding or removing shards requires manually setting ARGOCD_CONTROLLER_REPLICAS and restarting all controller pods. That’s downtime. That’s scary. That’s “I hope nothing breaks at 2 AM” territory.

Starting with ArgoCD 2.9, there’s an alpha feature called Dynamic Cluster Distribution that changes the game:

  • The application controller can run as a Deployment instead of a StatefulSet
  • Clusters are dynamically rebalanced across shards when replicas are scaled up or down
  • No more manual ARGOCD_CONTROLLER_REPLICAS environment variable juggling
  • A ConfigMap called argocd-app-controller-shard-cm tracks the mapping between controller pods and shard numbers, including a heartbeat mechanism

Enabling Dynamic Cluster Distribution

To enable this feature, you need to set the ARGOCD_ENABLE_DYNAMIC_CLUSTER_DISTRIBUTION environment variable to true and convert the controller from a StatefulSet to a Deployment (Kustomize overlays are available in the ArgoCD repo for this).

```yaml
env:
  - name: ARGOCD_ENABLE_DYNAMIC_CLUSTER_DISTRIBUTION
    value: "true"
```

Warning: This is still an alpha feature. It works well in many production environments, but test thoroughly in staging first. The ArgoCD team is actively collecting community feedback to move it to production-ready status.


6. Performance Tuning for 3000+ Applications

Sharding gets you most of the way there, but squeezing out peak performance requires tuning the internal processing pipelines.

Reconciliation Processors Link to heading

Each controller replica uses two separate queues:

  1. Status processors — Handle application reconciliation (checking desired vs. actual state). Default: 20
  2. Operation processors — Handle sync operations. Default: 10

For 3,000+ apps, you need to crank these up significantly. The official docs suggest 50 status processors and 25 operation processors for 1,000 applications. Extrapolating:

| Applications (per shard) | --status-processors | --operation-processors |
|---|---|---|
| < 500 | 20 (default) | 10 (default) |
| 500-1000 | 50 | 25 |
| 1000-2000 | 80 | 40 |
| 2000+ | 100+ | 50+ |

Configure these in argocd-cmd-params-cm:

```yaml
data:
  controller.status.processors: "50"
  controller.operation.processors: "25"
```

Heads up: More processors = more CPU and memory usage. Monitor your controller pods with Prometheus and adjust accordingly. There’s a sweet spot — going too high just wastes resources without improving throughput.

Reconciliation Timeout and Jitter

The controller polls Git every 3 minutes by default. You can change this via timeout.reconciliation in argocd-cm:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  timeout.reconciliation: "300s"        # 5 minutes
  timeout.reconciliation.jitter: "60s"  # up to 60s of random jitter
```

Why jitter matters: Without jitter, all 3,000 applications hit the reconciliation timeout at roughly the same time, creating a thundering herd problem. The repo-server gets slammed with 3,000 manifest generation requests simultaneously. Jitter spreads these out over a window (timeout + 0 to jitter seconds), smoothing the load curve.

The ARGOCD_RECONCILIATION_JITTER environment variable controls this too (defaults to 60 seconds). For large installations, bump it to 120 or 180 to spread things out even more.

Kubernetes API Server Interactions

The controller maintains a lightweight in-memory cache of Kubernetes resources using the Watch API. This avoids querying the K8s API during reconciliation and is a huge performance win. However, there are some gotchas:

  • ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT: The controller updates cluster info every 10 seconds. If your cluster network has latency issues, increase this timeout (value in seconds).

  • ARGOCD_CLUSTER_CACHE_LIST_PAGE_BUFFER_SIZE: For clusters with a massive number of resources (think: Azure clusters with thousands of CRDs from azure-service-operator), the initial cache sync might time out. Increase this buffer size so the controller can pre-fetch more pages from the K8s API before the etcd compaction interval expires.

  • ARGOCD_CLUSTER_CACHE_BATCH_EVENTS_PROCESSING: Enabled by default (true). Collects Kubernetes watch events and processes them in batches rather than one-by-one. Leave this on.

  • ARGOCD_CLUSTER_CACHE_EVENTS_PROCESSING_INTERVAL: Controls the batch interval (default 100ms). If the controller is overwhelmed by events from a very active cluster, increase this to 200ms or 500ms.

  • The controller caches only the preferred versions of resources. If your Git manifests use a non-preferred API version, the controller falls back to direct K8s API queries (slow!). Always use the preferred API version in your manifests to avoid this.
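Collected in one place, a controller env block touching those knobs might look like this (values are illustrative starting points from the bullets above, not defaults — and the value formats should be checked against your ArgoCD version's docs):

```yaml
env:
  # Allow slower cluster-info updates on high-latency networks (seconds)
  - name: ARGO_CD_UPDATE_CLUSTER_INFO_TIMEOUT
    value: "30"
  # Pre-fetch more list pages during the initial cluster cache sync
  - name: ARGOCD_CLUSTER_CACHE_LIST_PAGE_BUFFER_SIZE
    value: "5"
  # Batch watch events over a longer window on very chatty clusters
  - name: ARGOCD_CLUSTER_CACHE_EVENTS_PROCESSING_INTERVAL
    value: "200ms"
```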

The GOMEMLIMIT Trick

This one comes straight from real-world battle scars. ArgoCD is written in Go, and Go’s garbage collector can be… overly relaxed. During controller restarts or full reconciliations, memory usage spikes dramatically as thousands of objects are allocated. If the spike exceeds the container’s memory limit, the OOM killer strikes.

The fix? Set the GOMEMLIMIT environment variable to ~90% of your container’s memory limit:

```yaml
env:
  - name: GOMEMLIMIT
    value: "1800MiB"  # If your limit is 2Gi
```

This tells Go’s runtime to be more aggressive about garbage collection as memory approaches the limit. You’ll trade a tiny bit of CPU for dramatically fewer OOM kills. It’s one of those “why isn’t this the default” settings.

Repo Server Timeout

When the application controller asks the repo-server to generate manifests, there’s a timeout (--repo-server-timeout-seconds, default 60s). If your Helm charts are complex (looking at you, kube-prometheus-stack), manifest generation can exceed this.

If you see Context deadline exceeded errors in the controller logs, bump this up:

```yaml
data:
  controller.repo.server.timeout.seconds: "180"
```

7. Monorepo Scaling Considerations

Ah, the monorepo. Everyone loves it until ArgoCD has to deal with it.

The repo-server maintains one clone of each repository. If manifest generation modifies files in the local clone (some tools do this), only one manifest generation can run at a time for that repo. With 50+ applications in a monorepo, this serialization kills performance.

Enable Concurrent Processing

The repo-server checks whether manifest generation has side effects and parallelizes when safe. Known scenarios and workarounds:

  1. Multiple Helm apps in the same directory: Starting with ArgoCD v3.0, Helm manifest generation is parallel by default. If you’re on an older version, add a .argocd-allow-concurrency file to the chart directory.

  2. Custom plugin apps: Avoid writing temporary files during generation. Add .argocd-allow-concurrency to the app directory, or use the sidecar plugin option which processes each application with a temporary copy of the repository.

  3. Kustomize with parameter overrides: Sorry, no workaround for this one. If you’re using Kustomize parameter overrides on multiple apps in the same repo, they’ll be serialized. Consider restructuring to avoid overrides.

Manifest Paths Annotation (This Is Huge!)

By default, ArgoCD uses the repository commit SHA as the cache key. A new commit to any file in the repo invalidates the cache for all applications in that repo. If your monorepo has 200 apps and someone changes a README, all 200 apps regenerate their manifests. Wasteful.

The argocd.argoproj.io/manifest-generate-paths annotation tells ArgoCD which paths actually matter for each application:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-service
  annotations:
    # Only regenerate if files in these paths changed
    argocd.argoproj.io/manifest-generate-paths: ".;../shared"
spec:
  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: HEAD
    path: services/payments
```

Now, if someone commits a change to services/auth/, the payments-service application won’t regenerate its manifests. This is a massive performance win for monorepos.

You can use:

  • Relative paths (. = app’s own directory, ../shared = sibling directory)
  • Absolute paths (/shared/base-configs)
  • Glob patterns (/shared/*-secret.yaml)
  • Multiple paths separated by semicolons (;)

Note: Since ArgoCD v2.11, this annotation works without configuring webhooks. You can rely on it standalone for all manifest generation optimization.

Shallow Cloning

For large repositories with extensive Git history, cloning can be painfully slow. Enable shallow cloning with depth: "1" in the repository secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-monorepo
  labels:
    argocd.argoproj.io/secret-type: repository
type: Opaque
stringData:
  depth: "1"
  type: "git"
  url: "https://github.com/myorg/k8s-manifests.git"
```

Or via CLI:

```bash
argocd repo add https://github.com/myorg/k8s-manifests.git --depth 1
```

This clones only the most recent commit instead of the full history. For a monorepo with 10,000+ commits, this can reduce clone time from minutes to seconds.


8. Rate Limiting and Workqueue Tuning

When things go wrong (and they will), applications can get into sync loops. A single misbehaving app can cause the reconciliation queue to spiral, impacting all applications. ArgoCD provides rate limiting controls to prevent this.

Global Rate Limits

A simple bucket-based rate limiter that controls how many items can be queued per second:

```yaml
env:
  - name: WORKQUEUE_BUCKET_SIZE
    value: "500"  # Max items in a single burst (default: 500)
  - name: WORKQUEUE_BUCKET_QPS
    value: "50"   # Items queued per second (default: unlimited!)
```

The default WORKQUEUE_BUCKET_QPS is MaxFloat64 — which means it’s effectively disabled. For large installations, set this to a sane value (e.g., 50-100) to prevent a cascade of reconciliations from overwhelming the controller.

Per-Item Rate Limits Link to heading

This limits how many times a specific application can be re-queued in a short period. It uses exponential backoff:

```yaml
env:
  # Enable exponential backoff (disabled by default)
  - name: WORKQUEUE_FAILURE_COOLDOWN_NS
    value: "10000000000"  # 10 seconds in nanoseconds
  - name: WORKQUEUE_BASE_DELAY_NS
    value: "1000000"      # 1ms initial backoff
  - name: WORKQUEUE_MAX_DELAY_NS
    value: "10000000000"  # 10s max backoff
  - name: WORKQUEUE_BACKOFF_FACTOR
    value: "2.0"          # Double the backoff each retry
```

The formula: if an application keeps getting re-queued before the cooldown period expires, the backoff increases exponentially. If the cooldown period passes without the item being re-queued, the backoff resets. This is incredibly useful for preventing a broken application from monopolizing controller resources.
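Concretely, with the settings above (1 ms base delay, factor 2, 10 s cap), a persistently failing application backs off like this — a quick standalone illustration of the delay curve, not ArgoCD code:

```bash
# Print the first 15 retry delays: delay_n = min(base * factor^(n-1), max)
awk 'BEGIN {
  base = 0.001; factor = 2; max = 10
  d = base
  for (i = 1; i <= 15; i++) {
    printf "retry %2d: %gs\n", i, (d < max ? d : max)
    d *= factor
  }
}'
```

By retry 15 the delay has already hit the 10 s ceiling, so a broken app settles into one reconcile attempt every ten seconds instead of hammering the queue.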


9. Monitoring, Profiling, and Debugging

You can’t optimize what you can’t measure. ArgoCD exposes Prometheus metrics on each component’s metrics port.

Key Metrics to Watch

| Metric | What to Look For |
|---|---|
| argocd_app_reconcile_duration_seconds | If this keeps climbing, your controller is struggling |
| argocd_app_sync_total | Rate of sync operations |
| argocd_cluster_api_resource_objects | Number of tracked Kubernetes objects per cluster |
| argocd_redis_request_duration | Redis latency — spikes indicate cache pressure |
| argocd_git_request_duration_seconds | Git operation latency |
| argocd_repo_pending_request_total | Queued manifest generation requests |
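To turn the first metric into a page-worthy signal, a Prometheus alert rule is a natural fit. A sketch using the Prometheus Operator's PrometheusRule CRD — the threshold is illustrative, and the exact histogram metric and label names vary across ArgoCD versions, so check your /metrics endpoint before deploying:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-reconcile-latency
  namespace: argocd
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDSlowReconciliation
          # p95 reconcile time above 60s for 15 minutes straight
          expr: |
            histogram_quantile(0.95,
              sum(rate(argocd_app_reconcile_duration_seconds_bucket[5m])) by (le)
            ) > 60
          for: 15m
          labels:
            severity: warning
```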

CPU/Memory Profiling

ArgoCD optionally exposes Go’s pprof profiling endpoint. Enable it in argocd-cmd-params-cm:

```yaml
data:
  controller.profile.enabled: "true"
```

Then profile with:

```bash
kubectl port-forward svc/argocd-metrics 8082:8082
go tool pprof http://localhost:8082/debug/pprof/heap     # Memory profile
go tool pprof http://localhost:8082/debug/pprof/profile  # CPU profile (30s)
```

This is invaluable when hunting down why a specific controller shard is using 4GB of RAM. The heap profile will show you exactly which objects are eating memory.

gRPC Performance Metrics

For deep troubleshooting, enable gRPC time histograms:

```yaml
env:
  - name: ARGOCD_ENABLE_GRPC_TIME_HISTOGRAM
    value: "true"
```

Warning: These metrics are expensive to both collect and store. Only enable them when actively debugging performance issues, then disable when done.

The Nuclear Option: Rolling Restarts

Sometimes, after a particularly bad incident (network outage, API server brownout, etc.), the reconciliation queue gets hopelessly backed up. The fastest fix? Roll the controller pods:

```bash
kubectl rollout restart statefulset argocd-application-controller -n argocd
```

This clears the queue and rebuilds the in-memory cache from scratch. It’s not elegant, but it works. Think of it as the “turn it off and on again” of GitOps.


10. Summary: The Scaling Checklist

Scaling ArgoCD is a journey of methodically removing bottlenecks. Here’s the complete playbook for a 3,000+ application environment:

Foundation

  • Deploy using HA manifests (3+ nodes, Redis Sentinel)
  • Scale argocd-server to 3+ replicas with ARGOCD_API_SERVER_REPLICAS
  • Scale argocd-repo-server to 5+ replicas with --parallelismlimit 10-20
  • Increase ARGOCD_GRPC_MAX_SIZE_MB to 400+

Sharding

  • Enable round-robin or consistent-hashing sharding
  • Scale argocd-application-controller to appropriate replica count
  • Use manual shard assignment for outlier (large) clusters
  • Consider dynamic cluster distribution (ArgoCD 2.9+) for easier scaling

Performance Tuning

  • Increase status-processors and operation-processors
  • Enable reconciliation jitter to prevent thundering herd
  • Set GOMEMLIMIT to 90% of container memory limit
  • Increase repo-server-timeout-seconds for complex charts

Monorepo Optimization

  • Add manifest-generate-paths annotations to all applications
  • Enable shallow cloning for large repos
  • Add .argocd-allow-concurrency files where applicable

Monitoring

  • Set up Prometheus + Grafana dashboards for ArgoCD metrics
  • Enable CPU/memory profiling for troubleshooting sessions
  • Configure rate limiting to prevent cascading failures

By following this playbook, your ArgoCD installation will handle 3,000+ applications like a champ. Your on-call engineers will sleep better, your developers will get faster feedback loops, and you can go back to bragging about GitOps on Reddit — this time with the receipts to back it up.


Have questions or war stories of your own? Find me on LinkedIn or reach out — I love talking about this stuff.