Raghav
How We Tamed the Thundering Herd in Our ArgoCD Multi-Tenant Setup

I manage more than 2,000 tenant deployments spread across multiple Kubernetes clusters in five regions. Each tenant gets its own namespace, its own Helm release, its own set of microservices. For a long time, our ArgoCD setup worked fine. But I kept staring at it like that one check engine light you ignore for six months — you know it's going to ruin your day eventually.

So we ripped it out and rebuilt it. Here's how that went.

The Setup That Worked (For Now)

We were using ArgoCD's App-of-Apps pattern. If you haven't seen it, the idea is simple: you have a "parent" Application that manages a bunch of child Application YAMLs stored in Git. Each child Application points to a Helm chart and a values file for a specific tenant. Push a new Application YAML, and the parent picks it up and deploys it.

For individual tenant changes — like updating an image tag or tweaking a config — this works great. You change one tenant's values file, ArgoCD detects the diff, syncs that one tenant. Clean, isolated, predictable.

The problem shows up when you need to change something that affects everyone. And spoiler alert: you always eventually need to change something that affects everyone.

The Scenario That Kept Me Up at Night

Picture this: you need to bump the Helm chart version across all tenants. Routine change. You update the chart reference, merge the PR, go grab coffee feeling like a responsible engineer.

Now imagine what ArgoCD does the moment your back is turned. It detects that every Application in the cluster — hundreds of them — is suddenly out of sync. Being the overly enthusiastic controller it is, it tries to sync every single one of them at the same time. They all kick off Helm renders, pull images, create new pods, run database migrations, and generally treat your cluster like a Black Friday checkout page.

The cluster autoscaler goes into overdrive. Nodes spin up but not fast enough. Pods get stuck in Pending. Some tenants that have heavier startup sequences — think 20+ minutes with sync waves, init containers, and post-deploy jobs — start timing out. A few get into crash loops because their dependencies aren't ready. Your monitoring dashboard lights up like a Christmas tree, except the only gift you're getting is a page at 2 AM.

We hadn't had the full-blown catastrophe yet. But we'd seen enough. Chart upgrades that made the cluster sweat. Staging deployments where half the tenants would briefly go unhealthy before settling. A few "that was close" moments during maintenance windows. Every time we pushed a shared change, I'd watch the ArgoCD dashboard like a parent watching their toddler near a swimming pool.

At 2000+ tenants and growing, it wasn't a question of if it would blow up. It was when. We decided to fix it before we found out the hard way. Call it paranoia. I call it job security.

That's the thundering herd problem. And if you're running multi-tenant on Kubernetes with GitOps, you'll eventually hit it. Or it'll hit you.

Why App-of-Apps Can't Fix This

The fundamental issue is that App-of-Apps gives you no control over how many applications sync at once. The parent Application manages child Application YAMLs. When it detects changes, it applies them all. There's no concept of "deploy tenant-1 first, wait until it's healthy, then move to tenant-2."

You could try to work around this with sync waves on the parent, but that gets messy fast. Sync wave annotations on every child Application YAML, manual orchestration of ordering. It doesn't scale, it's fragile, and honestly, if your deployment strategy requires a spreadsheet to understand, you've already lost.

We needed something that did progressive rollouts natively.

ApplicationSet with RollingSync

ArgoCD's ApplicationSet controller has a feature called Progressive Syncs (RollingSync). It's been in alpha for a while — which in Kubernetes terms means "it works, we just don't want to commit yet" — and you have to opt in by enabling it on the ApplicationSet controller (the --enable-progressive-syncs flag, or the equivalent ARGOCD_APPLICATIONSET_ENABLE_PROGRESSIVE_SYNCS environment variable). But it solves exactly this.

Instead of a parent Application managing child Application YAMLs, you define an ApplicationSet — a template that generates Application resources from a data source. Files in Git, a directory, a config map, whatever. The part that matters is the strategy field:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-tenants
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - git:
        repoURL: https://github.com/myorg/deployments.git
        revision: HEAD
        files:
          - path: "tenants/*.yaml"

  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - maxUpdate: 1
      progressDeadlineSeconds: 600

  template:
    metadata:
      name: '{{ .name }}'
      namespace: argocd
    spec:
      project: default
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{ .name }}'
      sources:
        - repoURL: https://github.com/myorg/helm-charts.git
          targetRevision: '{{ .chartVersion }}'
          path: charts/my-app
          helm:
            valueFiles:
              - '$values/tenants/values/global-values.yaml'
              - '$values/tenants/values/{{ .name }}.yaml'
        - repoURL: https://github.com/myorg/deployments.git
          targetRevision: HEAD
          ref: values

See maxUpdate: 1? That's the whole trick. When a change hits Git that makes multiple Applications out of sync, the ApplicationSet controller will only sync one at a time. It waits for that Application to reach a healthy state before moving to the next one.

Two thousand tenants that need updating? They go one by one, like civilized applications. Or set maxUpdate: "10%" to do 200 at a time if you're feeling adventurous. You control the blast radius.
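If one tenant at a time is too slow and a flat percentage too blunt, RollingSync steps can also select Applications by label, which is how the upstream Progressive Syncs docs describe staged rollouts. A sketch with a hypothetical `tier` label (the label itself would have to be set in the ApplicationSet template's metadata; it's not part of our config above):

```yaml
strategy:
  type: RollingSync
  rollingSync:
    steps:
      # Step 1: canary tenants first, strictly one at a time
      - matchExpressions:
          - key: tier
            operator: In
            values:
              - canary
        maxUpdate: 1
      # Step 2: everyone else, 10% of the fleet per batch
      - matchExpressions:
          - key: tier
            operator: In
            values:
              - standard
        maxUpdate: "10%"
```

Each step must finish (its Applications reach a healthy state) before the next step starts, so the canary group acts as a gate for the rest.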

Two Lines Per Tenant. Seriously.

This is the part that surprised me. Each tenant definition file in Git is just:

name: customer-abc
chartVersion: v2.1.0

That's it. I've written longer Slack messages deciding where to get lunch. The ApplicationSet template handles everything else — cluster destination, Helm chart path, values file paths. All derived from the tenant name using Go templates.

Want to upgrade one tenant to a new chart version? Change their chartVersion. Want to upgrade everyone? Change all the files. RollingSync makes sure it happens gradually either way.
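The fleet-wide bump is a one-liner over the tenant definition files before you commit. A minimal self-contained sketch (paths, tenant names, and versions are made up for the demo; in reality you'd run the `sed` against your repo checkout and push):

```shell
set -e
# Demo stand-in for the tenants/ directory in Git (hypothetical layout)
mkdir -p /tmp/tenants-demo && cd /tmp/tenants-demo
printf 'name: customer-abc\nchartVersion: v2.1.0\n' > customer-abc.yaml
printf 'name: customer-xyz\nchartVersion: v2.1.0\n' > customer-xyz.yaml

# Bump every tenant to the new chart version in one pass
sed -i 's/^chartVersion: .*/chartVersion: v2.2.0/' ./*.yaml

grep chartVersion ./*.yaml
```

Note the GNU `sed -i` syntax; on macOS you'd need `sed -i ''`. Once the change lands in Git, RollingSync meters out the actual syncs.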

Global Values + Tenant Overrides

We use a two-tier values hierarchy:

  1. Global values file: Shared config across all tenants — image repositories, resource defaults, common infrastructure settings. Ours is a few thousand lines. Yes, I know. No, I don't want to talk about it.
  2. Tenant values file: Just the overrides — hostname, client ID, specific image tags, feature flags. Usually 20-50 lines.

Helm merges them in order. Tenant file wins on conflicts. So when we need to update a shared setting (like a base image registry), we change one file and all tenants pick it up on their next sync — but through the RollingSync, so only a few at a time. Polite queue instead of a mosh pit.
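Concretely, the layering looks something like this (file contents are illustrative, not our real config):

```yaml
# tenants/values/global-values.yaml — shared defaults for every tenant
image:
  registry: registry.example.com
resources:
  requests:
    cpu: 250m
    memory: 512Mi

# tenants/values/customer-abc.yaml — per-tenant overrides, listed last so it wins
hostname: customer-abc.example.com
image:
  tag: v2.1.0
resources:
  requests:
    memory: 1Gi   # overrides the 512Mi default; cpu stays 250m
```

Helm deep-merges maps, so the tenant file only needs the leaves it actually changes; everything it doesn't mention falls through to the global file.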

The Migration Was Not Fun

Moving from App-of-Apps to ApplicationSet wasn't a simple config swap. The tricky part is ownership. In Kubernetes, when a controller creates a resource, it sets an ownerReference on it. ArgoCD Applications are Kubernetes resources. When the ApplicationSet controller creates an Application, it owns it.
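Ownership here is just metadata on the Application object. After adoption, the tenant Application carries a reference back to the ApplicationSet, roughly like this (illustrative; the required uid field is omitted since it's cluster-specific):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-tenant
  namespace: argocd
  ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      kind: ApplicationSet
      name: my-tenants       # the ApplicationSet now "owns" this Application
      controller: true
      blockOwnerDeletion: true
```

That ownerReference is what Kubernetes garbage collection keys on, which is exactly why the migration order below matters.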

But we already had existing Applications created by the App-of-Apps pattern. We couldn't just delete them and let the ApplicationSet recreate them — that would delete all the underlying pods and cause downtime. "Sorry customers, we're improving our deployment process" doesn't really fly when their app goes down for 20 minutes.

Here's what actually worked:

Step 1: Create the ApplicationSet alongside the existing App-of-Apps. Same Application names. ArgoCD's ApplicationSet controller has an adoption feature — if it finds an existing Application with the same name, it adds its ownerReference instead of creating a duplicate. Custody transfer, but for Kubernetes resources.

Step 2: Remove the finalizer from the App-of-Apps managed Application. This part is critical, and if you skip it, you will learn why the hard way. ArgoCD Applications typically have a resources-finalizer.argocd.argoproj.io finalizer that tells ArgoCD "when this Application is deleted, also delete all the Kubernetes resources it manages." Skip this step, and deleting the App-of-Apps will cascade-delete every pod, service, and deployment in the tenant's namespace. Ask me how I know.

kubectl patch application my-tenant -n argocd \
  --type='json' \
  -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

Step 3: Remove the tenant from the App-of-Apps. Finalizer is gone, so the Application resource gets deleted cleanly without touching the actual workloads. The ApplicationSet already owns a new Application with the same name managing the same resources. Zero downtime. The pods don't even know anything happened. Blissful ignorance.

Step 4: Repeat for each tenant. Yes, it's repetitive. We scripted it. Nobody got into SRE because they enjoy doing the same thing 2000 times.

What Broke During Testing (and What Didn't)

We ran the ApplicationSet with 5 full test deployments before going anywhere near production. Each deployment had around 20 running pods — databases, caches, API services, the whole stack.

RollingSync just works. When we pushed changes to all 5 tenants simultaneously, only one synced at a time. ApplicationSet status showed: 1 Progressing, 4 Waiting. I may have whispered "beautiful" at my monitor.

Failure isolation is real. We intentionally broke one tenant by pointing it at a non-existent image tag (the tag was literally called broken-image-test-does-not-exist because I believe in clear naming). It went into ImagePullBackOff while the other 4 stayed completely healthy. One bad apple, bunch not spoiled. Finally.

progressDeadlineSeconds does what it says. Our heavier applications take 20+ minutes to fully start (sync waves, database migrations, init containers — the whole nine yards). We set progressDeadlineSeconds: 1800 (30 minutes). When the broken tenant couldn't become healthy, the sync got terminated after 30 minutes and marked as failed. ArgoCD's retry policy kicked in — 3 retries with exponential backoff — then the ApplicationSet controller started a fresh cycle. It keeps trying until you fix the problem. You don't want a controller that silently gives up on a tenant. That's how you get paged at 3 AM asking "why is tenant-47 running last month's code?"

Fixing a broken tenant is just a Git push. We pushed corrected image tags, next sync cycle picked them up automatically. No kubectl, no manual intervention. Just fix the YAML and push. The repo is the source of truth, and the truth shall set your pods free.

The Per-Tenant Chart Version Thing

One thing we didn't plan for but ended up loving: because each tenant's definition file has its own chartVersion, we can upgrade tenants individually.

Before, with App-of-Apps, the chart version was set at the parent level. Upgrading the chart meant upgrading everyone at once. Back to the thundering herd. Now we can:

  • Roll out a new chart version to one tenant as a canary
  • Let it bake for a day
  • Gradually update the rest
  • Keep a few tenants on the old version if they need it

Chart upgrades went from nerve-wracking all-or-nothing events to boring gradual rollouts. Boring is good in operations. If your deploys are exciting, something is wrong.

Stuff I Wish Someone Told Me

The ApplicationSet controller caches Git HEAD resolution. You update a file in Git, the ApplicationSet doesn't pick it up, you panic. Don't. The controller polls on an interval. If you're impatient during testing (guilty), restarting the controller (`kubectl -n argocd rollout restart deployment argocd-applicationset-controller`) forces a fresh poll. Not elegant, but it works.

PersistentVolumes with Retain policy will bite you. Delete a namespace and recreate it (which happens during testing more than you'd like to admit), and PVs with Retain policy keep a stale claimRef pointing to the old PVC. New PVCs can't bind. You'll spend 20 minutes wondering why your database won't start before you remember this. Patch the PV to clear the claimRef:

kubectl patch pv my-pv --type json \
  -p '[{"op": "remove", "path": "/spec/claimRef"}]'

Finalizers and deletion timestamps create loops. If an Application gets a deletionTimestamp (someone tried to delete it) but still has a finalizer, it enters a weird limbo. The ApplicationSet tries to recreate it, the finalizer tries to delete its resources, and they just keep fighting. Two polite people trying to go through a door at the same time, except both of them are holding knives. Remove the finalizer and the loop stops.

YAML comments don't trigger syncs. I once confidently added # this should trigger a sync to a values file and then stared at ArgoCD wondering why it was ignoring me. Turns out ArgoCD is smart enough to know the rendered manifests haven't changed. Comments are comments. You need an actual value change. The computer was right. I was wrong. It happens.

Use separate ApplicationSets for different app types. We have two: one for lighter tenant applications (3-5 minute startup) and one for heavier full-stack deployments (20+ minute startup). Different progressDeadlineSeconds, different maxUpdate values. One ApplicationSet for both would mean either too-long timeouts for the light apps or too-short timeouts for the heavy ones. It's like setting one speed limit for both a school zone and a highway. Just don't.
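In practice that's just two ApplicationSets whose strategy blocks differ. A sketch of the relevant excerpts (numbers are roughly ours, everything else elided):

```yaml
# light-tenants ApplicationSet: fast startup, bigger batches are safe
strategy:
  type: RollingSync
  rollingSync:
    steps:
      - maxUpdate: "10%"
    progressDeadlineSeconds: 600    # 10 minutes is plenty for a 3-5 minute startup

# heavy-apps ApplicationSet: slow startup, strictly one at a time
strategy:
  type: RollingSync
  rollingSync:
    steps:
      - maxUpdate: 1
    progressDeadlineSeconds: 1800   # 30 minutes for the 20+ minute monsters
```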

What It Looks Like Now

Here's roughly our production setup (sanitized):

deployments/
├── applicationsets/
│   ├── light-tenant-1.yaml    # name + chartVersion
│   ├── light-tenant-2.yaml
│   ├── heavy-app-1.yaml
│   └── heavy-app-2.yaml
└── values/
    ├── light-tenants/
    │   ├── global-values.yaml  # shared config
    │   ├── tenant-1.yaml       # tenant overrides
    │   └── tenant-2.yaml
    └── heavy-apps/
        ├── global-values.yaml
        ├── app-1.yaml
        └── app-2.yaml

Two ApplicationSets, two sets of values, one Git repo. Adding a new tenant is: create a 2-line definition file, create a values file with the overrides, push to Git. ApplicationSet picks it up on the next poll cycle and deploys it — one at a time, without disrupting anyone else. Onboarding a new customer used to be a 30-minute process. Now it's a PR with two files.
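The whole onboarding change can be sketched as two file writes (names and paths are hypothetical, mirroring the tree above; in reality this is a branch and a PR, not a script against main):

```shell
set -e
# Demo stand-in for the deployments repo (hypothetical layout)
mkdir -p /tmp/deployments-demo/applicationsets /tmp/deployments-demo/values/light-tenants
cd /tmp/deployments-demo

# 1. Two-line tenant definition the ApplicationSet git generator picks up
printf 'name: customer-new\nchartVersion: v2.1.0\n' > applicationsets/customer-new.yaml

# 2. Per-tenant overrides layered on top of global-values.yaml
cat > values/light-tenants/customer-new.yaml <<'EOF'
hostname: customer-new.example.com
clientId: customer-new
EOF

ls applicationsets values/light-tenants
```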

So Was It Worth Ripping Everything Out?

Look, I won't pretend it was a smooth ride. The migration had some gnarly edge cases with finalizers, the testing took longer than we expected, and I definitely yelled at a YAML file more than once. But the alternative was waiting for a chart upgrade to take down a production cluster and then doing the same work under pressure with customers screaming.

When thousands of tenants sync simultaneously, failures cascade in ways that are hard to predict and harder to debug. One tenant's resource pressure affects another tenant's pod scheduling. Database connection pools spike across the board. Monitoring alerts fire for tenants that are technically fine but just slow because the cluster is overloaded. It's chaos, and not the fun Netflix kind.

With RollingSync, our deployments are boring. One tenant deploys, gets healthy, next one goes. If something breaks, it breaks one tenant and we fix it. The other thousands don't even notice.

If you're running multi-tenant on Kubernetes with ArgoCD and you haven't hit the thundering herd yet — you will. ApplicationSet with RollingSync is the fix. It's alpha, it's got rough edges, but it works. And it turns your most stressful operational day into something you can do without canceling lunch.

Now if you'll excuse me, I have 2000+ tenants to not worry about.


I'm an SRE who runs multi-tenant Kubernetes clusters for a living. I write about the stuff that breaks at 2 AM and how to make sure it doesn't. If you're dealing with similar problems, drop a comment — misery loves company, and so does Kubernetes troubleshooting.