Justin Brooks

Posted on Feb 23 • Originally published at ctrlplane.dev

Why GitOps Doesn't Work at Scale (and What to Do Instead)

#devops #development #cicd #git

People talk about GitOps like it is the final form of delivery. In real life, it depends a lot on scale.

I have spent years helping teams go from one multi-tenant instance to hundreds of single-tenant instances. GitOps was useful early. At large scale, though, it became a constant fight.

One formula captures it well: P(failure) = 1 - p^n.

Here p is the probability that any single change succeeds, and n is the number of moving parts you have to coordinate. The formula assumes those parts fail independently of one another. As n grows, failure risk climbs fast even if each single change is "pretty safe."

For example: you are deploying one release to 100 single-tenant customer environments, and each environment sync has a 99% success rate.

p = 0.99 (one environment sync succeeds 99% of the time)
n = 100 (100 environment syncs in the rollout wave)
1 - 0.99^100 ≈ 0.634

So that rollout has about a 63% chance that at least one customer environment fails to deploy cleanly on the first pass.
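
The arithmetic is easy to check yourself. Here is a minimal Python sketch, assuming every sync succeeds independently with the same probability. The function name `rollout_failure_probability` is mine, purely for illustration.

```python
def rollout_failure_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent steps fails,
    when each step succeeds with probability p."""
    return 1 - p ** n


if __name__ == "__main__":
    # Same per-environment success rate as the example above (99%),
    # evaluated for a few fleet sizes.
    for n in (1, 10, 50, 100, 200):
        risk = rollout_failure_probability(0.99, n)
        print(f"n={n:>3}  P(failure)={risk:.3f}")
```

Real syncs are rarely fully independent (a bad release tends to fail everywhere at once), so treat the numbers as a rough guide rather than a prediction.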

P(failure)
1.00 ┬
     │
0.80 │                                                         ●
     │                                           ●
0.60 │                                 ●
     │                         ●
0.40 │                   ●
     │              ●
0.20 │         ●
     │    ●
0.00 └───────────────────────────────────────────────────────────────
       0     20     40     60     80     100    120    140    160   n

Formula-wise, you only have two levers:

reduce n (fewer independent steps per rollout)
increase p (make each step more reliable)

GitOps alone does not raise p for you. To improve p, you need other tooling and controls like preflight checks, dependency validation, rollout orchestration, retries, and policy guardrails.
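
To get a feel for how much each lever moves the result, here is a rough Python sketch. It assumes steps are independent and that a retry is just another independent attempt, which only holds for transient failures. Both function names are hypothetical.

```python
def failure_probability(p: float, n: int) -> float:
    """Chance that at least one of n independent steps fails."""
    return 1 - p ** n


def success_with_retries(p: float, retries: int) -> float:
    """Per-step success rate when a failed step is retried up to
    `retries` times and every attempt is independent (an assumption)."""
    return 1 - (1 - p) ** (retries + 1)


if __name__ == "__main__":
    baseline = failure_probability(0.99, 100)

    # Lever 1: reduce n by rolling out to 10 environments at a time.
    smaller_wave = failure_probability(0.99, 10)

    # Lever 2: increase p by allowing one retry per environment.
    with_retry = failure_probability(success_with_retries(0.99, 1), 100)

    print(f"baseline (n=100):        {baseline:.3f}")
    print(f"one wave of 10:          {smaller_wave:.3f}")
    print(f"n=100 with one retry:    {with_retry:.3f}")
```

With these assumed numbers, a wave of 10 has roughly a 10% chance of at least one failure, and a single retry brings the full 100-environment rollout down to about 1%. Smaller waves do not make the whole fleet safer on their own, but they limit how much is exposed before you get a chance to stop.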

Where GitOps works

GitOps is great when:

you have a small number of environments
ownership is clear
changes are low risk
teams are disciplined with review and automation

In that setup, Git gives you clean history, solid audit trails, and predictable rollouts.

small scale

dev -> PR -> merge -> deploy -> done
         (few moving parts, easy to reason about)

Where it starts to hurt

Once you have a big fleet, a few things happen fast.

Pull requests become your release system

Every deployment turns into repo choreography. More branches, more approvals, more waiting. You start optimizing for merge flow instead of delivery outcomes.

large scale

CI -> PR -> approval -> merge -> sync
        \-> policy check -> rebase -> approval -> merge -> sync
                      \-> hotfix PR -> cherry-pick -> re-sync

Rollbacks are not simple anymore

Rolling back one service is easy. Rolling back a whole environment with dependencies is not. Git can show you what changed, but it cannot restore all runtime conditions.

Config sprawl gets expensive

At scale, you end up with endless overrides: customer-specific, region-specific, compliance-specific, and emergency patches. The issue is not YAML itself. The issue is how much state humans must keep in their heads.
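
To make the sprawl concrete, here is a small Python sketch of how layered overrides resolve. The layer names and keys are invented for illustration and do not come from any particular tool.

```python
from typing import Any


def resolve_config(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers left to right. Later layers win, and nested
    dicts are merged recursively instead of being replaced."""
    result: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict):
                base = result.get(key)
                base = base if isinstance(base, dict) else {}
                # Recursing also copies the nested dict, so inputs stay untouched.
                result[key] = resolve_config(base, value)
            else:
                result[key] = value
    return result


if __name__ == "__main__":
    # Hypothetical layers for one customer environment.
    defaults = {"replicas": 2, "resources": {"cpu": "500m", "memory": "512Mi"}}
    region = {"resources": {"memory": "1Gi"}}
    compliance = {"audit_logging": True}
    customer = {"replicas": 4}
    hotfix = {"resources": {"cpu": "1"}}

    print(resolve_config(defaults, region, compliance, customer, hotfix))
```

Five layers already make it hard to say at a glance which file sets the CPU limit. Multiply that by hundreds of environments and the cost is obvious.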

Out-of-band changes become normal

This is the part people avoid saying out loud.

At scale, teams will make changes outside GitOps. During incidents, during customer escalations, during vendor outages. Not because they are careless, but because they are solving an immediate problem.

If your model assumes that never happens, it is too idealistic for enterprise operations.
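
A more realistic model detects those changes instead of pretending they never happen. Here is a minimal Python sketch of the idea, assuming desired and live state can both be flattened to simple key/value pairs. The `detect_drift` name and the sample keys are hypothetical.

```python
from typing import Optional

# Maps a setting name to (desired value, live value); None means "absent".
Drift = dict[str, tuple[Optional[str], Optional[str]]]


def detect_drift(desired: dict[str, str], live: dict[str, str]) -> Drift:
    """Return every key whose live value differs from what Git declares."""
    drift: Drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want, have = desired.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    desired = {"image": "api:1.4.2", "replicas": "3"}
    # Someone scaled up and enabled debug logging during an incident.
    live = {"image": "api:1.4.2", "replicas": "5", "debug": "true"}

    for key, (want, have) in detect_drift(desired, live).items():
        print(f"{key}: desired={want!r} live={have!r}")
```

What to do with the result is a policy decision: revert it, accept it back into Git, or flag it for a human.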

The split you see in GitOps opinions

GitOps lovers and GitOps haters are usually dealing with different scales.

At small scale, GitOps feels clean.

At enterprise scale, repo-centric workflows become too low-level for the job.

That is the real mismatch.

What actually works better

Do not throw away GitOps. Just stop treating Git as the entire control plane.

Use Git for intent and auditability. Add platform-level orchestration for:

preflight checks (before rollout starts, not after breakage)
strong defaults (safe rollout strategy, retries, timeouts, guardrails)
dependency validation (service A should not move before dependency B is healthy)
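
As a rough illustration of dependency validation, here is a Python sketch using the standard library's `graphlib`. The service names and the idea of a `healthy` set are assumptions made for the example, not part of any specific orchestrator.

```python
from graphlib import TopologicalSorter


def rollout_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order services so each one comes after everything it depends on.
    `dependencies` maps a service to the services it needs first."""
    return list(TopologicalSorter(dependencies).static_order())


def blocking_dependencies(
    dependencies: dict[str, set[str]], healthy: set[str], service: str
) -> list[str]:
    """Preflight check: list the dependencies of `service` that are
    not healthy yet. An empty list means it is clear to roll out."""
    return sorted(dependencies.get(service, set()) - healthy)


if __name__ == "__main__":
    # Hypothetical service graph.
    deps = {"web": {"api"}, "api": {"db", "cache"}, "db": set(), "cache": set()}

    print("order:", rollout_order(deps))
    print("api blocked by:", blocking_dependencies(deps, {"db"}, "api"))
```

The point is that the check runs before the rollout starts. Git can record that `api` depends on `db`, but something still has to look at live health and refuse to proceed.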

desired model

Git (intent) ---> Orchestrator ---> Fleet of environments
                    |      |                |   |   |
                    |      +-> policy       e1  e2  e3 ... eN
                    +-> rollout waves
                    +-> drift detection
                    +-> recovery paths

At the platform level, that orchestration covers:

rollout waves
dependency ordering
policy enforcement
drift detection
safe recovery after out-of-band changes

This is the key point: you need a tool that can actively orchestrate and enforce these runtime controls. GitOps alone cannot provide that. Git can store desired state. It does not run rollout logic, cross-environment safety checks, or live dependency coordination by itself.
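
To show what "actively orchestrate" can mean in practice, here is a simplified Python sketch of wave-based rollout with a halt condition. It is an illustration under my own assumptions, not how any specific product works: `deploy` is a stand-in callable that returns True on success, and all names are hypothetical.

```python
from typing import Callable


def make_waves(environments: list[str], sizes: list[int]) -> list[list[str]]:
    """Split environments into waves of the given sizes.
    Whatever is left over goes into one final wave."""
    waves, start = [], 0
    for size in sizes:
        if start >= len(environments):
            break
        waves.append(environments[start:start + size])
        start += size
    if start < len(environments):
        waves.append(environments[start:])
    return waves


def run_rollout(
    waves: list[list[str]],
    deploy: Callable[[str], bool],
    max_failures_per_wave: int = 0,
) -> dict:
    """Deploy wave by wave. If a wave exceeds the failure budget,
    stop and report every environment that was never touched."""
    succeeded: list[str] = []
    failed: list[str] = []
    for index, wave in enumerate(waves):
        wave_failed = [env for env in wave if not deploy(env)]
        succeeded.extend(env for env in wave if env not in wave_failed)
        failed.extend(wave_failed)
        if len(wave_failed) > max_failures_per_wave:
            skipped = [env for later in waves[index + 1:] for env in later]
            return {"succeeded": succeeded, "failed": failed,
                    "skipped": skipped, "halted": True}
    return {"succeeded": succeeded, "failed": failed,
            "skipped": [], "halted": False}


if __name__ == "__main__":
    envs = [f"e{i}" for i in range(1, 11)]
    waves = make_waves(envs, [1, 3])  # canary, small wave, then the rest

    # Pretend e3 fails to sync.
    print(run_rollout(waves, deploy=lambda env: env != "e3"))
```

In this run the failure in the second wave stops the rollout before the remaining six environments are touched. That decision is made at runtime, which is exactly the part a Git repository cannot make for you.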

That is the practical model: GitOps as an input, not the whole operating system.

Final take

GitOps is good. Pure GitOps at enterprise scale usually is not.

The bigger you get, the more you need orchestration that lives above pull requests.
