Release on Demand

#cicd #devops #kubernetes #terraform

There's a calendar invite on your team's shared calendar called something like "Release Window." Maybe it's every other Thursday at 2pm. There's a change freeze the day before. A go/no-go thread. A rollback plan nobody's tested. And a quiet, shared understanding that you do not, under any circumstances, ship after lunch on a Friday. That invite is not a process. It's a confession.
It says: we don't trust our own releases, so we've agreed to only do the scary thing on a schedule, in daylight, with everyone watching. Call it discipline if you want. It's really a workaround for a capability the team never built.

The doom loop nobody names

Here's the trap, and most teams are living in it without naming it. There's a gap between "this commit passed CI" and "this code is safe to ship." For a lot of teams that gap is measured in calendar weeks. Every change pays it. So you do the rational thing: you batch. You wait until you've got enough changes to justify the ceremony, then you ship them all at once.
Except batching is the thing that makes releases dangerous. A release with one change in it is easy to reason about. A release with forty changes in it is a haystack with an unknown number of needles. So the big batch needs more testing, more soak, more sign-off, which makes it slower to ship, which means the next batch is even bigger by the time it goes. Round and round. The schedule tightens. The freezes get longer. The hotfixes start going straight to prod outside the pipeline entirely, because the pipeline is the slow path and prod is on fire.
I've watched teams answer this with more process. A freeze before the freeze. A release captain. A go/no-go meeting with more people in the room every quarter. Every one of those is management layered on top of the fear, never a fix for the thing causing it.
That's the loop: the less you trust a release, the bigger you make it, and the bigger you make it, the less it deserves your trust. The schedule is just where you've parked the anxiety.
The schedule isn't keeping you safe. It's keeping a problem comfortable enough that you never have to fix it.

Faster doesn't fix it

The instinct is to make the engine faster. More CI workers. Parallel test shards. And now a fleet of agents that can open a few dozen correct-looking merge requests a day without coffee or ego. All real, all good. None of it touches the actual problem.
A faster engine on the same track just gets you to the same bottleneck sooner. If anything, cheap generation makes the gap worse: more commits piling up behind the same release window, more haystack per batch. You can't out-produce a trust problem. The fastest way to ship more code safely was never "type faster." It's to close the gap between green and shippable so completely that the schedule has nothing left to protect.
And the data's been clear on this for years. The DORA research (the Accelerate body of work) keeps finding the same counterintuitive result: the teams that deploy most often are also the most stable. Speed and safety rise together, because the same practices that make releases fast (small changes, strong automation, fast feedback) are the ones that make them safe. The teams shipping on demand have lower change-failure rates than the ones shipping on a quarterly calendar, not higher.
So the goal flips. Make every green commit shippable, and make the rare miss recoverable before a user ever notices. Two stacks do that work: a confidence stack on the way in, a recovery stack for when something slips through anyway.

Main is the train

Start with the thing that makes a release a non-event: stop treating "release" as a separate step at all.
Every commit on main is a candidate release. No release branch, no cut, no separate deploy job that someone runs. The way code gets to main is through a merge train: an integration gate that tests each change against the projected merge result, meaning your change combined with main plus everything queued ahead of you, all run as the commit that will actually land. (Merge train is GitLab's name for it; GitHub ships the same thing as a merge queue, and the principle predates both as Graydon Hoare's Not Rocket Science Rule.) If the train's pipeline is green, the exact commit that hits main has been proven integrated against the exact state it's joining.
Which kills a whole category of theater. Re-running CI on main after the merge tells you nothing the train didn't already prove. And a change that would break main never gets there. It fails its car on the train and gets pulled before it ever touches the branch. No "main is red at 2pm" scramble. No revert dance. No bisecting at midnight to find whose merge poisoned the well. The bad car comes off the track; the train keeps moving.
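To make the gate concrete, here is a minimal sketch of the train's logic in plain Python. It is a toy model, not GitLab's or GitHub's implementation: real trains run pipelines for several cars speculatively in parallel, while this one walks the queue in order. The names `run_train`, `demo_pipeline`, and the tuple-of-strings model of main are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # main, modeled as an ordered tuple of change names


def run_train(
    main: State,
    queue: List[str],
    pipeline: Callable[[State], bool],
) -> Tuple[State, List[str]]:
    """Process queued changes in order and return (new_main, pulled_cars)."""
    pulled: List[str] = []
    for change in queue:
        # The projected merge result: main as it will be, including every
        # earlier car that already passed, plus this change.
        projected = main + (change,)
        if pipeline(projected):
            main = projected  # the exact state that was tested is what lands
        else:
            pulled.append(change)  # the bad car comes off; main is untouched
    return main, pulled


def demo_pipeline(state: State) -> bool:
    # Hypothetical failure mode: "b" is fine alone but breaks alongside "a".
    return not ({"a", "b"} <= set(state))


new_main, pulled = run_train(("base",), ["a", "b", "c"], demo_pipeline)
print(new_main, pulled)
```

The detail worth noticing is that "b" would have passed CI on its own branch. It only fails against the projected state that includes "a", which is exactly the class of breakage a post-merge CI run discovers too late.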
This is the difference between a green pipeline and a green build, and it matters more than it sounds. A pipeline that retries flaky tests until they pass, lets important checks warn-and-continue, and sets a coverage target low enough to clear in its sleep is a green light wired to nothing. A signal is only real once it has a measurement and a threshold. Green has to mean green, or every layer you build on top of it is built on a lie.
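One way to picture "a measurement and a threshold" is a gate where every check carries both, and nothing is allowed to warn and continue. This is a sketch under assumptions: the `Check` type, the `gate` function, and the example numbers are hypothetical, not taken from any particular CI system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Check:
    name: str
    measured: float
    threshold: float
    higher_is_better: bool = True

    def passes(self) -> bool:
        # A check is only a signal because it compares a number to a bar.
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold


def gate(checks: List[Check]) -> List[str]:
    """Return the names of failing checks; an empty list means green."""
    return [c.name for c in checks if not c.passes()]


# Illustrative values only.
checks = [
    Check("line_coverage_pct", measured=81.0, threshold=80.0),
    # Zero tolerated retries: a test that needed a retry counts as a failure.
    Check("flaky_test_retries", measured=2, threshold=0, higher_is_better=False),
]
failing = gate(checks)
print(failing)
```

Here the pipeline is red because two tests only passed on retry, even though coverage cleared its bar. That is the behavior you want from a gate that means what it says.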

Build once, promote via overlay

When a change earns its way onto main, the train builds the container image one time and tags it by the commit SHA. That image is the artifact. It runs in dev immediately. And here's the part that makes promotion boring: shipping to prod doesn't rebuild anything. It re-points a config overlay (a Kustomize overlay, in practice, committed to the same repo) at the same SHA that's already running in dev. Same bytes, different environment.
Promotion is a routing decision, not a build decision. The thing you tested in dev is the literal thing that runs in prod. Not a rebuild from the same commit and a hope. The same image.
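As a rough illustration of "same bytes, different environment," the sketch below models an overlay as a Python dict shaped like the `images` list in a Kustomize `kustomization.yaml` (entries with `name` and `newTag`). Everything else is assumed for the example: the `promote` and `tag_of` helpers, the registry name, and the short SHAs are hypothetical, and in practice the edit would be a commit to the overlay file rather than a dict update.

```python
import copy
from typing import Any, Dict

Overlay = Dict[str, Any]


def tag_of(overlay: Overlay, image_name: str) -> str:
    """Read the tag an overlay currently pins for image_name."""
    for image in overlay.get("images", []):
        if image["name"] == image_name:
            return image["newTag"]
    raise KeyError(f"{image_name} not found in overlay")


def promote(overlay: Overlay, image_name: str, sha: str) -> Overlay:
    """Return a copy of the overlay with image_name re-pointed at sha."""
    updated = copy.deepcopy(overlay)  # leave the input untouched
    for image in updated.get("images", []):
        if image["name"] == image_name:
            image["newTag"] = sha  # no rebuild: only the pointer changes
            return updated
    raise KeyError(f"{image_name} not found in overlay")


APP = "registry.example.com/app"  # hypothetical image name
dev = {"images": [{"name": APP, "newTag": "3f9c2ab"}]}
prod = {"images": [{"name": APP, "newTag": "a41b7d0"}]}

# Promotion: prod now points at the SHA already running in dev.
prod_next = promote(prod, APP, tag_of(dev, APP))

# Rollback is the same operation aimed at the previous SHA.
prod_rolled_back = promote(prod_next, APP, tag_of(prod, APP))
```

Promotion and rollback end up being the same function with a different argument, which is most of why neither one needs a build step.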
That cleanly splits two worlds that most setups jam into one giant deploy job. The slow-moving substrate (the cluster, the network, IAM, the shape of your manifests) changes on the order of a quarter and belongs to Terraform, ideally run through something like Terraform Cloud so applies are deliberate and auditable instead of run from someone's laptop. The fast-moving payload (image refs, rollouts, the things that change every commit) belongs to Git, reconciled continuously by a GitOps controller. Argo CD is the common one: it watches the repo, diffs the declared state against what's actually running in the cluster, and makes the cluster match. Change the overlay in Git, Argo CD notices and applies it. Flux does the same job if you prefer it. TF for the building, Git for the lights. The rule is simple: if it changes per release, Git owns it; if it changes per quarter, Terraform owns it. Match the tool to the rate of change and the six-hour deploy job that rewrites your infrastructure mid-flight just stops existing.
One more piece falls out of this: tags are markers, not artifacts. The cluster never deploys a tag. The tag is a point-in-time anchor that says "this commit's state was live in prod," useful for audit, for rollback targeting, and for release notes. It records what shipped and when. Nothing reads it to decide what runs.

The canary is the only honest soak

Most teams have a staging or beta environment that exists to "soak" changes before prod. I've shipped behind plenty of them, and here's what I learned the hard way: be honest about what one actually does. It runs the wrong shape of data and the wrong shape of traffic, so passing it only tells you a change survives conditions no real user will ever create. It's a dress rehearsal in an empty theater. The thing that's green in staging breaks the minute it meets production load and production-shaped data, and you find out from a customer.
The only place real confidence accrues is real traffic. So make prod itself the soak, one slice at a time. A new release enters at 1% of traffic and walks itself up: 1 to 5 to 25 to 50 to 100, and at every step it bakes against live signal (error rate, latency, worker health, plus a real-user-monitoring read on whether people are completing flows). Any red, at any stage, rolls back automatically to the last good release.
This is exactly what a progressive-delivery controller like Argo Rollouts exists to do. It owns the canary as a literal state machine: weight traffic up a step, run an analysis check against your metrics, advance if it's clean, roll back on its own if it isn't. You point the analysis at one internal health endpoint that aggregates the signals you care about, and the controller polls it at each gate. (If you're on a service mesh, Flagger covers the same ground.)
The reason this has to be automated is human nature. Nobody has the patience to sit and stare at the 1% dashboard and calmly decide it looks fine. They glance, it looks okay, they wave it through. So don't make them. The canary is a state machine. The signal decides; the train moves. A human only gets paged when something's actually wrong, into context that's already populated.
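The state machine itself is small enough to sketch. This is a plain-Python model of the idea, not Argo Rollouts' API or configuration: `run_canary` and `healthy_at` are hypothetical names, and `healthy_at` stands in for polling the aggregated health endpoint at each gate. The step weights are the ones described above.

```python
from typing import Callable, List, Sequence, Tuple

STEPS = (1, 5, 25, 50, 100)  # percent of traffic at each stage


def run_canary(
    healthy_at: Callable[[int], bool],
    steps: Sequence[int] = STEPS,
) -> Tuple[str, List[int]]:
    """Walk traffic weight up step by step; roll back on the first red.

    Returns the final state and the list of weights that were reached.
    """
    visited: List[int] = []
    for weight in steps:
        visited.append(weight)  # shift this share of traffic to the canary
        if not healthy_at(weight):  # analysis against live signal
            return "rolled_back", visited  # any red, at any stage
    return "promoted", visited


# A release that stays healthy at every stage goes all the way.
clean = run_canary(lambda weight: True)

# A release that only degrades under real load fails at the 25% gate.
degrades_under_load = run_canary(lambda weight: weight < 25)
print(clean, degrades_under_load)
```

No human appears anywhere in the loop. The only input is the signal, and the only outcomes are "promoted" or "rolled_back".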
That same logic extends to the one thing you genuinely can't canary: infrastructure. You can't give 1% of users a new VPC CIDR. But the train can carry the infra change and apply it to dev as part of the car's pipeline, then check that dev came back up green before the change is allowed to merge. If the change broke dev, the merge request doesn't merge. So main and dev are green by construction. A broken state can't reach them, because the proof of merge-worthiness is "dev survived this." Dev becomes the canary for ops.
Stack those layers and they overdetermine the outcome. Real-shape pre-merge tests, an honest readiness probe, the statistical canary, an eyes-on check for the visual breakage that never throws an exception, internal users before external on every flag. Each catches a different class of failure. For a bug to reach prod and stay there, it has to evade all of them. That's what "shippable by default" actually means: not that nothing ever breaks, but that breaking and staying broken requires beating five independent defenses.

Recovery is the other half of courage

Confidence on the way in is only half the deal. The other half is knowing that when something slips through, and it will, you can pull it back before it matters. Recovery, ordered by how fast each lever is:

- Flag kill, in milliseconds. If the bad behavior is behind a real backend flag (LaunchDarkly, Unleash, or anything that speaks OpenFeature), flip it off. No deploy, no rebuild. The code's still in prod, just not running. This is why a real flag system matters: without it, a "feature flag" is just a config value read at startup, and flipping it costs you a whole deploy. Real flags decouple deploy from release. You ship the code dark and turn it on when you're ready.
- Overlay rollback, in minutes. Re-point the prod overlay at the previous SHA. Same train, opposite direction. The prior image is already built and already proven, so there's nothing to rebuild. Your delivery tooling unwinds the canary back to the last good release on its own.
- Hotfix on the same train. Not a separate "emergency" pipeline that's less tested precisely because it's used least. The fix rides the normal train, and because the train is signal-driven, a clean change advances through the canary fast. The urgent path and the everyday path are the same path. That's the point.

When rollback is "git revert" instead of "reprovision the world and pray," shipping stops feeling dangerous. You can build a fleet of agents on top of this, each one watching a signal and taking a bounded action (open a fix MR off a Sentry spike, close a coverage gap, quarantine a flaky test), and you don't have to bet on the agents being right. You bet on the same gates you already trust to ship human work. The train is the safety net for all of it.

Burn the calendar invite

Go back to that Release Window on the calendar. Once main is the train, once every green commit is the artifact, once the canary does the soak that beta only pretended to and rollback is a one-line revert, what is the window even protecting you from? The freeze was insurance against a release you couldn't trust. Build the trust into the track and the insurance is just friction you're still paying premiums on.
I've said before that if you can't release any time, any day, Friday afternoon included, you don't have a courage problem, you have a tracks problem. This is what the tracks look like up close. The courage isn't something you summon in the go/no-go meeting. It's something the system gives you for free, every commit, all day, because it has earned the right to.
So here's the real question: if shipping were genuinely safe at any hour, with no schedule and no war room, how much of your release process would you keep? Whatever survives that cut is the real work. The rest was just managing fear.

Originally published at imacto.com. Written by Jason Waldrip, with Claude Opus 4.8.