DEV Community

Samson Tanimawo

How We Reduced Our Deployment Failure Rate to Under 2%

Two years ago our deployment failure rate was around 18%. Today it's under 2%. Here's what we actually changed: no silver bullet, just boring discipline.

The changes that moved the needle

1. Required CI green on the exact commit being deployed. Sounds obvious. Wasn't. People were deploying from branches that hadn't finished CI. We made it impossible.
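The gate is simple to sketch. This is a minimal illustration, not our actual tooling: the `statuses` lookup stands in for whatever your CI provider's API returns, and the function names are hypothetical.

```python
def ci_green_for_commit(sha: str, statuses: dict) -> bool:
    """True only if CI reported success for this exact SHA.

    `statuses` maps commit SHA -> CI state ("success", "pending",
    "failure"). A SHA that CI hasn't seen yet (e.g. a branch deployed
    before CI finished) counts as not green.
    """
    return statuses.get(sha) == "success"


def deploy(sha: str, statuses: dict) -> str:
    # Hard gate: no override flag, no "it's probably fine" path.
    if not ci_green_for_commit(sha, statuses):
        raise RuntimeError(
            f"refusing to deploy {sha}: CI is not green on that exact commit"
        )
    return f"deployed {sha}"
```

The key detail is checking the deployed SHA itself, not the branch head: a green branch tells you nothing if someone pushed after the last CI run.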

2. Database migrations run in a separate step. Before, schema changes and code changes landed together. Migration failure = full rollback. We split them: migrations run first, deploy second. Halved the rollback cost.
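The ordering logic looks like this. A hedged sketch with hypothetical callables, assuming migrations are backward-compatible with the old code (which is what makes the split safe):

```python
def release(run_migrations, deploy_code, rollback_code):
    """Two-phase release: schema first, code second.

    - If the migration fails, no code has shipped yet, so there is
      nothing to roll back: the release just stops.
    - If the deploy fails, we roll back code only; the migration
      (backward-compatible by policy) stays in place.
    """
    run_migrations()       # phase 1: cheap to fail, nothing to undo
    try:
        deploy_code()      # phase 2
    except Exception:
        rollback_code()    # undo code only, never the schema
        raise
```

This is where the "halved the rollback cost" came from: a failed migration no longer drags a code rollback with it, and vice versa.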

3. Canary to 1% before full rollout. Any canary anomaly (error rate, latency, traffic drop) auto-aborts the deploy. Automated, not manual.
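The abort decision can be sketched as a pure comparison of canary metrics against the stable baseline. The threshold ratios here are illustrative placeholders, not our production values:

```python
def canary_healthy(baseline: dict, canary: dict,
                   max_error_ratio: float = 2.0,
                   max_latency_ratio: float = 1.5,
                   min_traffic_ratio: float = 0.5) -> bool:
    """Return False (auto-abort) if the 1% canary looks anomalous
    relative to the baseline fleet.

    Checks the three signals from the post: error rate, latency,
    and a traffic drop (requests simply vanishing is also a failure).
    """
    if canary["error_rate"] > baseline["error_rate"] * max_error_ratio:
        return False
    if canary["p99_latency_ms"] > baseline["p99_latency_ms"] * max_latency_ratio:
        return False
    if canary["rps"] < baseline["rps"] * min_traffic_ratio:
        return False
    return True
```

The point is that this function runs in the pipeline, not in a human's head: if it returns `False`, the deploy aborts without anyone being paged to decide.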

4. Pre-deploy health checks against dependencies. Before we start deploying service X, we check that its downstream dependencies are healthy. Deploy fails fast if they're not.
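A minimal version of the fail-fast check, assuming each dependency exposes some health probe (the `probe` callable is a stand-in for an HTTP health-endpoint call):

```python
def unhealthy_dependencies(deps, probe):
    """Return the downstream dependencies that fail their health probe.

    `probe(name)` returns True if that dependency's health endpoint
    is OK. A non-empty result aborts the deploy before it starts,
    instead of letting a half-deployed service X discover a broken
    dependency mid-rollout.
    """
    return [d for d in deps if not probe(d)]
```

Cheap to run, and it turns "X is failing because Y was already down" incidents into a one-line abort message.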

5. Friday deploy ban. We're not religious about it: if it's urgent, we deploy. But the default is no. This immediately eliminated roughly 15% of our incidents, because Monday is a far cheaper day than Friday for diagnosis.
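Even the ban is a system change, not a memo: the pipeline enforces it. A toy sketch of the default-deny rule with an urgent override:

```python
from datetime import date

def deploy_allowed(today: date, urgent: bool = False) -> bool:
    """Default-deny on Fridays (weekday() == 4 in Python's calendar);
    the `urgent` flag is the explicit, logged escape hatch."""
    return urgent or today.weekday() != 4
```

The override being a flag (rather than "just ask someone to look the other way") means every Friday deploy leaves a trace of who decided it was urgent.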

What didn't work

  • More tests in CI (we had enough)
  • Manual approval gates (just added delay)
  • Slack notifications to 'watch' deploys (nobody watches)

The meta-lesson

Deployment reliability is a systems problem, not a willpower problem. Every fix we made was a system change, not an "everyone, please be more careful" memo. Those memos never work.

2% is still our target to beat. We'll get there with the next round of boring discipline.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
