The Invisible SPOF
Every engineering org has a single point of failure that nobody lists on their risk register: the deploy pipeline itself.
When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.
We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.
Categorizing the Risk
Your pipeline consists of:
source_control:          # GitHub, GitLab, Bitbucket
  failure_mode: "can't merge PRs"
ci_runners:              # GitHub Actions, CircleCI, self-hosted
  failure_mode: "builds don't run"
artifact_storage:        # ECR, Artifactory, S3
  failure_mode: "images don't build or push"
deployment_controller:   # ArgoCD, Flux, Spinnaker
  failure_mode: "deploys don't happen"
cluster_api:             # k8s API, cloud provider API
  failure_mode: "resources don't change"
Each layer is a failure domain. A serious pipeline needs fallback plans for each.
The Manual Escape Hatch
Rule #1: You must have a documented path to deploy manually.
Not for daily use; for emergencies only. Every team should know:
- How to build the image locally
- How to push to the registry
- How to update the cluster without the normal pipeline
- Who has permission to do this in production
We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.
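A minimal sketch of that manual path might look like this. The service name, registry, and namespace below are placeholders; substitute whatever your pipeline normally produces.

#!/usr/bin/env bash
# Manual-deploy sketch. Service name, registry, and namespace are placeholders.
set -euo pipefail

SERVICE=payments-api                      # placeholder service
TAG=$(git rev-parse --short HEAD)         # label the image with the commit
IMAGE="ghcr.io/yourorg/$SERVICE:$TAG"

docker build -t "$IMAGE" .                    # 1. build the image locally
docker push "$IMAGE"                          # 2. push to the registry
kubectl -n prod set image "deployment/$SERVICE" \
  "$SERVICE=$IMAGE"                           # 3. update the cluster, no pipeline
kubectl -n prod rollout status "deployment/$SERVICE" --timeout=5m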
Hardening the Pipeline Itself
1. Pin your dependencies
# BAD
uses: actions/checkout@main
# GOOD
uses: actions/checkout@v4.1.1
If actions/checkout@main breaks, your deploys break. Pin to an exact version, or better still to a full commit SHA, since tags can be moved after the fact.
2. Mirror your registries
registry:
  primary: ghcr.io/yourorg
  fallback: <aws_account_id>.dkr.ecr.<region>.amazonaws.com/yourorg
When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.
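One sketch of how to enforce that: push every production image to both registries in the same build step, so the mirror can never silently drift. The ECR account ID and region below are placeholders.

#!/usr/bin/env bash
# Sketch: publish each image to a primary and a fallback registry.
# The ECR account ID and region are placeholders.
set -euo pipefail

TAG=$(git rev-parse --short HEAD)
PRIMARY="ghcr.io/yourorg/payments-api:$TAG"
FALLBACK="123456789012.dkr.ecr.us-east-1.amazonaws.com/yourorg/payments-api:$TAG"

docker build -t "$PRIMARY" .
docker tag "$PRIMARY" "$FALLBACK"
docker push "$PRIMARY"
docker push "$FALLBACK"    # fail the build loudly if the mirror push fails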
3. Monitor the pipeline
You probably monitor your services. Do you monitor your CI?
pipeline_metrics:
  - build_success_rate     (target: >99%)
  - deploy_duration_p99    (target: <5 min)
  - time_to_rollback_p99   (target: <2 min)
  - runner_queue_depth     (target: <5)
Alert on these the same way you'd alert on a service.
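If you use GitHub Actions, one way to start is a sketch like this, which derives build_success_rate from the last 100 workflow runs via the GitHub CLI. It assumes gh is installed, authenticated, and run from inside the repo.

#!/usr/bin/env bash
# Sketch: compute build_success_rate from recent workflow runs.
set -euo pipefail

gh run list --limit 100 --json conclusion --jq '
  ([.[] | select(.conclusion == "success")] | length) as $ok
  | "build_success_rate: \(100 * $ok / length)%"'

Feed a number like this into your normal metrics pipeline and alert on it like any other SLO.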
4. Test disaster modes
Can you ship if GitHub Actions is down?
Can you ship if the main registry is unreachable?
Can you ship if ArgoCD is down?
If the answer is "no", you have undocumented SPOFs.
The Rollback Rule
Every deploy must be reversible in under 2 minutes. No exceptions.
time_to_deploy: 15 minutes
time_to_rollback: 90 seconds
If your rollback takes longer than your deploy, your pipeline is backwards.
How to achieve fast rollbacks:
- Keep the previous image running in parallel during deploys
- Use traffic-shifting deploys (ALB weights, Istio)
- Label every image with the git commit
- Test the rollback path regularly; never run one for the first time during an incident (see the sketch below)
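A sketch of the simplest fast-rollback path, assuming a standard Kubernetes Deployment (names are placeholders). kubectl's revision history keeps the previous ReplicaSet around, which is what makes the 90-second target realistic.

#!/usr/bin/env bash
# Sketch: rollback via Kubernetes revision history.
# Deployment and namespace names are placeholders.
set -euo pipefail

kubectl -n prod rollout undo deployment/web     # back to the previous ReplicaSet
kubectl -n prod rollout status deployment/web --timeout=90s   # fail loudly if it stalls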
The Deploy Freeze
Some teams never deploy on Fridays. That's cargo-cult thinking.
The real rules:
- Don't deploy when the on-call person is asleep
- Don't deploy during peak traffic windows
- Don't deploy major changes during holidays
- DO deploy hotfixes anytime
If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.
A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.
Multi-Provider Strategy
Big-ticket item: run critical builds on a CI vendor different from your code host.
Code: GitHub
CI: CircleCI (not GitHub Actions)
When GitHub Actions is down (it happens a few times a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.
This doubles your CI bill but removes a major SPOF.
The "Break Glass" Deploy
Every pipeline should have an emergency bypass:
# Normal deploy (takes 15 minutes, runs all tests)
./deploy.sh
# Break-glass deploy (skips tests, full audit log, Slack alert)
./deploy.sh --break-glass --reason "Fixing P1 incident #1234"
The break-glass path:
- Requires written justification
- Skips long-running tests
- Notifies the whole team
- Writes to a permanent audit log
- Can only be used while an incident is in progress
Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.
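A hedged sketch of what the wrapper might look like; the inner run-deploy.sh, the audit-log path, and the Slack webhook are assumptions, not a prescription.

#!/usr/bin/env bash
# Sketch: break-glass wrapper. run-deploy.sh, the audit-log path, and the
# Slack webhook variable are placeholders.
set -euo pipefail

if [[ "${1:-}" == "--break-glass" ]]; then
  [[ "${2:-}" == "--reason" && -n "${3:-}" ]] || {
    echo "usage: $0 --break-glass --reason \"<justification>\"" >&2; exit 1;
  }
  reason="$3"

  # Permanent audit trail: who, when, why.
  printf '%s %s break-glass: %s\n' "$(date -u +%FT%TZ)" "$USER" "$reason" \
    >> /var/log/deploys/audit.log

  # Tell the whole team immediately.
  curl -fsS -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
    -d "{\"text\": \"break-glass deploy by $USER: $reason\"}"

  SKIP_TESTS=1 exec ./run-deploy.sh   # skip long suites, keep smoke tests
fi

exec ./run-deploy.sh   # normal path: full test suite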
The Metric That Matters Most
Mean Time to Deploy a Hotfix (MTTDHF)
From "we need to fix this" to "fix is in production" how long?
Good: under 30 minutes
Great: under 10 minutes
Unicorn: under 5 minutes
Track this. Optimize it. It's the most important reliability metric nobody talks about.
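Tracking it can start embarrassingly simple. A sketch, assuming an incident bot or on-call runbook invokes it (the state-file path is a placeholder):

#!/usr/bin/env bash
# Sketch: measure time-to-hotfix per incident. Run "start" when the fix is
# declared, "done" when it lands in production. State-file path is a placeholder.
case "${1:-}" in
  start) date +%s > /tmp/hotfix-declared ;;
  done)  start=$(cat /tmp/hotfix-declared)
         echo "time-to-hotfix: $(( ($(date +%s) - start) / 60 )) minutes" ;;
  *)     echo "usage: $0 start|done" >&2; exit 1 ;;
esac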
The Takeaway
Your pipeline is production infrastructure. Treat it with the same respect.
- Monitor it
- Back it up
- Test failure modes
- Document manual paths
- Never let it become a SPOF
When it breaks in the middle of an incident, you'll be glad you prepared.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com