Samson Tanimawo

CI/CD Reliability: When Your Deploy Pipeline is Your SPOF

The Invisible SPOF

Every engineering org has a single point of failure that nobody lists on their risk registry: the deploy pipeline itself.

When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.

We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.

Categorizing the Risk

Your pipeline consists of:

source_control:          # GitHub, GitLab, Bitbucket
  failure_mode: "can't merge PRs"

ci_runners:              # GitHub Actions, CircleCI, self-hosted
  failure_mode: "builds don't run"

artifact_storage:        # ECR, Artifactory, S3
  failure_mode: "images don't build or push"

deployment_controller:   # ArgoCD, Flux, Spinnaker
  failure_mode: "deploys don't happen"

cluster_api:             # k8s API, cloud provider API
  failure_mode: "resources don't change"

Each layer is a failure domain. A serious pipeline needs fallback plans for each.

The Manual Escape Hatch

Rule #1: You must have a documented path to deploy manually.

Not for daily use, but for emergencies. Every team should know:

  1. How to build the image locally
  2. How to push to the registry
  3. How to update the cluster without the normal pipeline
  4. Who has permission to do this in production

We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.
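The four steps above can be written down as a single script before you ever need them. This is a sketch, not a real pipeline: the service name, registry, and deployment names are invented placeholders, and DRY_RUN=1 (the default) only prints the commands so the quarterly drill can be rehearsed without touching production.

```shell
# manual-deploy.sh -- documented manual escape hatch (sketch).
# All names (myservice, ghcr.io/yourorg, deploy/myservice) are placeholders.
set -eu

SERVICE="${SERVICE:-myservice}"
REGISTRY="${REGISTRY:-ghcr.io/yourorg}"
TAG="$(git rev-parse --short HEAD 2>/dev/null || echo manual)"
DRY_RUN="${DRY_RUN:-1}"

# With DRY_RUN=1 every command is printed instead of executed.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "DRY-RUN: $*"; else "$@"; fi
}

# 1. Build the image locally
run docker build -t "$REGISTRY/$SERVICE:$TAG" .
# 2. Push to the registry
run docker push "$REGISTRY/$SERVICE:$TAG"
# 3. Update the cluster without the normal pipeline
run kubectl set image "deploy/$SERVICE" "$SERVICE=$REGISTRY/$SERVICE:$TAG"
run kubectl rollout status "deploy/$SERVICE" --timeout=120s
```

Step 4 (who has permission) is an access-control question, not a script, but the script should name the break-glass role in a comment so nobody has to look it up mid-incident.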

Hardening the Pipeline Itself

1. Pin your dependencies

# BAD
uses: actions/checkout@main

# GOOD
uses: actions/checkout@v4.1.1

If actions/checkout@main breaks or ships a bad change, your deploys break. Pin to a released version, or stricter still, to a full commit SHA, since tags can be moved.

2. Mirror your artifact registries

registry:
  primary: ghcr.io/yourorg
  fallback: ecr.amazonaws.com/yourorg

When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.
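A minimal sketch of the dual push, using the two registry names from the config above. The image name is a placeholder, the fallback push is best-effort (a mirror outage shouldn't fail the build), and DRY_RUN=1 keeps the sketch runnable without Docker.

```shell
# Push every production image to two registries (sketch; names are
# placeholders from the example config above).
set -eu

IMAGE="${IMAGE:-myservice:abc123}"   # hypothetical image:tag
PRIMARY="${PRIMARY:-ghcr.io/yourorg}"
FALLBACK="${FALLBACK:-ecr.amazonaws.com/yourorg}"
DRY_RUN="${DRY_RUN:-1}"

push_to() {
  registry="$1"
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: docker tag $IMAGE $registry/$IMAGE && docker push $registry/$IMAGE"
  else
    docker tag "$IMAGE" "$registry/$IMAGE" && docker push "$registry/$IMAGE"
  fi
}

push_to "$PRIMARY"
# Mirror push is best-effort: warn, don't fail the build.
push_to "$FALLBACK" || echo "WARN: mirror push to $FALLBACK failed"
```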

3. Monitor the pipeline

You probably monitor your services. Do you monitor your CI?

pipeline_metrics:
  - build_success_rate     # target: >99%
  - deploy_duration_p99    # target: <5 min
  - time_to_rollback_p99   # target: <2 min
  - runner_queue_depth     # target: <5

Alert on these the same way you'd alert on a service.
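If Prometheus is your alerting stack, the targets above translate directly into alert rules. The metric names here are assumptions, not a standard; substitute whatever your CI exporter actually emits.

```yaml
# Prometheus alerting rules for the pipeline itself (sketch).
# Metric names (ci_build_success_rate, ci_runner_queue_depth) are
# hypothetical -- use the names your exporter actually provides.
groups:
  - name: pipeline-reliability
    rules:
      - alert: CIBuildSuccessRateLow
        expr: ci_build_success_rate < 0.99
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "CI build success rate below 99%"
      - alert: CIRunnerQueueBackedUp
        expr: ci_runner_queue_depth > 5
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "Runner queue depth above 5; builds are waiting"
```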

4. Test disaster modes

Can you ship if GitHub Actions is down?
Can you ship if the main registry is unreachable?
Can you ship if ArgoCD is down?

If the answer is "no", you have undocumented SPOFs.

The Rollback Rule

Every deploy must be reversible in under 2 minutes. No exceptions.

time_to_deploy: 15 minutes
time_to_rollback: 90 seconds

If your rollback takes longer than your deploy, your pipeline is backwards.

How to achieve fast rollbacks:

  • Keep the previous image running in parallel during deploys
  • Use traffic-shifting deploys (ALB weights, Istio)
  • Label every image with the git commit
  • Never deploy untested rollback paths
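Assuming Kubernetes (as the ArgoCD/Flux references suggest), the fast rollback path is an image swap rather than a rebuild. A dry-run sketch with placeholder names:

```shell
# Fast rollback sketch: because every image is labeled with its git
# commit and the previous ReplicaSet is kept, rollback is one image
# swap, not a rebuild. Names are placeholders; DRY_RUN=1 only prints.
set -eu

SERVICE="${SERVICE:-myservice}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "DRY-RUN: $*"; else "$@"; fi
}

# Revert to the previous known-good revision...
run kubectl rollout undo "deploy/$SERVICE"
# ...and enforce the 2-minute budget on the rollback itself.
run kubectl rollout status "deploy/$SERVICE" --timeout=90s
```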

The Deploy Freeze

Some teams never deploy on Fridays. This is cargo culting.

The real rules:

  • Don't deploy when the on-call person is asleep
  • Don't deploy during peak traffic windows
  • Don't deploy major changes during holidays
  • DO deploy hotfixes anytime

If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.

A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.

Multi-Provider Strategy

The big-ticket item: run critical builds on a CI vendor different from your code host.

Code: GitHub
CI: CircleCI (not GitHub Actions)

When GitHub Actions is down (incidents happen a few times a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.

This doubles your CI bill but removes a major SPOF.

The "Break Glass" Deploy

Every pipeline should have an emergency bypass:

# Normal deploy (takes 15 minutes, runs all tests)
./deploy.sh

# Break-glass deploy (skips tests, full audit log, Slack alert)
./deploy.sh --break-glass --reason "Fixing P1 incident #1234"

The break-glass path:

  • Requires written justification
  • Skips long-running tests
  • Notifies the whole team
  • Writes to a permanent audit log
  • Can only be used while an incident is in progress

Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.
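A sketch of what the guard inside deploy.sh might look like. The flag names match the example above; the audit-log path and the notification hook are hypothetical.

```shell
# Break-glass guard inside deploy.sh (sketch). Flag names match the
# usage example; AUDIT_LOG path and notify hook are hypothetical.
set -eu

BREAK_GLASS=0
REASON=""
AUDIT_LOG="${AUDIT_LOG:-/tmp/deploy-audit.log}"

parse_args() {
  BREAK_GLASS=0; REASON=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --break-glass) BREAK_GLASS=1 ;;
      --reason) REASON="$2"; shift ;;
    esac
    shift
  done
  # Break-glass without a written justification is refused outright.
  if [ "$BREAK_GLASS" = "1" ] && [ -z "$REASON" ]; then
    echo "ERROR: --break-glass requires --reason" >&2
    return 1
  fi
}

deploy() {
  if [ "$BREAK_GLASS" = "1" ]; then
    # Justification goes to a permanent audit log before anything runs.
    echo "$(date -u +%FT%TZ) break-glass by ${USER:-unknown}: $REASON" >> "$AUDIT_LOG"
    echo "SKIPPING long-running tests (break-glass)"
    # notify_team "$REASON"   # hypothetical Slack/pager hook
  else
    echo "Running full test suite before deploy"
  fi
}
```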

The Metric That Matters Most

Mean Time to Deploy a Hotfix (MTTDHF)

From "we need to fix this" to "fix is in production": how long?

Good: under 30 minutes
Great: under 10 minutes
Unicorn: under 5 minutes

Track this. Optimize it. It's the most important reliability metric nobody talks about.
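One way to track it: record a "fix needed" and "fix live" timestamp for each hotfix and average the gap. A sketch with made-up timestamps (assumes GNU date for -d parsing; in practice the timestamps come from your incident tracker and deploy audit log):

```shell
# MTTDHF sketch: mean minutes from "we need a fix" to "fix is live".
# Requires GNU date (-d). Timestamps below are illustrative only.
set -eu

mttdhf_minutes() {
  # Input lines on stdin: "<detected_iso> <deployed_iso>", one per hotfix.
  total=0; count=0
  while read -r detected deployed; do
    start=$(date -u -d "$detected" +%s)
    end=$(date -u -d "$deployed" +%s)
    total=$(( total + (end - start) / 60 ))
    count=$(( count + 1 ))
  done
  echo $(( total / count ))
}

# Two sample hotfixes (12 min and 26 min); prints the mean: 19.
printf '%s\n' \
  "2024-03-01T10:00:00Z 2024-03-01T10:12:00Z" \
  "2024-03-09T22:05:00Z 2024-03-09T22:31:00Z" \
  | mttdhf_minutes
```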

The Takeaway

Your pipeline is production infrastructure. Treat it with the same respect.

  • Monitor it
  • Back it up
  • Test failure modes
  • Document manual paths
  • Never let it become a SPOF

When it breaks during an incident, you'll be very glad you did.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
