The Invisible SPOF
Every engineering org has a single point of failure that nobody lists on their risk register: the deploy pipeline itself.
When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.
We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.
Categorizing the Risk
Your pipeline consists of:
source_control:          # GitHub, GitLab, Bitbucket
  failure_mode: "can't merge PRs"
ci_runners:              # GitHub Actions, CircleCI, self-hosted
  failure_mode: "builds don't run"
artifact_storage:        # ECR, Artifactory, S3
  failure_mode: "images don't build or push"
deployment_controller:   # ArgoCD, Flux, Spinnaker
  failure_mode: "deploys don't happen"
cluster_api:             # k8s API, cloud provider API
  failure_mode: "resources don't change"
Each layer is a failure domain. A serious pipeline needs fallback plans for each.
The Manual Escape Hatch
Rule #1: You must have a documented path to deploy manually.
Not for daily use; for emergencies only. Every team should know:
- How to build the image locally
- How to push to the registry
- How to update the cluster without the normal pipeline
- Who has permission to do this in production
We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.
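A minimal sketch of that manual path might look like this. The service name, registry, and namespace below are placeholders; substitute whatever your pipeline normally produces.

#!/usr/bin/env bash
# Manual-deploy sketch. Service name, registry, and namespace are placeholders.
set -euo pipefail

SERVICE=payments-api                      # placeholder service
TAG=$(git rev-parse --short HEAD)         # label the image with the commit
IMAGE="ghcr.io/yourorg/$SERVICE:$TAG"

docker build -t "$IMAGE" .                    # 1. build the image locally
docker push "$IMAGE"                          # 2. push to the registry
kubectl -n prod set image "deployment/$SERVICE" \
  "$SERVICE=$IMAGE"                           # 3. update the cluster, no pipeline
kubectl -n prod rollout status "deployment/$SERVICE" --timeout=5m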
Hardening the Pipeline Itself
1. Pin your dependencies
# BAD
uses: actions/checkout@main
# GOOD
uses: actions/checkout@v4.1.1
If actions/checkout@main breaks, your deploys break. Pin to an exact version, or better still to a full commit SHA, since tags can be moved after the fact.
2. Mirror your registries
registry:
  primary: ghcr.io/yourorg
  fallback: <aws_account_id>.dkr.ecr.<region>.amazonaws.com/yourorg
When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.
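One sketch of how to enforce that: push every production image to both registries in the same build step, so the mirror can never silently drift. The ECR account ID and region below are placeholders.

#!/usr/bin/env bash
# Sketch: publish each image to a primary and a fallback registry.
# The ECR account ID and region are placeholders.
set -euo pipefail

TAG=$(git rev-parse --short HEAD)
PRIMARY="ghcr.io/yourorg/payments-api:$TAG"
FALLBACK="123456789012.dkr.ecr.us-east-1.amazonaws.com/yourorg/payments-api:$TAG"

docker build -t "$PRIMARY" .
docker tag "$PRIMARY" "$FALLBACK"
docker push "$PRIMARY"
docker push "$FALLBACK"    # fail the build loudly if the mirror push fails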
3. Monitor the pipeline
You probably monitor your services. Do you monitor your CI?
pipeline_metrics:
  - build_success_rate     (target: >99%)
  - deploy_duration_p99    (target: <5 min)
  - time_to_rollback_p99   (target: <2 min)
  - runner_queue_depth     (target: <5)
Alert on these the same way you'd alert on a service.
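If you use GitHub Actions, one way to start is a sketch like this, which derives build_success_rate from the last 100 workflow runs via the GitHub CLI. It assumes gh is installed, authenticated, and run from inside the repo.

#!/usr/bin/env bash
# Sketch: compute build_success_rate from recent workflow runs.
set -euo pipefail

gh run list --limit 100 --json conclusion --jq '
  ([.[] | select(.conclusion == "success")] | length) as $ok
  | "build_success_rate: \(100 * $ok / length)%"'

Feed a number like this into your normal metrics pipeline and alert on it like any other SLO.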
4. Test disaster modes
Can you ship if GitHub Actions is down?
Can you ship if the main registry is unreachable?
Can you ship if ArgoCD is down?
If the answer is "no", you have undocumented SPOFs.
The Rollback Rule
Every deploy must be reversible in under 2 minutes. No exceptions.
time_to_deploy: 15 minutes
time_to_rollback: 90 seconds
If your rollback takes longer than your deploy, your pipeline is backwards.
How to achieve fast rollbacks:
- Keep the previous image running in parallel during deploys
- Use traffic-shifting deploys (ALB weights, Istio)
- Label every image with the git commit
- Test the rollback path regularly; never run one for the first time during an incident (see the sketch below)
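A sketch of the simplest fast-rollback path, assuming a standard Kubernetes Deployment (names are placeholders). kubectl's revision history keeps the previous ReplicaSet around, which is what makes the 90-second target realistic.

#!/usr/bin/env bash
# Sketch: rollback via Kubernetes revision history.
# Deployment and namespace names are placeholders.
set -euo pipefail

kubectl -n prod rollout undo deployment/web     # back to the previous ReplicaSet
kubectl -n prod rollout status deployment/web --timeout=90s   # fail loudly if it stalls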
The Deploy Freeze
Some teams never deploy on Fridays. That's cargo-cult thinking.
The real rules:
- Don't deploy when the on-call person is asleep
- Don't deploy during peak traffic windows
- Don't deploy major changes during holidays
- DO deploy hotfixes anytime
If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.
A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.
Multi-Provider Strategy
Big-ticket item: run critical builds on a CI vendor different from your code host.
Code: GitHub
CI: CircleCI (not GitHub Actions)
When GitHub Actions is down (it happens a few times a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.
This doubles your CI bill but removes a major SPOF.
The "Break Glass" Deploy
Every pipeline should have an emergency bypass:
# Normal deploy (takes 15 minutes, runs all tests)
./deploy.sh
# Break-glass deploy (skips tests, full audit log, Slack alert)
./deploy.sh --break-glass --reason "Fixing P1 incident #1234"
The break-glass path:
- Requires written justification
- Skips long-running tests
- Notifies the whole team
- Writes to a permanent audit log
- Can only be used while an incident is in progress
Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.
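A hedged sketch of what the wrapper might look like; the inner run-deploy.sh, the audit-log path, and the Slack webhook are assumptions, not a prescription.

#!/usr/bin/env bash
# Sketch: break-glass wrapper. run-deploy.sh, the audit-log path, and the
# Slack webhook variable are placeholders.
set -euo pipefail

if [[ "${1:-}" == "--break-glass" ]]; then
  [[ "${2:-}" == "--reason" && -n "${3:-}" ]] || {
    echo "usage: $0 --break-glass --reason \"<justification>\"" >&2; exit 1;
  }
  reason="$3"

  # Permanent audit trail: who, when, why.
  printf '%s %s break-glass: %s\n' "$(date -u +%FT%TZ)" "$USER" "$reason" \
    >> /var/log/deploys/audit.log

  # Tell the whole team immediately.
  curl -fsS -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
    -d "{\"text\": \"break-glass deploy by $USER: $reason\"}"

  SKIP_TESTS=1 exec ./run-deploy.sh   # skip long suites, keep smoke tests
fi

exec ./run-deploy.sh   # normal path: full test suite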
The Metric That Matters Most
Mean Time to Deploy a Hotfix (MTTDHF)
From "we need to fix this" to "fix is in production" how long?
Good: under 30 minutes
Great: under 10 minutes
Unicorn: under 5 minutes
Track this. Optimize it. It's the most important reliability metric nobody talks about.
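Tracking it can start embarrassingly simple. A sketch, assuming an incident bot or on-call runbook invokes it (the state-file path is a placeholder):

#!/usr/bin/env bash
# Sketch: measure time-to-hotfix per incident. Run "start" when the fix is
# declared, "done" when it lands in production. State-file path is a placeholder.
case "${1:-}" in
  start) date +%s > /tmp/hotfix-declared ;;
  done)  start=$(cat /tmp/hotfix-declared)
         echo "time-to-hotfix: $(( ($(date +%s) - start) / 60 )) minutes" ;;
  *)     echo "usage: $0 start|done" >&2; exit 1 ;;
esac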
The Takeaway
Your pipeline is production infrastructure. Treat it with the same respect.
- Monitor it
- Back it up
- Test failure modes
- Document manual paths
- Never let it become a SPOF
When it breaks in the middle of an incident, you'll be glad you prepared.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com