Shift-Left Reliability

We've become exceptionally good at incident response. Modern teams restore service quickly, run thoughtful postmortems, and hold themselves accountable through corrective actions.

And yet…

A team ships a change that passes every test, gets all the required approvals, and still brings down checkout for 47 minutes. The postmortem conclusion? "We should have known our latency SLO was already at 94% before deploying."

Many postmortems point to the same root cause: changes we introduced ourselves. Not hardware failures. Not random outages. Just software behaving exactly as we told it to.

We continue to treat reliability as something to evaluate once those changes are already live. This isn't a failure of tooling or process. It's a question of when we decide whether a system is ready.


The paradox

We've invested heavily in observing and responding to failure - better alerting, faster incident response, thorough postmortems. Teams care deeply about reliability and spend significant time optimizing how they respond to incidents.

But when in a service's lifecycle are teams supposed to define reliability? Where is the equivalent investment before deployment?

Where reliability decisions actually happen today

I've seen multiple teams running identical technology stacks with completely different SLOs, metrics, and alerts. Nobody told them what to implement, what best practice looks like, or how to tune their alerts. They want to be good reliability citizens, but getting from the theory in the handbook to practice is not straightforward.

Services regularly move into production with SLOs being created months later - or never. Dashboards are missing, insufficient, or inconsistent. "Looks fine to me" during PR reviews. Tribal knowledge. Varying levels of understanding across teams.

Reliability is fundamentally bespoke and ungoverned. That's the core issue.

The missing layer

GitHub gave us version control for code. Terraform gave us version control for infrastructure. Security has transformed with shift-left - finding flaws as code is written, not after deployment.

We're still missing version control for reliability.

We need a specification that defines requirements, validates them against reality, and generates the artifacts: dashboards, SLOs, alerts, escalation policies. Once the specification is validated and the artifacts are created, the same tool can check in real time whether a service is in breach - and block high-risk deployments in CI/CD.

What shift-left reliability actually means

Shift-left reliability doesn't mean more alerts and dashboards, more postmortems or more people in the room.
It means:

  1. Spec - Define reliability requirements as code before production deployment
  2. Validate - Test those requirements against reality
  3. Enforce - Gate deployments through CI/CD

Engineers don't write PromQL or Grafana JSON - they declare intent, and reliability becomes deterministic. Outcomes are predictable, consistent, and transparent, and they follow best practice.

An executable reliability contract

Keep it simple. A team creates a service.yaml file with their reliability intent:

name: payment-api
tier: critical
type: api
team: payments
dependencies:
 - postgresql
 - redis

Here is a complete service.yaml example.

Tooling validates metrics, SLOs, and error budgets, then generates the artifacts automatically. This is the approach I am exploring with an open-source project called NthLayer.
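
To make "generates the artifacts" concrete, here is a hedged sketch of the kind of Prometheus alerting rule such tooling could emit for a 99.95% availability target on payment-api. The metric name (http_requests_total), the windows, and the rule layout are illustrative assumptions on my part, not NthLayer's actual output format.

# illustrative generated artifact - not NthLayer's actual output
groups:
  - name: payment-api-availability
    rules:
      - alert: PaymentApiAvailabilitySLOBreach
        # fire when the 5xx ratio exceeds the 0.05% error ratio allowed by a 99.95% target
        expr: |
          sum(rate(http_requests_total{service="payment-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="payment-api"}[5m])) > 0.0005
        for: 10m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "payment-api is breaching its 99.95% availability SLO"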

NthLayer runs in any CI/CD pipeline - GitHub Actions, ArgoCD, Jenkins, Tekton, GitLab CI. The goal isn't to be an inflexible blocker; it's visible risk and explicit decisions. Overrides are fine when they're intentional, logged, and owned.
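
What "intentional, logged, and owned" could look like is still an open question, but one hypothetical shape is an expiring override record committed alongside the spec and reviewed like any other change. To be clear, this is a sketch of the idea, not an existing NthLayer feature:

# hypothetical override record - none of these fields are NthLayer's
override:
  service: payment-api
  reason: "hotfix for checkout outage; shipping with error budget exhausted"
  approved_by: payments-oncall-lead
  expires: "2025-02-01T00:00:00Z"   # the gate re-engages automatically after this date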

When a deployment is attempted, the specification is evaluated against reality:

$ nthlayer check-deploy --service payment-api
 ERROR: Deployment blocked
 - availability SLO at 99.2% (target: 99.95%)
 - error budget exhausted: -47 minutes remaining
 - 3 P1 incidents in last 7 days

Exit code: 2 (BLOCKED)
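In a pipeline, that non-zero exit code is what does the gating. A minimal GitHub Actions sketch - the job names and the deploy step are placeholders, and it assumes NthLayer is installed and configured for the repository:

# .github/workflows/deploy.yml - illustrative only
name: deploy
on:
  push:
    branches: [main]

jobs:
  reliability-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install nthlayer
      # a non-zero exit (2 = BLOCKED) fails this job and stops the deploy job below
      - run: nthlayer check-deploy --service payment-api

  deploy:
    needs: reliability-gate
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy payment-api here"   # placeholder for the real deploy step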

Why now?

SLOs have had 8+ years to mature and move from the Google SRE Handbook into mainstream practice. GitOps has normalized declarative configuration. Platform Engineering has matured as a discipline. The concepts are ready but the tooling has lagged behind.

This is a deliberate shift in approach. Reliability is no longer up for debate during incidents. Services have defined owners with deterministic standards. We can stop reinventing the reliability wheel every time a new service is onboarded. If requirements change, update the service.yaml, run NthLayer, and every service picks up the new standard.
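
As a small, hypothetical example: declaring a new dependency in service.yaml and rerunning NthLayer would, under this model, regenerate the related dashboards and alerts rather than someone hand-building them.

name: payment-api
tier: critical
type: api
team: payments
dependencies:
 - postgresql
 - redis
 - kafka   # hypothetical new dependency - rerun NthLayer to regenerate its artifacts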

What this does not replace

NthLayer doesn't replace service catalogs, developer portals, observability platforms, or incident management. It doesn't predict failures or eliminate human judgment. It's upstream of all these systems.

The goal: a reliability specification, automated deployment gates, and lower cognitive load for implementing best practices.

Open questions

I don't have all the answers, but two questions I keep returning to are:

Contract Drift: What happens when the spec says 99.95% but reality has been 99.5% for months? Is the contract wrong, or is the service broken?
Emergency Overrides: How should they work? Who approves? How do you prevent them from becoming the default?

The timing problem

Where do reliability decisions actually happen in your organization? What would it look like to decide readiness before deployment? What reliability rules do you wish you could enforce automatically?

The timing problem isn't going away. The only question is whether you address it before deployment - or learn about it in the postmortem.


NthLayer is open source and looking for early adopters. If you're tired of reliability being an afterthought:

pip install nthlayer
nthlayer init
nthlayer check-deploy --service your-service

github.com/rsionnach/nthlayer

Star the repo, open an issue, or tell me I'm wrong. I want to hear how reliability decisions happen in your organization.


Rob Fox is a Senior Site Reliability Engineer focused on platform and reliability tooling. He's exploring how reliability engineering can move earlier in the software delivery lifecycle. Find him on GitHub.
