A team sets a 99.99% availability target for their checkout service. It's ambitious but achievable: they've done the work, invested in redundancy, and their metrics look solid.
Six months later, they're missing their target every single month. The postmortem reveals the problem: their critical path flows through three upstream services. The authentication service promises 99.9%. The payment gateway promises 99.95%. The inventory service promises 99.9%.
The math is straightforward: 0.999 × 0.9995 × 0.999 = 0.9975. Their theoretical ceiling is 99.75%, not 99.99%. The target for the checkout service was impossible from day one.
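The arithmetic is worth making concrete. A serial dependency chain's availability ceiling is just the product of each dependency's availability; a minimal sketch, using the figures from the example above:

```python
# Availability ceiling for a service whose critical path crosses
# several upstream services in series: multiply their targets.
def serial_availability(*availabilities: float) -> float:
    ceiling = 1.0
    for a in availabilities:
        ceiling *= a
    return ceiling

# auth (99.9%) x payment gateway (99.95%) x inventory (99.9%)
ceiling = serial_availability(0.999, 0.9995, 0.999)
print(f"{ceiling:.4f}")  # ~0.9975, well below the 0.9999 target
```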
Nobody caught this because there's no standard way to express it. SLOs are set per-service, in isolation. Dependency information lives in architecture diagrams that nobody updates, service catalogs that are perpetually stale, and the heads of engineers who've since left the company. Nobody owns the cross-service math.
This is one of the things I've been building toward with NthLayer, and it's why I've developed OpenSRM: an open specification for declaring service reliability requirements as code.
The Problem: Reliability Is Bespoke and Ungoverned
I've seen teams running identical technology stacks (the same Kubernetes clusters, the same Prometheus instances, the same on-call rotations) with completely different SLOs, metrics, and alerting strategies. Not because one approach was better than another, but because nobody told them what to implement.
Services regularly move into production with SLOs created months later, or never. Dashboards are missing, insufficient, or inconsistent. 'Looks fine to me' during PR reviews. Tribal knowledge. Varying levels of understanding across teams.
We have version control for code. We have version control for infrastructure. Security has transformed with shift-left practices, finding vulnerabilities as code is written rather than after deployment. But reliability? Still fundamentally bespoke and ungoverned.
What OpenSRM Looks Like
OpenSRM (Open Service Reliability Manifest) is a declarative YAML specification for service reliability requirements. A manifest defines what 'production-ready' means for a service: its SLO targets, ownership, dependencies, and the contracts it makes with other services.
Here's a basic manifest:
```yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: checkout-service
  tier: critical
  template: api-critical
spec:
  type: api
  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 200ms
      target: 0.995
      window: 30d
  ownership:
    team: platform-checkout
    slack: "#checkout-oncall"
    pagerduty: CHECKOUT_CRITICAL
    runbook: https://wiki.example.com/runbooks/checkout
```
The syntax is deliberately boring. The value isn't in clever YAML; it's in having a standard format that tooling can validate, generate from, and enforce.
Contracts: Separating Internal Targets from External Promises
One of the toughest problems in cross-team reliability is the gap between what a service measures internally and what it promises to others. Your payment service might target 99.995% availability internally (i.e. what you alert on), but only promise 99.99% externally (i.e. what dependent teams can rely on). OpenSRM makes this explicit with contracts:
```yaml
spec:
  contract:              # What I promise to dependents
    availability: 0.9999
    latency:
      p99: 300ms
  slos:                  # Internal targets (tighter)
    availability:
      target: 0.99995    # Buffer above contract
    latency:
      p99: 200ms         # Headroom below contract
```
This separation eliminates a category of cross-team arguments. Your internal SLOs are your business. Your contract is what others can depend on. The specification makes the boundary explicit.
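The boundary is also mechanically checkable: internal targets must be at least as tight as the contract. A minimal sketch of such a check; the dict shape mirrors the manifest above but is a hypothetical parsed form, not NthLayer's actual internal model:

```python
# Sketch: verify internal SLO targets leave headroom over the external
# contract. Availability only, for brevity; latency works the same way.
def check_contract_buffer(spec: dict) -> list[str]:
    errors = []
    contract = spec.get("contract", {})
    slos = spec.get("slos", {})
    if "availability" in contract:
        internal = slos.get("availability", {}).get("target", 0.0)
        if internal < contract["availability"]:
            errors.append(
                f"internal availability target {internal} is looser "
                f"than the contract's {contract['availability']}"
            )
    return errors

spec = {
    "contract": {"availability": 0.9999},
    "slos": {"availability": {"target": 0.99995}},
}
assert check_contract_buffer(spec) == []  # 0.99995 buffers 0.9999
```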
Dependencies: The Math Nobody Does
Returning to the checkout service example: with OpenSRM, dependency expectations become declarative:
```yaml
spec:
  dependencies:
    - service: auth-service
      critical: true
      expects:
        availability: 0.999
        latency:
          p99: 100ms
    - service: payment-gateway
      critical: true
      expects:
        availability: 0.9995
        latency:
          p99: 200ms
    - service: inventory-service
      critical: true
      expects:
        availability: 0.999
    - service: recommendation-engine
      critical: false  # Can degrade gracefully
```
Tooling can now do what humans consistently fail to do: validate that your targets are achievable given your dependency chain. If you promise 99.99% but your critical dependencies can only deliver 99.75% combined, the validation fails before you've made a promise you can't keep.
This also creates an objective basis for architectural decisions. 'We need to improve auth-service reliability before checkout can hit its target' becomes a provable statement, not an opinion.
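The validation itself is a few lines once the dependency list is machine-readable. A sketch, with the list of dicts mirroring the `dependencies` block above (an illustration, not NthLayer's actual validator):

```python
# Sketch: the availability ceiling implied by critical dependencies.
# Non-critical dependencies are excluded: the service is declared able
# to degrade gracefully without them.
def availability_ceiling(dependencies: list[dict]) -> float:
    ceiling = 1.0
    for dep in dependencies:
        if dep.get("critical"):
            ceiling *= dep["expects"]["availability"]
    return ceiling

deps = [
    {"service": "auth-service", "critical": True,
     "expects": {"availability": 0.999}},
    {"service": "payment-gateway", "critical": True,
     "expects": {"availability": 0.9995}},
    {"service": "inventory-service", "critical": True,
     "expects": {"availability": 0.999}},
    {"service": "recommendation-engine", "critical": False},
]
assert availability_ceiling(deps) < 0.9999  # 99.99% is unreachable
```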
Templates: Consistency at Scale
If you have 200 services, you don't want 200 bespoke reliability definitions. OpenSRM supports templates that establish organisational defaults:
```yaml
apiVersion: opensrm.io/v1
kind: Template
metadata:
  name: api-critical
spec:
  type: api
  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 300ms
      target: 0.995
  ownership:
    oncall_required: true
```
Services inherit from templates and override only what's different:
```yaml
metadata:
  name: checkout-service
  template: api-critical
spec:
  slos:
    latency:
      p99: 200ms  # Tighter than template default
```
This is how you get consistency without rigidity. Platform teams define the standards; service teams customise where needed.
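One way to read the inheritance model is a shallow merge per SLO: an overridden SLO replaces the template's version as a unit, while untouched SLOs are inherited. A sketch under that assumption (this is one possible interpretation, not NthLayer's code):

```python
# Sketch: resolve a service spec against its template. The service's
# `latency` block replaces the template's wholesale; `availability`
# is inherited untouched.
def resolve(template: dict, service: dict) -> dict:
    resolved = dict(template)
    for key, value in service.items():
        if key == "slos":
            slos = dict(template.get("slos", {}))
            slos.update(value)  # each overridden SLO replaces as a unit
            resolved["slos"] = slos
        else:
            resolved[key] = value
    return resolved

template = {
    "type": "api",
    "slos": {"availability": {"target": 0.9999, "window": "30d"},
             "latency": {"p99": "300ms", "target": 0.995}},
}
service = {"slos": {"latency": {"p99": "200ms", "target": 0.995}}}
merged = resolve(template, service)
# availability inherited from the template, latency overridden
```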
AI Gates: A New Kind of Service, A New Kind of SLO
AI systems are increasingly deployed as 'gates' in production workflows: code-review bots that approve or reject PRs, content moderators that publish or flag content, fraud detectors that allow or block transactions.
These systems can be available, fast, and return valid responses while consistently making terrible decisions. Traditional SLOs measure the system, not the judgment.
OpenSRM introduces a new service type and new SLO categories for this:
```yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: code-review-bot
spec:
  type: ai-gate
  slos:
    availability:
      target: 0.999
      window: 30d
    latency:
      p99: 45s
      target: 0.99
    judgment:  # New category
      reversal_rate:
        target: 0.05  # ≤5% of decisions overridden by humans
        window: 30d
        observation_period: 24h
      high_confidence_failure:
        target: 0.02  # ≤2% confident-and-wrong
        window: 30d
        confidence_threshold: 0.9
```
Reversal rate (i.e. how often humans override the AI's decision) is the key metric here. It requires no ground-truth labelling, no ML pipeline, no delayed evaluation. You just track 'AI said approve, human said reject.' This is measurable in production today.
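Computing it is as simple as the description suggests. A sketch over a hypothetical decision log (the record shape is illustrative, not part of the spec):

```python
# Sketch: reversal rate = share of human-reviewed AI decisions where
# the human overrode the AI. Unreviewed decisions are excluded.
def reversal_rate(decisions: list[dict]) -> float:
    reviewed = [d for d in decisions if d.get("human_decision") is not None]
    if not reviewed:
        return 0.0
    reversals = sum(
        1 for d in reviewed if d["ai_decision"] != d["human_decision"]
    )
    return reversals / len(reviewed)

log = [
    {"ai_decision": "approve", "human_decision": "approve"},
    {"ai_decision": "approve", "human_decision": "reject"},  # reversal
    {"ai_decision": "reject",  "human_decision": None},      # unreviewed
    {"ai_decision": "reject",  "human_decision": "reject"},
]
# 1 reversal out of 3 reviewed decisions
```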
I'll be writing more about judgment SLOs in a follow-up, but the key insight is this: as AI systems take on more consequential decisions, we need SLOs that measure decision quality, not just system health.
NthLayer: The Reference Implementation
OpenSRM is a specification. NthLayer is a tool that implements it.
Given an OpenSRM manifest, NthLayer can:
- Validate the manifest against the schema
- Validate against declared dependencies
- Generate Prometheus alerting rules
- Generate Grafana dashboards
- Generate OpenSLO specifications
- Verify that declared metrics actually exist in your monitoring stack
- Gate deployments based on error-budget status
```bash
$ nthlayer validate service.reliability.yaml
✓ Schema valid
✓ Dependencies resolvable
✗ Target 99.99% unachievable (99.75% ceiling from dependencies)

$ nthlayer apply service.reliability.yaml
Generated: prometheus-rules.yaml (12 rules)
Generated: grafana-dashboard.json
Generated: openslo-spec.yaml

$ nthlayer check-deploy --service checkout-service
✓ Deployment allowed
  - availability: 99.97% (target: 99.99%)
  - error budget remaining: 4.2 hours
  - no blocking incidents
```
The goal is that 'is this service ready for production?' becomes a deterministic question with a checkable answer, not a subjective judgment call in a PR review.
Why Now?
SLOs have had nearly a decade to mature since the Google SRE handbook popularised them. GitOps has normalised declarative configuration. Platform engineering has emerged as a discipline. The concepts are ready, but the tooling has lagged behind.
Meanwhile, AI systems are now being deployed into production faster than our reliability practices can adapt. We're still measuring AI services the same way we measure CRUD APIs, even though the failure modes are fundamentally different.
OpenSRM is my attempt to codify what I've learned about reliability engineering into something others can use, extend, and contribute to. The specification is open. The reference implementation is open source. The goal is a standard that makes reliability engineering more consistent, more automated, and more adapted to the systems we're actually building.
The OpenSRM specification is available at github.com/rsionnach/opensrm.
NthLayer, the reference implementation, is at github.com/rsionnach/nthlayer.
I'm particularly interested in feedback on:
- **The judgment SLO model:** is decision reversal rate the right primary metric for AI gate quality, and what other signals matter?
- **Dependency validation:** how should tooling handle partial dependency information, given that not everyone will have a complete service graph on day one?
- **Template inheritance:** is shallow merge the right model, or do teams need more sophisticated inheritance?
Star the repos, open issues, or tell me where I'm wrong. Reliability shouldn't be something we figure out in postmortems.
Rob Fox is a Senior Site Reliability Engineer building open-source reliability tooling. Previously: Shift-Left Reliability.