A team sets a 99.99% availability target for their checkout service. It's ambitious but achievable: they've done the work, invested in redundancy, and their metrics look solid.
Six months later, they're missing their target every single month. The postmortem reveals the problem: their critical path flows through three upstream services. The authentication service promises 99.9%. The payment gateway promises 99.95%. The inventory service promises 99.9%.
The math is straightforward: 0.999 × 0.9995 × 0.999 = 0.9975. Their theoretical ceiling is 99.75%, not 99.99%. The target for the checkout service was impossible from day one.
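The arithmetic is worth making concrete. A serial dependency chain's availability ceiling is just the product of each dependency's availability; a minimal sketch, using the figures from the example above:

```python
# Availability ceiling for a service whose critical path crosses
# several upstream services in series: multiply their targets.
def serial_availability(*availabilities: float) -> float:
    ceiling = 1.0
    for a in availabilities:
        ceiling *= a
    return ceiling

# auth (99.9%) x payment gateway (99.95%) x inventory (99.9%)
ceiling = serial_availability(0.999, 0.9995, 0.999)
print(f"{ceiling:.4f}")  # ~0.9975, well below the 0.9999 target
```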
Nobody caught this because there's no standard way to express it. SLOs are set per-service, in isolation. Dependency information lives in architecture diagrams that nobody updates, service catalogs that are perpetually stale, and the heads of engineers who've since left the company. Nobody owns the cross-service math.
This is one of the things I've been building toward with NthLayer, and it's why I've developed OpenSRM: an open specification for declaring service reliability requirements as code.
The Problem: Reliability Is Bespoke and Ungoverned
I've seen teams running identical technology stacks (the same Kubernetes clusters, the same Prometheus instances, the same on-call rotations) with completely different SLOs, metrics, and alerting strategies. Not because one approach was better than another, but because nobody told them what to implement.
Services regularly move into production with SLOs created months later, or never. Dashboards are missing, insufficient, or inconsistent. 'Looks fine to me' during PR reviews. Tribal knowledge. Varying levels of understanding across teams.
We have version control for code. We have version control for infrastructure. Security has transformed with shift-left practices, finding vulnerabilities as code is written rather than after deployment. But reliability? Still fundamentally bespoke and ungoverned.
What OpenSRM Looks Like
OpenSRM (Open Service Reliability Manifest) is a declarative YAML specification for service reliability requirements. A manifest defines what 'production-ready' means for a service: its SLO targets, ownership, dependencies, and the contracts it makes with other services.
Here's a basic manifest:
```yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: checkout-service
  tier: critical
  template: api-critical
spec:
  type: api
  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 200ms
      target: 0.995
      window: 30d
  ownership:
    team: platform-checkout
    slack: "#checkout-oncall"
    pagerduty: CHECKOUT_CRITICAL
    runbook: https://wiki.example.com/runbooks/checkout
```
The syntax is deliberately boring. The value isn't in clever YAML; it's in having a standard format that tooling can validate, generate from, and enforce.
Contracts: Separating Internal Targets from External Promises
One of the toughest problems in cross-team reliability is the gap between what a service measures internally and what it promises to others. Your payment service might target 99.995% availability internally (i.e. what you alert on), but only promise 99.99% externally (i.e. what dependent teams can rely on). OpenSRM makes this explicit with contracts:
```yaml
spec:
  contract:              # What I promise to dependents
    availability: 0.9999
    latency:
      p99: 300ms
  slos:                  # Internal targets (tighter)
    availability:
      target: 0.99995    # Buffer above contract
    latency:
      p99: 200ms         # Headroom below contract
```
This separation eliminates a category of cross-team arguments. Your internal SLOs are your business. Your contract is what others can depend on. The specification makes the boundary explicit.
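The boundary is also mechanically checkable: internal targets must be at least as tight as the contract. A minimal sketch of such a check; the dict shape mirrors the manifest above but is a hypothetical parsed form, not NthLayer's actual internal model:

```python
# Sketch: verify internal SLO targets leave headroom over the external
# contract. Availability only, for brevity; latency works the same way.
def check_contract_buffer(spec: dict) -> list[str]:
    errors = []
    contract = spec.get("contract", {})
    slos = spec.get("slos", {})
    if "availability" in contract:
        internal = slos.get("availability", {}).get("target", 0.0)
        if internal < contract["availability"]:
            errors.append(
                f"internal availability target {internal} is looser "
                f"than the contract's {contract['availability']}"
            )
    return errors

spec = {
    "contract": {"availability": 0.9999},
    "slos": {"availability": {"target": 0.99995}},
}
assert check_contract_buffer(spec) == []  # 0.99995 buffers 0.9999
```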
Dependencies: The Math Nobody Does
Returning to the checkout service example: with OpenSRM, dependency expectations become declarative:
```yaml
spec:
  dependencies:
    - service: auth-service
      critical: true
      expects:
        availability: 0.999
        latency:
          p99: 100ms
    - service: payment-gateway
      critical: true
      expects:
        availability: 0.9995
        latency:
          p99: 200ms
    - service: inventory-service
      critical: true
      expects:
        availability: 0.999
    - service: recommendation-engine
      critical: false  # Can degrade gracefully
```
Tooling can now do what humans consistently fail to do: validate that your targets are achievable given your dependency chain. If you promise 99.99% but your critical dependencies can only deliver 99.75% combined, the validation fails before you've made a promise you can't keep.
This also creates an objective basis for architectural decisions. 'We need to improve auth-service reliability before checkout can hit its target' becomes a provable statement, not an opinion.
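The validation itself is a few lines once the dependency list is machine-readable. A sketch, with the list of dicts mirroring the `dependencies` block above (an illustration, not NthLayer's actual validator):

```python
# Sketch: the availability ceiling implied by critical dependencies.
# Non-critical dependencies are excluded: the service is declared able
# to degrade gracefully without them.
def availability_ceiling(dependencies: list[dict]) -> float:
    ceiling = 1.0
    for dep in dependencies:
        if dep.get("critical"):
            ceiling *= dep["expects"]["availability"]
    return ceiling

deps = [
    {"service": "auth-service", "critical": True,
     "expects": {"availability": 0.999}},
    {"service": "payment-gateway", "critical": True,
     "expects": {"availability": 0.9995}},
    {"service": "inventory-service", "critical": True,
     "expects": {"availability": 0.999}},
    {"service": "recommendation-engine", "critical": False},
]
assert availability_ceiling(deps) < 0.9999  # 99.99% is unreachable
```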
Templates: Consistency at Scale
If you have 200 services, you don't want 200 bespoke reliability definitions. OpenSRM supports templates that establish organisational defaults:
```yaml
apiVersion: opensrm.io/v1
kind: Template
metadata:
  name: api-critical
spec:
  type: api
  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 300ms
      target: 0.995
  ownership:
    oncall_required: true
```
Services inherit from templates and override only what's different:
```yaml
metadata:
  name: checkout-service
  template: api-critical
spec:
  slos:
    latency:
      p99: 200ms  # Tighter than template default
```
This is how you get consistency without rigidity. Platform teams define the standards; service teams customise where needed.
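One way to read the inheritance model is a shallow merge per SLO: an overridden SLO replaces the template's version as a unit, while untouched SLOs are inherited. A sketch under that assumption (this is one possible interpretation, not NthLayer's code):

```python
# Sketch: resolve a service spec against its template. The service's
# `latency` block replaces the template's wholesale; `availability`
# is inherited untouched.
def resolve(template: dict, service: dict) -> dict:
    resolved = dict(template)
    for key, value in service.items():
        if key == "slos":
            slos = dict(template.get("slos", {}))
            slos.update(value)  # each overridden SLO replaces as a unit
            resolved["slos"] = slos
        else:
            resolved[key] = value
    return resolved

template = {
    "type": "api",
    "slos": {"availability": {"target": 0.9999, "window": "30d"},
             "latency": {"p99": "300ms", "target": 0.995}},
}
service = {"slos": {"latency": {"p99": "200ms", "target": 0.995}}}
merged = resolve(template, service)
# availability inherited from the template, latency overridden
```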
AI Gates: A New Kind of Service, A New Kind of SLO
AI systems are increasingly deployed as 'gates' in production workflows: code-review bots that approve or reject PRs, content moderators that publish or flag content, fraud detectors that allow or block transactions.
These systems can be available, fast, and return valid responses while consistently making terrible decisions. Traditional SLOs measure the system, not the judgment.
OpenSRM introduces a new service type and new SLO categories for this:
```yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: code-review-bot
spec:
  type: ai-gate
  slos:
    availability:
      target: 0.999
      window: 30d
    latency:
      p99: 45s
      target: 0.99
    judgment:  # New category
      reversal_rate:
        target: 0.05  # ≤5% of decisions overridden by humans
        window: 30d
        observation_period: 24h
      high_confidence_failure:
        target: 0.02  # ≤2% confident-and-wrong
        window: 30d
        confidence_threshold: 0.9
```
Reversal rate (i.e. how often humans override the AI's decision) is the key metric here. It requires no ground-truth labelling, no ML pipeline, no delayed evaluation. You just track 'AI said approve, human said reject.' This is measurable in production today.
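Computing it is as simple as the description suggests. A sketch over a hypothetical decision log (the record shape is illustrative, not part of the spec):

```python
# Sketch: reversal rate = share of human-reviewed AI decisions where
# the human overrode the AI. Unreviewed decisions are excluded.
def reversal_rate(decisions: list[dict]) -> float:
    reviewed = [d for d in decisions if d.get("human_decision") is not None]
    if not reviewed:
        return 0.0
    reversals = sum(
        1 for d in reviewed if d["ai_decision"] != d["human_decision"]
    )
    return reversals / len(reviewed)

log = [
    {"ai_decision": "approve", "human_decision": "approve"},
    {"ai_decision": "approve", "human_decision": "reject"},  # reversal
    {"ai_decision": "reject",  "human_decision": None},      # unreviewed
    {"ai_decision": "reject",  "human_decision": "reject"},
]
# 1 reversal out of 3 reviewed decisions
```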
I'll be writing more about judgment SLOs in a follow-up, but the key insight is this: as AI systems take on more consequential decisions, we need SLOs that measure decision quality, not just system health.
NthLayer: The Reference Implementation
OpenSRM is a specification. NthLayer is a tool that implements it.
Given an OpenSRM manifest, NthLayer can:
- Validate the manifest against the schema
- Validate against declared dependencies
- Generate Prometheus alerting rules
- Generate Grafana dashboards
- Generate OpenSLO specifications
- Verify that declared metrics actually exist in your monitoring stack
- Gate deployments based on error-budget status
```bash
$ nthlayer validate service.reliability.yaml
✓ Schema valid
✓ Dependencies resolvable
✗ Target 99.99% unachievable (99.75% ceiling from dependencies)

$ nthlayer apply service.reliability.yaml
Generated: prometheus-rules.yaml (12 rules)
Generated: grafana-dashboard.json
Generated: openslo-spec.yaml

$ nthlayer check-deploy --service checkout-service
✓ Deployment allowed
  - availability: 99.97% (target: 99.99%)
  - error budget remaining: 4.2 hours
  - no blocking incidents
```
The goal is that 'is this service ready for production?' becomes a deterministic question with a checkable answer, not a subjective judgment call in a PR review.
Why Now?
SLOs have had nearly a decade to mature since the Google SRE handbook popularised them. GitOps has normalised declarative configuration. Platform engineering has emerged as a discipline. The concepts are ready, but the tooling has lagged behind.
Meanwhile, AI systems are now being deployed into production faster than our reliability practices can adapt. We're still measuring AI services the same way we measure CRUD APIs, even though the failure modes are fundamentally different.
OpenSRM is my attempt to codify what I've learned about reliability engineering into something others can use, extend, and contribute to. The specification is open. The reference implementation is open source. The goal is a standard that makes reliability engineering more consistent, more automated, and more adapted to the systems we're actually building.
The OpenSRM specification is available at github.com/rsionnach/opensrm.
NthLayer, the reference implementation, is at github.com/rsionnach/nthlayer.
I'm particularly interested in feedback on:
- **The judgment SLO model:** is decision reversal rate the right primary metric for AI gate quality, and what other signals matter?
- **Dependency validation:** how should tooling handle partial dependency information, given that not everyone will have a complete service graph on day one?
- **Template inheritance:** is shallow merge the right model, or do teams need more sophisticated inheritance?
Star the repos, open issues, or tell me where I'm wrong. Reliability shouldn't be something we figure out in postmortems.
Rob Fox is a Senior Site Reliability Engineer building open-source reliability tooling. Previously: Shift-Left Reliability.