<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rob Fox</title>
    <description>The latest articles on DEV Community by Rob Fox (@rsionnach).</description>
    <link>https://dev.to/rsionnach</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3697052%2Fd9a378bc-4c55-4556-9e31-29784e45eb48.png</url>
      <title>DEV Community: Rob Fox</title>
      <link>https://dev.to/rsionnach</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rsionnach"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Is Available, Fast, and Making Terrible Decisions</title>
      <dc:creator>Rob Fox</dc:creator>
      <pubDate>Fri, 27 Feb 2026 22:20:30 +0000</pubDate>
      <link>https://dev.to/rsionnach/your-ai-agent-is-available-fast-and-making-terrible-decisions-54ac</link>
      <guid>https://dev.to/rsionnach/your-ai-agent-is-available-fast-and-making-terrible-decisions-54ac</guid>
      <description>&lt;p&gt;Your code review bot has 99.9% availability. Median response time is under two seconds. It hasn't thrown an error in weeks.&lt;/p&gt;

&lt;p&gt;It's also approving PRs with critical security vulnerabilities and rejecting clean code because it doesn't like the variable names. Your senior engineers are quietly overriding it dozens of times a day. Nobody's tracking that. Nobody even has a dashboard for it.&lt;/p&gt;

&lt;p&gt;This is the state of AI reliability in 2026: we're measuring the system, not the judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Widening Gap
&lt;/h2&gt;

&lt;p&gt;SLOs have been the gold standard for service reliability since the Google SRE handbook popularised them nearly a decade ago. Availability. Latency. Error rate. Throughput. These metrics tell you whether a service is up and responsive. They're essential. They're also completely insufficient for AI systems that make decisions.&lt;/p&gt;

&lt;p&gt;Consider the systems being deployed right now: code-review bots that approve or reject PRs, content moderators that publish or flag posts, fraud detectors that allow or block transactions, triage agents that route incidents to teams. These are binary decision-makers embedded in critical workflows.&lt;/p&gt;

&lt;p&gt;Every existing observability tool monitors the same things: token usage, latency, cost per request, trace depth, error rates. Langfuse, Arize Phoenix, Datadog LLM Observability, LangSmith, Braintrust: they all give you operational metrics. Some offer evaluation frameworks. None of them answer the question that actually matters: is this agent making good decisions in production, right now, continuously?&lt;/p&gt;

&lt;p&gt;That's the gap. And it's growing wider every week as teams deploy more autonomous systems into production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Judgment SLO Looks Like
&lt;/h2&gt;

&lt;p&gt;I've been building reliability tooling for a while now, first NthLayer, then the OpenSRM specification. The further I get into AI systems, the more I realise we need a new category of SLO entirely. Not a replacement for availability and latency, but an addition to them.&lt;/p&gt;

&lt;p&gt;I'm calling them judgment SLOs. They measure decision quality the same way traditional SLOs measure system health: as a target, over a window, with an error budget.&lt;/p&gt;

&lt;p&gt;The key insight is that you don't need ground-truth labels to measure decision quality. You need human overrides. This is the human-in-the-loop (HITL) pattern you've likely read about in AI articles and whitepapers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reversal Rate: The Metric That Already Exists in Your Data
&lt;/h3&gt;

&lt;p&gt;Every AI decision system with a human in the loop already has this signal. The AI says approve, a human says reject. The AI flags content, a human unflags it. The AI blocks a transaction, a human allows it through. These are reversals: cases where a human reviewed the AI's decision and disagreed with the action it took.&lt;/p&gt;

&lt;p&gt;Reversal rate is the percentage of AI decisions that get overridden by humans within an observation window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reversal_rate = human_overrides / total_ai_decisions (over observation_period)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This metric is powerful for three reasons. First, it requires zero labelling infrastructure. You don't need a ground-truth dataset. You don't need an ML pipeline. You just need to track two events: 'AI made a decision' and 'human changed it.' Second, it uses human judgment as the quality signal. In most production systems, when a human overrides an AI, the human is right. Not always, but often enough that the override rate is a meaningful quality indicator. Third, it's measurable today. If you have any kind of human review process, you already have this data. You're just not treating it as an SLO.&lt;/p&gt;
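&lt;p&gt;As a sketch, the bookkeeping really is just two counters. Nothing below is prescribed by any spec; the class and method names are illustrative:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class ReversalTracker:
    """Counts the two events: 'AI made a decision' and 'human changed it'."""
    decisions: int = 0
    overrides: int = 0

    def record_decision(self) -> None:
        self.decisions += 1

    def record_override(self) -> None:
        self.overrides += 1

    def reversal_rate(self) -> float:
        # No decisions yet: nothing to reverse.
        if self.decisions == 0:
            return 0.0
        return self.overrides / self.decisions

tracker = ReversalTracker()
for _ in range(200):
    tracker.record_decision()
for _ in range(9):
    tracker.record_override()

print(tracker.reversal_rate())  # 9 / 200 = 0.045, inside a 5% target
```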

&lt;p&gt;The following is what a judgment SLO looks like in an &lt;a href="https://github.com/rsionnach/opensrm" rel="noopener noreferrer"&gt;OpenSRM&lt;/a&gt; manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;opensrm/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceReliabilityManifest&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code-review-bot&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-gate&lt;/span&gt;
  &lt;span class="na"&gt;slos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;availability&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.999&lt;/span&gt;
      &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
    &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;p99&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45s&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.99&lt;/span&gt;
    &lt;span class="na"&gt;judgment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;reversal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;       &lt;span class="c1"&gt;# 5% of decisions overridden by humans&lt;/span&gt;
          &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
          &lt;span class="na"&gt;observation_period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;
        &lt;span class="na"&gt;high_confidence_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.02&lt;/span&gt;       &lt;span class="c1"&gt;# 2% confident-and-wrong&lt;/span&gt;
          &lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30d&lt;/span&gt;
          &lt;span class="na"&gt;confidence_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;observation_period&lt;/code&gt; matters. A decision isn't considered 'final' until humans have had time to review it. For a code-review bot, 24 hours is reasonable. For a fraud detector, it might be minutes. For a content moderator, it could be a week. The period defines how long you wait before counting a decision as uncontested.&lt;/p&gt;
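&lt;p&gt;One way to implement that waiting rule, as a sketch (the 24-hour period and the helper name are illustrative, not part of the spec):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

OBSERVATION_PERIOD = timedelta(hours=24)  # illustrative: a per-service choice

def is_final(decided_at: datetime, now: datetime) -> bool:
    """A decision only counts toward the SLO once humans have had
    the full observation period to contest it."""
    return now - decided_at >= OBSERVATION_PERIOD

now = datetime(2026, 3, 1, tzinfo=timezone.utc)
decided = [
    datetime(2026, 2, 27, tzinfo=timezone.utc),  # old enough to count
    datetime(2026, 2, 28, tzinfo=timezone.utc),  # exactly 24h: counts
    now - timedelta(hours=2),                    # still contestable: excluded
]
final = [t for t in decided if is_final(t, now)]
print(len(final))  # 2
```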

&lt;h2&gt;
  
  
  Beyond Reversal Rate: High-Confidence Failure
&lt;/h2&gt;

&lt;p&gt;Reversal rate is the foundation, but it has a blind spot: it only captures cases where humans actually review the decision. If your AI approves something with high confidence and nobody looks at it, a bad decision goes unmeasured.&lt;/p&gt;

&lt;p&gt;That's where high-confidence failure (HCF) comes in. HCF tracks cases where the AI was confident and wrong, meaning decisions made above a specified confidence threshold that were subsequently reversed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;high_confidence_failure = reversals_above_threshold / decisions_above_threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An AI system with a 4% reversal rate might look healthy. But if its high-confidence failures are at 8%, something is seriously wrong: the model is confidently wrong, which means the decisions least likely to be reviewed are the ones most likely to be bad. That's a fundamentally different risk profile from an AI that's uncertain and wrong.&lt;/p&gt;

&lt;p&gt;HCF is the metric that tells you whether you can trust the AI's confidence scores. If confidence doesn't correlate with correctness, you can't use confidence to decide what to review. And if you can't decide what to review, you either review everything (defeating the purpose of automation) or miss the failures that matter most.&lt;/p&gt;
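&lt;p&gt;A sketch of the comparison, on synthetic data constructed to mirror the 4%-versus-8% scenario above:&lt;/p&gt;

```python
def high_confidence_failure(decisions, threshold=0.9):
    """HCF = reversals above the confidence threshold, divided by
    all decisions above that threshold.
    `decisions` is a list of (confidence, was_reversed) pairs."""
    confident = [d for d in decisions if d[0] >= threshold]
    if not confident:
        return 0.0
    reversed_count = sum(1 for d in confident if d[1])
    return reversed_count / len(confident)

# A model whose overall reversal rate looks healthy...
decisions = [(0.95, True)] * 8 + [(0.95, False)] * 92 + [(0.5, False)] * 100
overall = sum(1 for d in decisions if d[1]) / len(decisions)
print(overall)                             # 0.04: inside a 5% target
print(high_confidence_failure(decisions))  # 0.08: confidently wrong twice as often
```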

&lt;h2&gt;
  
  
  What This Makes Possible
&lt;/h2&gt;

&lt;p&gt;Once you define judgment SLOs, several things follow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error budgets for decision quality.&lt;/strong&gt; Just like traditional SLOs, a judgment SLO creates an error budget. A 5% reversal rate target over 30 days means you can tolerate a certain number of bad decisions before the budget is exhausted. When the budget runs low, you can gate deployments, increase human review rates, or reduce the AI's autonomy. These are the same operational responses you'd use for an availability SLO breach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerting on quality degradation.&lt;/strong&gt; A reversal rate SLO generates Prometheus alerting rules like any other SLO. Burn-rate alerts tell you when decision quality is degrading faster than the budget can absorb. You don't need an ML engineer to notice a drift; your existing on-call process catches it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment gates.&lt;/strong&gt; Before shipping a new model version, check the judgment SLO. If the current model is already close to exhausting its decision quality budget, deploying a new version is risky. This is the same logic teams use for availability-based deployment gates, applied to decision quality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependency math.&lt;/strong&gt; If your checkout flow depends on a fraud detection agent, the quality of the fraud agent's decisions constrains the reliability of the checkout flow. OpenSRM's dependency validation can express this: your service's judgment quality ceiling is bounded by the worst judgment SLO in its critical path.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
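&lt;p&gt;As a rough sketch of the alerting idea, a Prometheus rule group might look like the following. The metric names and the fast-burn multiplier are illustrative; no such convention exists yet:&lt;/p&gt;

```yaml
groups:
  - name: judgment-slo-code-review-bot
    rules:
      # Reversal rate over the observation-sized window.
      - record: ai:reversal_rate:24h
        expr: |
          sum(rate(ai_decision_overrides_total[24h]))
            /
          sum(rate(ai_decisions_total[24h]))
      # Fast burn: reversals consuming the 30d budget 14.4x too fast,
      # the same multiwindow multiplier used for availability SLOs.
      - alert: JudgmentSLOFastBurn
        expr: ai:reversal_rate:24h > (14.4 * 0.05)
        for: 1h
        labels:
          severity: page
```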

&lt;h2&gt;
  
  
  The Instrumentation Problem
&lt;/h2&gt;

&lt;p&gt;The missing piece right now is standardised telemetry. There's no OpenTelemetry semantic convention for 'AI made a decision' or 'human overrode it.' I've been working on proposals for &lt;code&gt;gen_ai.decision.*&lt;/code&gt; and &lt;code&gt;gen_ai.override.*&lt;/code&gt; attributes that would make this data portable across vendors and tools. Without that standard, every team rolls their own event schema, and tooling can't be built generically.&lt;/p&gt;

&lt;p&gt;The events are simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gen_ai.decision.outcome: approve | reject | flag | route
gen_ai.decision.confidence: 0.0 - 1.0
gen_ai.decision.class: code_review | content_moderation | fraud_detection
gen_ai.override.original_outcome: approve
gen_ai.override.new_outcome: reject
gen_ai.override.actor: human | automated_policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two events. That's what it takes to compute reversal rate. The tooling to generate Prometheus recording rules, Grafana dashboards, and alerting from these events can be fully automated once the schema exists. That's what NthLayer does for traditional SLOs, and it's what I'm extending it to do for judgment SLOs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters Now
&lt;/h2&gt;

&lt;p&gt;AI agents are multiplying in production faster than our reliability practices are evolving. Every week, another team deploys an autonomous agent into a critical workflow. The observability vendors are building traces, cost tracking, and latency dashboards. The ML teams are building offline evals and prompt testing frameworks. Nobody is building the continuous, production-time measurement of decision quality that SREs need to actually run these systems.&lt;/p&gt;

&lt;p&gt;The question isn't whether AI agents need SLOs on their judgment. The question is whether we'll build the practice proactively or wait until a high-profile failure forces it.&lt;/p&gt;

&lt;p&gt;We have the patterns. SLOs are a solved problem. Error budgets work. Prometheus can compute any ratio. The only thing missing is the recognition that decision quality is a reliability concern, not just an ML concern, and that it deserves the same operational rigour we give to availability.&lt;/p&gt;




&lt;p&gt;The OpenSRM specification, including the &lt;code&gt;type: ai-gate&lt;/code&gt; judgment SLO model, is at &lt;a href="https://github.com/rsionnach/opensrm" rel="noopener noreferrer"&gt;github.com/rsionnach/opensrm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NthLayer, the CLI that generates Prometheus rules and Grafana dashboards from reliability manifests, is at &lt;a href="https://github.com/rsionnach/nthlayer" rel="noopener noreferrer"&gt;github.com/rsionnach/nthlayer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm actively working on the judgment SLO specification model (is reversal rate the right primary metric, and what signals am I missing?), OpenTelemetry semantic convention proposals for &lt;code&gt;gen_ai.decision.*&lt;/code&gt; and &lt;code&gt;gen_ai.override.*&lt;/code&gt;, and NthLayer support for generating judgment SLO recording rules and dashboards.&lt;/p&gt;

&lt;p&gt;If you're running AI agents in production and manually tracking override rates in spreadsheets (or not tracking them at all), I'd like to hear what you're seeing. Open an issue, or find me on the CNCF Slack.&lt;/p&gt;

&lt;p&gt;Decision quality is a reliability problem. Let's treat it like one.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob Fox is a Senior Site Reliability Engineer building open-source reliability tooling. Previously: &lt;a href="https://dev.to/rsionnach/shift-left-reliability-4poo"&gt;Shift-Left Reliability&lt;/a&gt;, &lt;a href="https://dev.to/rsionnach/opensrm-an-open-specification-for-service-reliability-44bi"&gt;OpenSRM: An Open Specification for Service Reliability&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>sre</category>
    </item>
    <item>
      <title>OpenSRM: An Open Specification for Service Reliability</title>
      <dc:creator>Rob Fox</dc:creator>
      <pubDate>Tue, 24 Feb 2026 21:21:04 +0000</pubDate>
      <link>https://dev.to/rsionnach/opensrm-an-open-specification-for-service-reliability-44bi</link>
      <guid>https://dev.to/rsionnach/opensrm-an-open-specification-for-service-reliability-44bi</guid>
      <description>&lt;p&gt;A team sets a 99.99% availability target for their checkout service. It's ambitious but achievable: they've done the work, invested in redundancy, and their metrics look solid.&lt;/p&gt;

&lt;p&gt;Six months later, they're missing their target every single month. The postmortem reveals the problem: their critical path flows through three upstream services. The authentication service promises 99.9%. The payment gateway promises 99.95%. The inventory service promises 99.9%.&lt;/p&gt;

&lt;p&gt;The math is straightforward: 0.999 × 0.9995 × 0.999 = 0.9975. Their theoretical ceiling is 99.75%, not 99.99%. The target for the checkout service was impossible from day one.&lt;/p&gt;

&lt;p&gt;Nobody caught this because there's no standard way to express it. SLOs are set per-service, in isolation. Dependency information lives in architecture diagrams that nobody updates, service catalogs that are perpetually stale, and the heads of engineers who've since left the company. Nobody owns the cross-service math.&lt;/p&gt;

&lt;p&gt;This is one of the things I've been building toward with NthLayer, and it's why I've developed OpenSRM: an open specification for declaring service reliability requirements as code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Reliability Is Bespoke and Ungoverned
&lt;/h2&gt;

&lt;p&gt;I've seen teams running identical technology stacks (the same Kubernetes clusters, the same Prometheus instances, the same on-call rotations) with completely different SLOs, metrics, and alerting strategies. Not because one approach was better than another, but because nobody told them what to implement.&lt;/p&gt;

&lt;p&gt;Services regularly move into production with SLOs being created months later, or never. Dashboards are missing, insufficient, or inconsistent. 'Looks fine to me' during PR reviews. Tribal knowledge. Varying levels of understanding across teams.&lt;/p&gt;

&lt;p&gt;We have version control for code. We have version control for infrastructure. Security has transformed with shift-left practices, finding vulnerabilities as code is written rather than after deployment. But reliability? Still fundamentally bespoke and ungoverned.&lt;/p&gt;

&lt;h2&gt;
  
  
  What OpenSRM Looks Like
&lt;/h2&gt;

&lt;p&gt;OpenSRM (Open Service Reliability Manifest) is a declarative YAML specification for service reliability requirements. A manifest defines what 'production-ready' means for a service: its SLO targets, ownership, dependencies, and the contracts it makes with other services.&lt;/p&gt;

&lt;p&gt;Here's a basic manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: checkout-service
  tier: critical
  template: api-critical

spec:
  type: api

  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 200ms
      target: 0.995
      window: 30d

  ownership:
    team: platform-checkout
    slack: "#checkout-oncall"
    pagerduty: CHECKOUT_CRITICAL
    runbook: https://wiki.example.com/runbooks/checkout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The syntax is deliberately boring. The value isn't in clever YAML; it's in having a standard format that tooling can validate, generate from, and enforce.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contracts: Separating Internal Targets from External Promises
&lt;/h2&gt;

&lt;p&gt;One of the toughest problems in cross-team reliability is the gap between what a service measures internally and what it promises to others. Your payment service might target 99.995% availability internally (i.e. what you alert on), but only promise 99.99% externally (i.e. what dependent teams can rely on). OpenSRM makes this explicit with contracts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
spec:
  contract:                           # What I promise to dependents
    availability: 0.9999
    latency:
      p99: 300ms

  slos:                               # Internal targets (tighter)
    availability:
      target: 0.99995                 # Buffer above contract
    latency:
      p99: 200ms                      # Headroom below contract
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation eliminates a category of cross-team arguments. Your internal SLOs are your business. Your contract is what others can depend on. The specification makes the boundary explicit.&lt;/p&gt;
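&lt;p&gt;The validation rule is a one-liner. A hedged sketch, with field names made up for illustration:&lt;/p&gt;

```python
contract = {"availability": 0.9999, "latency_p99_ms": 300}
internal = {"availability": 0.99995, "latency_p99_ms": 200}

def has_headroom(contract, internal):
    """Internal targets must be at least as strict as the promise:
    availability no lower, latency no higher."""
    return (internal["availability"] >= contract["availability"]
            and contract["latency_p99_ms"] >= internal["latency_p99_ms"])

print(has_headroom(contract, internal))  # True: the promise is covered
```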

&lt;h2&gt;
  
  
  Dependencies: The Math Nobody Does
&lt;/h2&gt;

&lt;p&gt;Returning to our checkout service example: with OpenSRM, dependency expectations become declarative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
spec:
  dependencies:
    - service: auth-service
      critical: true
      expects:
        availability: 0.999
        latency:
          p99: 100ms

    - service: payment-gateway
      critical: true
      expects:
        availability: 0.9995
        latency:
          p99: 200ms

    - service: inventory-service
      critical: true
      expects:
        availability: 0.999

    - service: recommendation-engine
      critical: false              # Can degrade gracefully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tooling can now do what humans consistently fail to do: validate that your targets are achievable given your dependency chain. If you promise 99.99% but your critical dependencies can only deliver 99.75% combined, the validation fails before you've made a promise you can't keep.&lt;/p&gt;
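&lt;p&gt;The check itself is small. An illustrative sketch of what such tooling computes (the function names are mine, not part of the spec):&lt;/p&gt;

```python
from math import prod

def availability_ceiling(expectations):
    """Serial critical path: availabilities multiply."""
    return prod(expectations)

def target_achievable(target, expectations):
    return availability_ceiling(expectations) >= target

deps = [0.999, 0.9995, 0.999]       # auth, payments, inventory
print(round(availability_ceiling(deps), 4))  # 0.9975
print(target_achievable(0.9999, deps))       # False: an impossible promise
```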

&lt;p&gt;This also creates an objective basis for architectural decisions. 'We need to improve auth-service reliability before checkout can hit its target' becomes a provable statement, not an opinion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Templates: Consistency at Scale
&lt;/h2&gt;

&lt;p&gt;If you have 200 services, you don't want 200 bespoke reliability definitions. OpenSRM supports templates that establish organisational defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
apiVersion: opensrm.io/v1
kind: Template
metadata:
  name: api-critical
spec:
  type: api
  slos:
    availability:
      target: 0.9999
      window: 30d
    latency:
      p99: 300ms
      target: 0.995
  ownership:
    oncall_required: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services inherit from templates and override only what's different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
metadata:
  name: checkout-service
  template: api-critical
spec:
  slos:
    latency:
      p99: 200ms                   # Tighter than template default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how you get consistency without rigidity. Platform teams define the standards; service teams customise where needed.&lt;/p&gt;
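&lt;p&gt;For illustration, here's roughly what a shallow merge of template and service values does. Note that an overridden key replaces the template's entire value, which is why a service overriding &lt;code&gt;latency&lt;/code&gt; must restate every latency field it wants to keep:&lt;/p&gt;

```python
def shallow_merge(template, overrides):
    """Service values win per top-level key; the template fills the rest."""
    merged = dict(template)
    merged.update(overrides)
    return merged

template_slos = {
    "availability": {"target": 0.9999, "window": "30d"},
    "latency": {"p99": "300ms", "target": 0.995},
}
service_slos = {
    "latency": {"p99": "200ms", "target": 0.995},  # restated: shallow merge
}

merged = shallow_merge(template_slos, service_slos)
print(merged["latency"]["p99"])          # 200ms, tighter than the template
print(merged["availability"]["target"])  # 0.9999, inherited unchanged
```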

&lt;h2&gt;
  
  
  AI Gates: A New Kind of Service, A New Kind of SLO
&lt;/h2&gt;

&lt;p&gt;AI systems are increasingly deployed as 'gates' in production workflows: code-review bots that approve or reject PRs, content moderators that publish or flag content, fraud detectors that allow or block transactions.&lt;/p&gt;

&lt;p&gt;These systems can be available, fast, and return valid responses while consistently making terrible decisions. Traditional SLOs measure the system, not the judgment.&lt;/p&gt;

&lt;p&gt;OpenSRM introduces a new service type and new SLO categories for this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yaml
apiVersion: opensrm.io/v1
kind: ServiceReliabilityManifest
metadata:
  name: code-review-bot
spec:
  type: ai-gate

  slos:
    availability:
      target: 0.999
      window: 30d
    latency:
      p99: 45s
      target: 0.99

    judgment:                        # New category
      reversal_rate:
        target: 0.05                 # ≤5% of decisions overridden by humans
        window: 30d
        observation_period: 24h
      high_confidence_failure:
        target: 0.02                 # ≤2% confident-and-wrong
        window: 30d
        confidence_threshold: 0.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reversal rate (i.e. how often humans override the AI's decision) is the key metric here. It requires no ground-truth labelling, no ML pipeline, no delayed evaluation. You just track 'AI said approve, human said reject.' This is measurable in production today.&lt;/p&gt;

&lt;p&gt;I'll be writing more about judgment SLOs in a follow-up, but the key insight is this: as AI systems take on more consequential decisions, we need SLOs that measure decision quality, not just system health.&lt;/p&gt;

&lt;h2&gt;
  
  
  NthLayer: The Reference Implementation
&lt;/h2&gt;

&lt;p&gt;OpenSRM is a specification. NthLayer is a tool that implements it.&lt;br&gt;
Given an OpenSRM manifest, NthLayer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate the manifest against the schema&lt;/li&gt;
&lt;li&gt;Validate against declared dependencies&lt;/li&gt;
&lt;li&gt;Generate Prometheus alerting rules&lt;/li&gt;
&lt;li&gt;Generate Grafana dashboards&lt;/li&gt;
&lt;li&gt;Generate OpenSLO specifications&lt;/li&gt;
&lt;li&gt;Verify that declared metrics actually exist in your monitoring stack&lt;/li&gt;
&lt;li&gt;Gate deployments based on error-budget status
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bash
$ nthlayer validate service.reliability.yaml
✓ Schema valid
✓ Dependencies resolvable  
✓ Targets achievable (within dependency ceiling)

$ nthlayer apply service.reliability.yaml
Generated: prometheus-rules.yaml (12 rules)
Generated: grafana-dashboard.json
Generated: openslo-spec.yaml

$ nthlayer check-deploy --service checkout-service
✓ Deployment allowed
  - availability: 99.97% (target: 99.99%)
  - error budget remaining: 4.2 hours
  - no blocking incidents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal is that 'is this service ready for production?' becomes a deterministic question with a checkable answer, not a subjective judgment call in a PR review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Now?
&lt;/h2&gt;

&lt;p&gt;SLOs have had nearly a decade to mature since the Google SRE handbook popularised them. GitOps has normalised declarative configuration. Platform engineering has emerged as a discipline. The concepts are ready, but the tooling has lagged behind.&lt;/p&gt;

&lt;p&gt;Meanwhile, AI systems are now being deployed into production faster than our reliability practices can adapt. We're still measuring AI services the same way we measure CRUD APIs, even though the failure modes are fundamentally different.&lt;/p&gt;

&lt;p&gt;OpenSRM is my attempt to codify what I've learned about reliability engineering into something others can use, extend, and contribute to. The specification is open. The reference implementation is open source. The goal is a standard that makes reliability engineering more consistent, more automated, and more adapted to the systems we're actually building.&lt;/p&gt;




&lt;p&gt;The OpenSRM specification is available at &lt;a href="https://github.com/rsionnach/opensrm"&gt;github.com/rsionnach/opensrm&lt;/a&gt;.&lt;br&gt;
NthLayer, the reference implementation, is at &lt;a href="https://github.com/rsionnach/nthlayer"&gt;github.com/rsionnach/nthlayer&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'm particularly interested in feedback on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The judgment SLO model

&lt;ul&gt;
&lt;li&gt;Is decision reversal rate the right primary metric for AI gate quality, and what other signals matter?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Dependency validation

&lt;ul&gt;
&lt;li&gt;How should tooling handle partial dependency information, given that not everyone will have a complete service graph on day one?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Template inheritance

&lt;ul&gt;
&lt;li&gt;Is shallow merge the right model, or do teams need more sophisticated inheritance?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Star the repos, open issues, or tell me where I'm wrong. Reliability shouldn't be something we figure out in postmortems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob Fox is a Senior Site Reliability Engineer building open-source reliability tooling. Previously: &lt;a href="https://dev.to/rsionnach/shift-left-reliability-4poo"&gt;Shift-Left Reliability&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>observability</category>
      <category>specification</category>
      <category>sitereliabilityengineering</category>
    </item>
    <item>
      <title>Shift-Left Reliability</title>
      <dc:creator>Rob Fox</dc:creator>
      <pubDate>Mon, 12 Jan 2026 21:34:19 +0000</pubDate>
      <link>https://dev.to/rsionnach/shift-left-reliability-4poo</link>
      <guid>https://dev.to/rsionnach/shift-left-reliability-4poo</guid>
      <description>&lt;p&gt;We've become exceptionally good at incident response. Modern teams restore service quickly, run thoughtful postmortems, and hold themselves accountable through corrective actions.&lt;/p&gt;

&lt;p&gt;And yet…&lt;/p&gt;

&lt;p&gt;A team ships a change that passes every test, gets all the required approvals, and still brings down checkout for 47 minutes. The postmortem conclusion? "We should have known our latency SLO was already at 94% before deploying."&lt;/p&gt;

&lt;p&gt;Many postmortems point to the same root cause: changes we introduced ourselves. Not hardware failures. Not random outages. Just software behaving exactly as we told it to.&lt;/p&gt;

&lt;p&gt;We continue to treat reliability as something to evaluate once those changes are already live. This isn't a failure of tooling or process. It's a question of &lt;em&gt;when&lt;/em&gt; we decide whether a system is ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk0ji4myj0ridnenw5kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk0ji4myj0ridnenw5kk.png" alt="Shift-Left Reliability" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The paradox
&lt;/h2&gt;

&lt;p&gt;We've invested heavily in observing and responding to failure - better alerting, faster incident response, thorough postmortems. Teams care deeply about reliability and spend significant time optimizing how they respond to incidents.&lt;/p&gt;

&lt;p&gt;But when in a service's lifecycle are they supposed to &lt;em&gt;define&lt;/em&gt; reliability? Where's the innovation that happens &lt;em&gt;before&lt;/em&gt; deployment?&lt;/p&gt;

&lt;h2&gt;
  
  
  Where reliability decisions actually happen today
&lt;/h2&gt;

&lt;p&gt;I've seen multiple teams running identical technology stacks with completely different SLOs, metrics, and alerts. Nobody told them what to implement, what's best-practice or how to tune their alerts. They want to be good reliability citizens, but getting from the theory in the handbook to putting that theory into practice is not straightforward.&lt;/p&gt;

&lt;p&gt;Services regularly move into production with SLOs created months later - or never. Dashboards are missing, insufficient, or inconsistent. "Looks fine to me" during PR reviews. Tribal knowledge. Varying levels of understanding across teams.&lt;/p&gt;

&lt;p&gt;Reliability is fundamentally bespoke and ungoverned. That's the core issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  The missing layer
&lt;/h2&gt;

&lt;p&gt;GitHub gave us version control for code. Terraform gave us version control for infrastructure. Security has transformed with shift-left - finding flaws as code is written, not after deployment.&lt;/p&gt;

&lt;p&gt;We're still missing version control for reliability.&lt;/p&gt;

&lt;p&gt;We need a specification that defines requirements, validates them against reality, and generates the artifacts: dashboards, SLOs, alerts, escalation policies. Once the specification is validated and the artifacts exist, the same tool can check in real time whether a service is in breach - and block high-risk deployments in CI/CD.&lt;/p&gt;

&lt;h2&gt;
  
  
  What shift-left reliability actually means
&lt;/h2&gt;

&lt;p&gt;Shift-left reliability doesn't mean more alerts and dashboards, more postmortems, or more people in the room. It means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spec - Define reliability requirements as code before production deployment&lt;/li&gt;
&lt;li&gt;Validate - Test those requirements against reality&lt;/li&gt;
&lt;li&gt;Enforce - Gate deployments through CI/CD&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Engineers don't write PromQL or Grafana JSON - they declare intent, and reliability becomes deterministic. Outcomes are predictable, consistent, transparent, and follow best practice.&lt;/p&gt;
&lt;h2&gt;
  
  
  An executable reliability contract
&lt;/h2&gt;

&lt;p&gt;Keep it simple. A team creates a &lt;em&gt;service.yaml&lt;/em&gt; file with their reliability intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: payment-api
tier: critical
type: api
team: payments
dependencies:
  - postgresql
  - redis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a complete &lt;em&gt;service.yaml&lt;/em&gt; &lt;a href="https://rsionnach.github.io/nthlayer/getting-started/first-service/#complete-service-example" rel="noopener noreferrer"&gt;example&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tooling validates metrics, SLOs, and error budgets, then generates these artifacts automatically. This is the approach I am exploring with an open-source project called NthLayer.&lt;/p&gt;
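To make "generates these artifacts" concrete: for a 99.95% availability target, the generated output could include a Prometheus burn-rate alert along these lines. This is a sketch of the kind of artifact such tooling emits, not NthLayer's actual output; the metric name `http_requests_total` and the thresholds are assumptions.

```yaml
# Hypothetical generated artifact: a fast-burn alert for payment-api's
# 99.95% availability SLO. Metric names and thresholds are illustrative.
groups:
  - name: payment-api-slo
    rules:
      - alert: PaymentApiErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{service="payment-api",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="payment-api"}[1h]))
          ) > (14.4 * 0.0005)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: payment-api is burning its error budget 14.4x too fast
```

The point is that the team never hand-writes this rule; it is derived from the declared tier and SLO in *service.yaml*.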

&lt;p&gt;NthLayer runs in any CI/CD pipeline - GitHub Actions, ArgoCD, Jenkins, Tekton, GitLab CI. The goal isn't to be an inflexible blocker; it's visible risk and explicit decisions. Overrides are fine when they're intentional, logged, and owned.&lt;/p&gt;
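As a sketch of the GitHub Actions case, a gate job can run before the deploy job and fail the pipeline on breach. Job and step names here are hypothetical; the CLI invocation matches the install and check-deploy commands shown at the end of this post.

```yaml
# Hypothetical workflow: fail the pipeline when the reliability
# contract is in breach, before the deploy job ever runs.
jobs:
  reliability-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install nthlayer
      - name: Check reliability contract
        run: nthlayer check-deploy --service payment-api
        # a non-zero exit code (2 = BLOCKED) fails this job
  deploy:
    needs: reliability-gate  # deploy only runs if the gate passed
    runs-on: ubuntu-latest
    steps:
      - run: ./scripts/deploy.sh
```

An override path would be a separate, logged workflow input rather than deleting the gate.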

&lt;p&gt;When a deployment is attempted, the specification is evaluated against reality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ nthlayer check-deploy - service payment-api
 ERROR: Deployment blocked
 - availability SLO at 99.2% (target: 99.95%)
 - error budget exhausted: -47 minutes remaining
 - 3 P1 incidents in last 7 days

Exit code: 2 (BLOCKED)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
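The numbers in output like this come from straightforward error-budget arithmetic. A minimal sketch (illustrative, not NthLayer's implementation): a 99.95% SLO over a 30-day window allows roughly 21.6 minutes of downtime, so a service running at 99.2% is deep in the red.

```python
# Error-budget arithmetic behind a deploy gate (illustrative sketch).

WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

def error_budget_minutes(slo: float) -> float:
    """Total downtime allowed by the SLO over the window."""
    return WINDOW_MINUTES * (1 - slo)

def budget_remaining(slo: float, observed: float) -> float:
    """Budget left after observed availability; negative means exhausted."""
    consumed = WINDOW_MINUTES * (1 - observed)
    return error_budget_minutes(slo) - consumed

print(round(error_budget_minutes(0.9995), 1))    # 21.6 minutes allowed
print(round(budget_remaining(0.9995, 0.992), 1)) # -324.0 -> block the deploy
```

A gate is then just a sign check on the remaining budget, plus whatever incident-count rules the contract declares.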



&lt;h2&gt;
  
  
  Why now?
&lt;/h2&gt;

&lt;p&gt;SLOs have had 8+ years to mature and move from the Google SRE Handbook into mainstream practice. GitOps has normalized declarative configuration. Platform Engineering has matured as a discipline. The concepts are ready but the tooling has lagged behind.&lt;/p&gt;

&lt;p&gt;This is a deliberate shift in approach. Reliability is no longer up for debate during incidents. Services have defined owners with deterministic standards. We can stop reinventing the reliability wheel every time a new service is onboarded. If requirements change, update the &lt;em&gt;service.yaml&lt;/em&gt;, run NthLayer, and every service adopts the new standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does not replace
&lt;/h2&gt;

&lt;p&gt;NthLayer doesn't replace service catalogs, developer portals, observability platforms, or incident management. It doesn't predict failures or eliminate human judgment. It's upstream of all these systems.&lt;/p&gt;

&lt;p&gt;The goal: a reliability specification, automated deployment gates, and a lower cognitive load for implementing best practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open questions
&lt;/h2&gt;

&lt;p&gt;I don't have all the answers, but two questions I keep returning to are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contract Drift:&lt;/strong&gt; What happens when the spec says 99.95% but reality has been 99.5% for months? Is the contract wrong, or is the service broken?&lt;br&gt;
&lt;strong&gt;Emergency Overrides:&lt;/strong&gt; How should they work? Who approves? How do you prevent them from becoming the default?&lt;/p&gt;
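On the drift question, one option is to make the gap mechanically visible: compare the spec's target against trailing observed availability and force a decision once they have disagreed for a full review window. A hedged sketch of that check (not an NthLayer feature; thresholds are illustrative):

```python
# Sketch: flag contract drift when observed availability has sat below
# the spec's target for every period in the review window.

def contract_drift(spec_slo: float, observed: list[float],
                   tolerance: float = 0.001) -> bool:
    """True when every observed period missed the target by more than tolerance."""
    return len(observed) > 0 and all(o < spec_slo - tolerance for o in observed)

# Spec promises 99.95%, but three straight months landed near 99.5%:
# time to either fix the service or renegotiate the contract.
print(contract_drift(0.9995, [0.995, 0.994, 0.996]))  # True
```

Surfacing drift this way turns "is the contract wrong, or is the service broken?" from a hallway debate into a scheduled decision.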
&lt;h2&gt;
  
  
  The timing problem
&lt;/h2&gt;

&lt;p&gt;Where do reliability decisions actually happen in your organization? What would it look like to decide readiness before deployment? What reliability rules do you wish you could enforce automatically?&lt;/p&gt;

&lt;p&gt;The timing problem isn't going away. The only question is whether you address it before deployment - or learn about it in the postmortem.&lt;/p&gt;



&lt;p&gt;NthLayer is open source and looking for early adopters. If you're tired of reliability being an afterthought:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install nthlayer
nthlayer init
nthlayer check-deploy --service your-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;→ &lt;a href="https://github.com/rsionnach/nthlayer" rel="noopener noreferrer"&gt;github.com/rsionnach/nthlayer&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star the repo, open an issue, or tell me I'm wrong. I want to hear how reliability decisions happen in your organization.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob Fox is a Senior Site Reliability Engineer focused on platform and reliability tooling. He's exploring how reliability engineering can move earlier in the software delivery lifecycle. Find him on &lt;a href="https://github.com/rsionnach/nthlayer" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>cicd</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
