Rob Fox
Your AI Agent Is Available, Fast, and Making Terrible Decisions

Your code review bot has 99.9% availability. Median response time is under two seconds. It hasn't thrown an error in weeks.

It's also approving PRs with critical security vulnerabilities and rejecting clean code because it doesn't like the variable names. Your senior engineers are quietly overriding it dozens of times a day. Nobody's tracking that. Nobody even has a dashboard for it.

This is the state of AI reliability in 2026: we're measuring the system, not the judgment.

The Widening Gap

SLOs have been the gold standard for service reliability since the Google SRE handbook popularised them nearly a decade ago. Availability. Latency. Error rate. Throughput. These metrics tell you whether a service is up and responsive. They're essential. They're also completely insufficient for AI systems that make decisions.

Consider the systems being deployed right now: code-review bots that approve or reject PRs, content moderators that publish or flag posts, fraud detectors that allow or block transactions, triage agents that route incidents to teams. These are binary decision-makers embedded in critical workflows.

Every existing observability tool monitors the same things: token usage, latency, cost per request, trace depth, error rates. Langfuse, Arize Phoenix, Datadog LLM Observability, LangSmith, Braintrust: they all give you operational metrics. Some offer evaluation frameworks. None of them answer the question that actually matters: is this agent making good decisions in production, right now, continuously?

That's the gap. And it's growing wider every week as teams deploy more autonomous systems into production.

What a Judgment SLO Looks Like

I've been building reliability tooling for a while now, first NthLayer, then the OpenSRM specification. The further I get into AI systems, the more I realise we need a new category of SLO entirely. Not a replacement for availability and latency, but an addition to them.

I'm calling them judgment SLOs. They measure decision quality the same way traditional SLOs measure system health: as a target, over a window, with an error budget.

The key insight is that you don't need ground-truth labels to measure decision quality. You need human overrides. This is the human-in-the-loop (HITL) pattern you've likely read about in countless AI articles and whitepapers, treated as a telemetry source rather than a safety net.

Reversal Rate: The Metric That Already Exists in Your Data

Every AI decision system with a human in the loop already has this signal. The AI says approve, a human says reject. The AI flags content, a human unflags it. The AI blocks a transaction, a human allows it through. These are reversals: cases where a human reviewed the AI's decision and changed the action it took.

Reversal rate is the percentage of AI decisions that get overridden by humans within an observation window:

```
reversal_rate = human_overrides / total_ai_decisions   (over observation_period)
```

This metric is powerful for three reasons.

  • It requires zero labelling infrastructure. You don't need a ground-truth dataset. You don't need an ML pipeline. You just need to track two events: 'AI made a decision' and 'human changed it.'

  • It uses human judgment as the quality signal. In most production systems, when a human overrides an AI, the human is right. Not always, but often enough that the override rate is a meaningful quality indicator.

  • It's measurable today. If you have any kind of human review process, you already have this data. You're just not treating it as an SLO.
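As a sketch of how little machinery this takes, here's the whole computation as two counters. The class and method names are illustrative, not part of any spec:

```python
from dataclasses import dataclass

@dataclass
class DecisionLog:
    """Minimal log of AI decisions and human overrides (illustrative)."""
    total_ai_decisions: int = 0
    human_overrides: int = 0

    def record_decision(self) -> None:
        self.total_ai_decisions += 1

    def record_override(self) -> None:
        self.human_overrides += 1

    def reversal_rate(self) -> float:
        # Guard against division by zero before any decisions are logged.
        if self.total_ai_decisions == 0:
            return 0.0
        return self.human_overrides / self.total_ai_decisions

log = DecisionLog()
for _ in range(200):   # 200 AI decisions in the window
    log.record_decision()
for _ in range(9):     # 9 of them overridden by humans
    log.record_override()

print(log.reversal_rate())  # 9 / 200 = 0.045, inside a 5% target
```

In a real system the two `record_*` calls would hang off your existing review workflow; everything else is arithmetic.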

The following is what a judgment SLO looks like in an OpenSRM manifest:

```yaml
apiVersion: opensrm/v1
kind: ServiceReliabilityManifest
metadata:
  name: code-review-bot
spec:
  type: ai-gate
  slos:
    availability:
      target: 0.999
      window: 30d
    latency:
      p99: 45s
      target: 0.99
    judgment:
      reversal:
        rate:
          target: 0.05       # 5% of decisions overridden by humans
          window: 30d
          observation_period: 24h
        high_confidence_failure:
          target: 0.02       # 2% confident-and-wrong
          window: 30d
          confidence_threshold: 0.9
```

The observation_period matters. A decision isn't considered 'final' until humans have had time to review it. For a code-review bot, 24 hours is reasonable. For a fraud detector, it might be minutes. For a content moderator, it could be a week. The period defines how long you wait before counting a decision as uncontested.
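One way to make that concrete: only count a decision toward the reversal rate once its observation period has elapsed. A minimal sketch, where the event shapes and IDs are made up for illustration:

```python
from datetime import datetime, timedelta

OBSERVATION_PERIOD = timedelta(hours=24)  # e.g. a code-review bot

def finalized_reversal_rate(decisions, overrides, now):
    """Reversal rate over decisions whose review window has closed.

    decisions: {decision_id: decided_at}; overrides: set of overridden ids.
    (These shapes are illustrative, not part of any spec.)
    """
    final = {d for d, t in decisions.items() if now - t >= OBSERVATION_PERIOD}
    if not final:
        return 0.0
    return len(final & overrides) / len(final)

now = datetime(2026, 1, 2, 12, 0)
decisions = {
    "pr-101": now - timedelta(hours=30),  # window closed, uncontested
    "pr-102": now - timedelta(hours=25),  # window closed, overridden
    "pr-103": now - timedelta(hours=2),   # still inside the review window
}
overrides = {"pr-102", "pr-103"}

print(finalized_reversal_rate(decisions, overrides, now))  # 1 / 2 = 0.5
```

Note that `pr-103` is excluded entirely: it has been overridden, but its window hasn't closed, so it doesn't yet count in either the numerator or the denominator.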

Beyond Reversal Rate: High-Confidence Failure

Reversal rate is the foundation, but it has a blind spot: it only captures cases where humans actually review the decision. If your AI approves something with high confidence and nobody looks at it, a bad decision goes unmeasured.

That's where high-confidence failure (HCF) comes in. HCF tracks cases where the AI was confident and wrong, meaning decisions made above a specified confidence threshold that were subsequently reversed.

```
high_confidence_failure = reversals_above_threshold / decisions_above_threshold
```

An AI system with a 4% reversal rate might look healthy. But if its high-confidence failures are at 8%, something is seriously wrong: the model is confidently wrong, which means the decisions least likely to be reviewed are the ones most likely to be bad. That's a fundamentally different risk profile from an AI that's uncertain and wrong.

HCF is the metric that tells you whether you can trust the AI's confidence scores. If confidence doesn't correlate with correctness, you can't use confidence to decide what to review. And if you can't decide what to review, you either review everything (defeating the purpose of automation) or miss the failures that matter most.
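A rough sketch of computing HCF alongside the plain reversal rate, assuming each logged decision carries a confidence score (the numbers below are invented to show the failure mode):

```python
# Each record: (confidence, was_reversed). Values are made up for illustration.
decisions = [
    (0.95, True), (0.92, True), (0.97, False), (0.91, False),
    (0.60, True), (0.50, False), (0.40, False), (0.70, False),
    (0.30, False), (0.55, False),
]

CONFIDENCE_THRESHOLD = 0.9  # matches the manifest's confidence_threshold

# Overall reversal rate across all decisions.
reversal_rate = sum(1 for _, r in decisions if r) / len(decisions)

# HCF: reversals restricted to high-confidence decisions.
confident = [(c, r) for c, r in decisions if c >= CONFIDENCE_THRESHOLD]
hcf = sum(1 for _, r in confident if r) / len(confident)

print(f"reversal_rate={reversal_rate:.2f}, hcf={hcf:.2f}")
# Here HCF (0.50) exceeds the overall rate (0.30): confidence is
# anti-correlated with correctness, the risk profile described above.
```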

What This Makes Possible

Once you define judgment SLOs, several things follow.

  • Error budgets for decision quality. Just like traditional SLOs, a judgment SLO creates an error budget. A 5% reversal rate target over 30 days means you can tolerate a certain number of bad decisions before the budget is exhausted. When the budget runs low, you can gate deployments, increase human review rates, or reduce the AI's autonomy. These are the same operational responses you'd use for an availability SLO breach.

  • Alerting on quality degradation. A reversal rate SLO generates Prometheus alerting rules like any other SLO. Burn-rate alerts tell you when decision quality is degrading faster than the budget can absorb. You don't need an ML engineer to notice drift; your existing on-call process catches it.

  • Deployment gates. Before shipping a new model version, check the judgment SLO. If the current model is already close to exhausting its decision quality budget, deploying a new version is risky. This is the same logic teams use for availability-based deployment gates, applied to decision quality.

  • Dependency math. If your checkout flow depends on a fraud detection agent, the quality of the fraud agent's decisions constrains the reliability of the checkout flow. OpenSRM's dependency validation can express this: your service's judgment quality ceiling is bounded by the worst judgment SLO in its critical path.
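The error-budget arithmetic for the first two points is the same as for any ratio SLO. A sketch with made-up numbers, assuming the 5% target from the manifest:

```python
def judgment_budget_remaining(target_rate, total_decisions, overrides):
    """Fraction of the reversal-rate error budget still unspent.

    target_rate: max tolerated reversal rate over the window (e.g. 0.05).
    """
    allowed = target_rate * total_decisions  # reversals the window tolerates
    if allowed == 0:
        return 0.0
    return (allowed - overrides) / allowed

# 30-day window so far: 4,000 decisions at a 5% target
# -> 200 tolerated reversals; 150 already spent.
budget_left = judgment_budget_remaining(0.05, 4_000, overrides=150)
print(f"{budget_left:.0%} of the reversal budget remains")  # 25%

# Hypothetical policy threshold: below 10%, tighten the loop.
if budget_left < 0.10:
    print("Budget nearly exhausted: gate deployments, raise review rate")
```

The 10% policy threshold is an assumption for illustration; in practice you'd derive it from your burn-rate alerting windows, as with availability SLOs.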

The Instrumentation Problem

The missing piece right now is standardised telemetry. There's no OpenTelemetry semantic convention for 'AI made a decision' or 'human overrode it.' I've been working on proposals for gen_ai.decision.* and gen_ai.override.* attributes that would make this data portable across vendors and tools. Without that standard, every team rolls their own event schema, and tooling can't be built generically.

The events are simple:

```
gen_ai.decision.outcome: approve | reject | flag | route
gen_ai.decision.confidence: 0.0 - 1.0
gen_ai.decision.class: code_review | content_moderation | fraud_detection
gen_ai.override.original_outcome: approve
gen_ai.override.new_outcome: reject
gen_ai.override.actor: human | automated_policy
```

Two events. That's what it takes to compute reversal rate. The tooling to generate Prometheus recording rules, Grafana dashboards, and alerting from these events can be fully automated once the schema exists. That's what NthLayer does for traditional SLOs, and it's what I'm extending it to do for judgment SLOs.
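To make the join concrete, here's a sketch that pairs the two event types by a shared decision ID. The attribute names follow the *proposed* `gen_ai.decision.*` / `gen_ai.override.*` convention above, which is not ratified OpenTelemetry semantics, and the event shapes are invented for this example:

```python
# Flat event records, as they might arrive from a telemetry pipeline.
events = [
    {"name": "gen_ai.decision", "id": "pr-1",
     "gen_ai.decision.outcome": "approve",
     "gen_ai.decision.confidence": 0.93},
    {"name": "gen_ai.decision", "id": "pr-2",
     "gen_ai.decision.outcome": "reject",
     "gen_ai.decision.confidence": 0.71},
    {"name": "gen_ai.override", "id": "pr-1",
     "gen_ai.override.original_outcome": "approve",
     "gen_ai.override.new_outcome": "reject",
     "gen_ai.override.actor": "human"},
]

decisions = {e["id"] for e in events if e["name"] == "gen_ai.decision"}
# Count only human overrides, not automated-policy ones.
overrides = {e["id"] for e in events
             if e["name"] == "gen_ai.override"
             and e["gen_ai.override.actor"] == "human"}

rate = len(overrides & decisions) / len(decisions)
print(rate)  # 1 / 2 = 0.5
```

In Prometheus terms this is just a ratio of two counters sliced by the same labels, which is why the recording rules can be generated mechanically once the schema is fixed.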

Why This Matters Now

AI agents are multiplying in production faster than our reliability practices are evolving. Every week, another team deploys an autonomous agent into a critical workflow. The observability vendors are building traces, cost tracking, and latency dashboards. The ML teams are building offline evals and prompt testing frameworks. Nobody is building the continuous, production-time measurement of decision quality that SREs need to actually run these systems.

The question isn't whether AI agents need SLOs on their judgment. The question is whether we'll build the practice proactively or wait until a high-profile failure forces it.

We have the patterns. SLOs are a solved problem. Error budgets work. Prometheus can compute any ratio. The only thing missing is the recognition that decision quality is a reliability concern, not just an ML concern, and that it deserves the same operational rigour we give to availability.


The OpenSRM specification, including the type: ai-gate judgment SLO model, is at github.com/rsionnach/opensrm.

NthLayer, the CLI that generates Prometheus rules and Grafana dashboards from reliability manifests, is at github.com/rsionnach/nthlayer.

I'm actively working on the judgment SLO specification model (is reversal rate the right primary metric, and what signals am I missing?), OpenTelemetry semantic convention proposals for gen_ai.decision.* and gen_ai.override.*, and NthLayer support for generating judgment SLO recording rules and dashboards.

If you're running AI agents in production and manually tracking override rates in spreadsheets (or not tracking them at all), I'd like to hear what you're seeing. Open an issue, or find me on the CNCF Slack.

Decision quality is a reliability problem. Let's treat it like one.


Rob Fox is a Senior Site Reliability Engineer building open-source reliability tooling. Previously: Shift-Left Reliability, OpenSRM: An Open Specification for Service Reliability.
