How to Evaluate Any AI SRE Tool: A Practitioner's Framework Built From 15 Posts of Production SLIs

#devops #agentaichallenge #automation #ai

Your manager just forwarded you a Gartner report. Analyst recognition of the AI SRE category, sustained on-call pressure, immature trust and governance frameworks, and the need for orchestration rather than disconnected agent experiments all arrived together in 2026. The question landing in every SRE team's backlog right now is: should we buy something, build something, or wait? (Source: Sherlocks)
I've spent four months building the measurement layer for AI agents from scratch — DQR, TIE, HER, AQDD, RTD, CUR, Pre-Action Gate, Semantic Gap detection. Fifteen posts, an open-source library, and a growing arXiv paper. This post is where that work becomes a vendor evaluation framework.
Every claim in this framework maps to a metric I've already defined. You can verify these against any tool — commercial or open-source.
The Problem With Vendor Benchmarks
Datadog reports that Bits AI SRE decreases time to resolution by up to 95%. New Relic reports that its users resolved incidents 25% faster than those without AI features. Both numbers are published. Both are real, in the environments they measured. (Sources: Nova AI Ops, InfoQ)
The question is whether those environments match yours. A 95% MTTR improvement measured on a system with clean telemetry, well-structured runbooks, and narrow incident categories is a different number than what you'll see in a system with fragmented observability, complex dependency graphs, and novel failure modes.
Vendor benchmarks measure the tool in optimal conditions. Your evaluation needs to measure the tool in your conditions. These five questions give you the framework.
Question 1: Does it instrument the reasoning layer?
The semantic gap — the space between what an agent intended and what it executed — is invisible to infrastructure APM. I wrote about this last week using Sherlocks.ai's research: existing tools observe high-level intent or low-level actions, not the correlation between them.
Ask any vendor: do you track re-planning cycles per task? Can I see how many times the agent changed its approach before completing or escalating? Can I query that history after an incident?
If the answer is "we log prompts and tool calls," that's Layer 1 observability. Useful, necessary, insufficient. You need Layer 3 — one structured record per agent task showing the full decision sequence.
What to look for in a demo: ask them to show you a failed task trace. Does it show you the sequence of re-planning decisions, or just the final outcome and the spans?
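To make "one structured record per agent task" concrete, here is a minimal sketch of what such a record could look like. The schema is an assumption for illustration: `TaskDecisionRecord`, `PlanStep`, and their field names are hypothetical and are not taken from any vendor product or from the agentsre library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanStep:
    """One approach the agent committed to (hypothetical schema)."""
    plan: str     # short description of the intended approach
    reason: str   # why the agent chose, or switched to, this plan
    tool_calls: List[str] = field(default_factory=list)


@dataclass
class TaskDecisionRecord:
    """One record per agent task, holding the full decision sequence."""
    task_id: str
    outcome: str  # e.g. "completed", "escalated", "failed"
    steps: List[PlanStep] = field(default_factory=list)

    @property
    def replanning_cycles(self) -> int:
        # The first plan is not a re-plan; every later plan is.
        return max(len(self.steps) - 1, 0)


# Illustrative data only: a task that changed approach twice, then escalated.
record = TaskDecisionRecord(
    task_id="task-001",
    outcome="escalated",
    steps=[
        PlanStep("restart pod", "crash loop detected", ["restart"]),
        PlanStep("roll back deploy", "restart did not clear errors", ["rollback"]),
        PlanStep("escalate to on-call", "rollback blocked by policy"),
    ],
)
print(record.task_id, record.outcome, record.replanning_cycles)
```

If a vendor's failed-task trace cannot be reduced to something like this, with the reason for each change of approach attached, you are probably looking at spans and final outcomes rather than the decision sequence.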
Question 2: What is the Human Escalation Rate in their benchmark?
HER — the fraction of agent decisions that escalated to human judgment — is the most honest single metric for how autonomous a tool actually is. A low MTTR number paired with a high HER means humans were doing most of the resolution work, faster because the agent assembled context for them. That's valuable. It's not the same as autonomous remediation.
Ask: in your benchmark environment, what percentage of incidents did the agent resolve without human action? What percentage required human approval before execution? What triggered escalation most often?
These questions reveal whether the tool is an autonomous remediator or a very good assistant. Both are legitimate. Only one of them matches the vendor's headline claim.
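The split those questions ask for is simple arithmetic once incidents are labeled. The sketch below assumes a hypothetical three-way labeling of benchmark incidents and computes HER per incident rather than per agent decision, which is a simplification; the label names and the sample data are invented for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical resolution labels for benchmark incidents.
AUTONOMOUS = "autonomous"        # agent resolved with no human action
APPROVAL = "approval_required"   # a human approved before execution
HUMAN = "human_resolved"         # a human did the resolution work
LABELS = (AUTONOMOUS, APPROVAL, HUMAN)


def resolution_split(outcomes: List[str]) -> Dict[str, float]:
    """Return the fraction of incidents in each resolution category."""
    if not outcomes:
        raise ValueError("no incidents to summarize")
    unknown = set(outcomes) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(outcomes)
    return {label: counts[label] / len(outcomes) for label in LABELS}


def human_escalation_rate(outcomes: List[str]) -> float:
    """Share of incidents that needed any human judgment."""
    resolution_split(outcomes)  # reuse the validation above
    counts = Counter(outcomes)
    return (counts[APPROVAL] + counts[HUMAN]) / len(outcomes)


# Invented sample: 10 benchmark incidents.
benchmark = [AUTONOMOUS] * 3 + [APPROVAL] * 2 + [HUMAN] * 5
print(resolution_split(benchmark))
print(human_escalation_rate(benchmark))
```

In this invented sample, only 3 of 10 incidents were resolved without a human, so a headline MTTR improvement would mostly describe faster human resolution.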
Question 3: Does it check SLO state before acting?
An agent that remediates without checking your current error budget can compound a degraded situation. I formalized this in the Pre-Action SRE Gate (Post 13): three checks before any autonomous action — error budget remaining, AQDD state, and the agent's own HER trend.
Ask any vendor: does your agent check SLO error budget before executing a remediation? What happens if the error budget is critically low — does it act anyway or escalate? Can I configure the pre-action gate thresholds?
A tool that doesn't have an answer to this question is not safe for production systems where the error budget is already burning.
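A rough sketch of the three-check gate described above follows. It is not the Post 13 implementation: the thresholds (`min_budget`, `max_her_increase`), the reduction of AQDD state to a boolean, and all names are assumptions chosen to show the shape of the check.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateInput:
    error_budget_remaining: float  # fraction of the SLO error budget left, 0.0 to 1.0
    aqdd_healthy: bool             # AQDD state reduced to a boolean (assumption)
    her_trend: float               # change in HER over a lookback window (assumption)


@dataclass
class GateDecision:
    allow: bool
    failed_checks: List[str]


def pre_action_gate(
    inp: GateInput,
    min_budget: float = 0.25,        # hypothetical threshold
    max_her_increase: float = 0.10,  # hypothetical threshold
) -> GateDecision:
    """Run all three checks; allow the action only if none fail."""
    failed: List[str] = []
    if inp.error_budget_remaining < min_budget:
        failed.append("error_budget")
    if not inp.aqdd_healthy:
        failed.append("aqdd_state")
    if inp.her_trend > max_her_increase:
        failed.append("her_trend")
    return GateDecision(allow=not failed, failed_checks=failed)


# Budget nearly exhausted: the gate blocks and names the failing check.
decision = pre_action_gate(GateInput(0.05, True, 0.0))
print(decision.allow, decision.failed_checks)
```

Returning the list of failed checks, and not just a boolean, is a deliberate choice: it gives the escalation path and the audit log something specific to record.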
Question 4: What is the defined blast radius per agent?
Komodor's Klaudia is trained specifically on pod crashes, failed rollouts, autoscaler friction, misconfigurations, and cascading failures in Kubernetes environments. That specificity is its blast radius. 95% accuracy in that domain does not mean 95% accuracy outside it. (Source: Yisusvii)
Every AI SRE tool has an implicit blast radius — the set of systems and failure modes it was trained and tested on. Good tools make this explicit. Ask: what systems can this agent modify autonomously? What systems are write-locked? What failure categories is the accuracy claim based on?
If the vendor can't give you a concrete blast radius definition, the accuracy number is a marketing claim. If they can, you can evaluate whether that blast radius covers your actual failure distribution.
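One way to run that evaluation is to write the vendor's answer down as data and compare it with your incident history. The sketch below is a hypothetical representation: the `BlastRadius` class, the system names, and the category labels are invented for illustration and do not describe any real tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class BlastRadius:
    writable_systems: FrozenSet[str]      # systems the agent may modify autonomously
    write_locked_systems: FrozenSet[str]  # systems the agent must never modify
    failure_categories: FrozenSet[str]    # categories the accuracy claim is based on

    def may_modify(self, system: str) -> bool:
        # Deny by default: a system must be explicitly writable and not locked.
        return (
            system in self.writable_systems
            and system not in self.write_locked_systems
        )

    def coverage(self, incident_categories: List[str]) -> float:
        """Fraction of past incidents that fall inside the claimed categories."""
        if not incident_categories:
            raise ValueError("no incident history provided")
        covered = sum(1 for c in incident_categories if c in self.failure_categories)
        return covered / len(incident_categories)


# Invented example values.
radius = BlastRadius(
    writable_systems=frozenset({"k8s-staging", "k8s-prod"}),
    write_locked_systems=frozenset({"payments-db"}),
    failure_categories=frozenset({"pod_crash", "failed_rollout"}),
)
history = ["pod_crash", "failed_rollout", "dns_outage", "pod_crash"]
print(radius.may_modify("payments-db"), radius.coverage(history))
```

A low coverage number does not mean the tool is bad. It means the published accuracy figure says little about most of the incidents you actually have.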
Question 5: What is the ownership model when it's wrong?
This is the question vendors like least. When the agent makes a bad remediation decision and compounds the incident, who is accountable? The vendor's SLA covers service availability, not the operational consequences of an agent action.
In your environment, the answer should map to your ARO (Agent Reliability Ownership) registration — a named human owner, a defined escalation path, and an audit log of every gate check the agent ran before acting.
Ask any vendor: does your tool generate an audit log of agent decision reasoning before each action? Is that log queryable during incident review? Who owns the agent's behavior in my environment?
If the audit log doesn't exist, you cannot write a complete postmortem after an agent-involved incident. That's the accountability gap that makes autonomous agents unsafe in regulated production environments.
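As an illustration of what a queryable decision audit log might contain, here is a small sketch that writes entries as JSON Lines and filters them during incident review. The `AuditEntry` fields and the helper names are assumptions and do not reflect the ARO registration format or any vendor's log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class AuditEntry:
    agent_id: str
    owner: str                    # named human owner (hypothetical field)
    escalation_path: List[str]    # who gets paged, in order
    action: str                   # what the agent was about to do
    reasoning: str                # the agent's stated reasoning before acting
    gate_checks: Dict[str, bool]  # check name -> passed?
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl(entry: AuditEntry) -> str:
    """Serialize one entry as a single JSON line."""
    return json.dumps(asdict(entry), sort_keys=True)


def failed_gate_entries(log_lines: List[str], agent_id: str) -> List[dict]:
    """Return this agent's entries that had at least one failed gate check."""
    entries = (json.loads(line) for line in log_lines if line.strip())
    return [
        e for e in entries
        if e["agent_id"] == agent_id and not all(e["gate_checks"].values())
    ]


# Invented example data.
log = [
    to_jsonl(AuditEntry("agent-a", "jane.doe", ["on-call", "sre-lead"],
                        "restart pod", "crash loop detected",
                        {"error_budget": True, "aqdd_state": True})),
    to_jsonl(AuditEntry("agent-a", "jane.doe", ["on-call", "sre-lead"],
                        "roll back deploy", "error rate still rising",
                        {"error_budget": False, "aqdd_state": True})),
]
flagged = failed_gate_entries(log, "agent-a")
print(len(flagged), flagged[0]["action"])
```

The point of the sketch is the postmortem query: given an agent and a time window, you should be able to pull every action along with the reasoning and gate results that preceded it.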
The Build vs Buy Decision Matrix
Given these five questions, here's how I'd frame the build-vs-buy decision:
Buy if: Your failure distribution maps closely to the tool's blast radius, you don't need custom SLIs beyond what the vendor provides, and the vendor can answer all five questions with specifics.
Build if: Your failure distribution is broad or novel, you need custom SLIs (DQR, RTD, HER, AQDD are all absent from commercial tools today), or you need to satisfy regulatory requirements that mandate audit trails the vendor doesn't generate.
Hybrid (most realistic): Buy the investigation layer — vendor tools are genuinely good at assembling incident context faster than humans. Build the governance layer — Pre-Action Gates, ARO registration, Semantic Gap detection, Sprawl Registry. The agentsre library is designed for exactly this hybrid.
The Evaluation Scorecard
```python
# agentsre/tool_evaluation.py

import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class ToolEvaluationScore:
    """
    Five-question evaluation scorecard for AI SRE tooling.

    Use this to evaluate commercial tools or internal builds
    against the SLI framework from the agentsre series.

    Each question has two criteria. Score each criterion
    0 (no), 1 (partial), 2 (yes), for a maximum total of 20.
    Total score >= 16: consider for production.
    Total score 10-15: pilot with governance layer built separately.
    Total score < 10: not production-ready for autonomous operation.
    """
    tool_name: str
    evaluator: str
    environment_context: str  # Brief description of your stack

    # Q1: Reasoning layer instrumentation
    tracks_replanning_cycles: int = 0    # 0/1/2
    can_query_decision_sequence: int = 0
    q1_notes: str = ""

    # Q2: HER transparency
    her_in_benchmark_disclosed: int = 0
    autonomous_vs_assisted_split_disclosed: int = 0
    q2_notes: str = ""

    # Q3: Pre-action SLO gate
    checks_error_budget_before_acting: int = 0
    gate_thresholds_configurable: int = 0
    q3_notes: str = ""

    # Q4: Blast radius definition
    blast_radius_explicit: int = 0
    accuracy_claim_scoped_to_blast_radius: int = 0
    q4_notes: str = ""

    # Q5: Ownership and audit
    generates_decision_audit_log: int = 0
    audit_log_queryable_postmortem: int = 0
    q5_notes: str = ""

    # Timestamp is captured in UTC when the scorecard is created.
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def total_score(self) -> int:
        # Sum of all ten criteria; ranges from 0 to 20.
        return (
            self.tracks_replanning_cycles
            + self.can_query_decision_sequence
            + self.her_in_benchmark_disclosed
            + self.autonomous_vs_assisted_split_disclosed
            + self.checks_error_budget_before_acting
            + self.gate_thresholds_configurable
            + self.blast_radius_explicit
            + self.accuracy_claim_scoped_to_blast_radius
            + self.generates_decision_audit_log
            + self.audit_log_queryable_postmortem
        )

    @property
    def recommendation(self) -> str:
        # Thresholds are 80% and 50% of the 20-point maximum.
        if self.total_score >= 16:
            return "CONSIDER: meets production governance bar"
        elif self.total_score >= 10:
            return "PILOT: build governance layer separately before production"
        else:
            return "NOT READY: missing critical governance capabilities"

    def to_report(self) -> Dict[str, Any]:
        """Return the scorecard as a nested dict, grouped by question."""
        return {
            "tool": self.tool_name,
            "evaluator": self.evaluator,
            "environment": self.environment_context,
            "scores": {
                "q1_reasoning_layer": {
                    "tracks_replanning": self.tracks_replanning_cycles,
                    "queryable_decision_sequence": self.can_query_decision_sequence,
                    "notes": self.q1_notes,
                },
                "q2_her_transparency": {
                    "her_disclosed": self.her_in_benchmark_disclosed,
                    "autonomous_split_disclosed": self.autonomous_vs_assisted_split_disclosed,
                    "notes": self.q2_notes,
                },
                "q3_pre_action_gate": {
                    "checks_error_budget": self.checks_error_budget_before_acting,
                    "configurable_thresholds": self.gate_thresholds_configurable,
                    "notes": self.q3_notes,
                },
                "q4_blast_radius": {
                    "explicit_definition": self.blast_radius_explicit,
                    "accuracy_scoped": self.accuracy_claim_scoped_to_blast_radius,
                    "notes": self.q4_notes,
                },
                "q5_audit_ownership": {
                    "audit_log_generated": self.generates_decision_audit_log,
                    "queryable_in_postmortem": self.audit_log_queryable_postmortem,
                    "notes": self.q5_notes,
                },
            },
            "total_score": f"{self.total_score}/20",
            "recommendation": self.recommendation,
            "evaluated_at": self.evaluated_at,
        }

    def to_json(self) -> str:
        """Serialize the report as indented JSON for sharing."""
        return json.dumps(self.to_report(), indent=2)
```

Where This Fits in the Arc
Posts 1–14 built the measurement framework: SLIs for agent output quality, control plane reliability, reasoning observability, context management, ownership governance, semantic gap detection.
Post 15 is the practical payoff — you now have a five-question framework, grounded in production SLIs, to evaluate any AI SRE tool your manager asks you to assess. Whether the answer is buy, build, or hybrid, the framework gives you a defensible, technically grounded recommendation.
The ToolEvaluationScore dataclass is in agentsre/tool_evaluation.py. Use it to document your evaluation and generate a report you can share with your team.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre | dev.to/ajaydevineni
