<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajay Devineni</title>
    <description>The latest articles on DEV Community by Ajay Devineni (@ajaydevineni).</description>
    <link>https://dev.to/ajaydevineni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862822%2Fddbc52cd-519d-4344-bea2-effb2a513786.png</url>
      <title>DEV Community: Ajay Devineni</title>
      <link>https://dev.to/ajaydevineni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajaydevineni"/>
    <language>en</language>
    <item>
      <title>Google Published Their AI SRE Blueprint. Here's the Line-by-Line Mapping to What the Community Has Been Building</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 09 Jun 2026 01:26:43 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/google-published-their-ai-sre-blueprint-heres-the-line-by-line-mapping-to-what-the-community-has-4ff</link>
      <guid>https://dev.to/ajaydevineni/google-published-their-ai-sre-blueprint-heres-the-line-by-line-mapping-to-what-the-community-has-4ff</guid>
      <description>&lt;p&gt;Google published a white paper on May 28 that every SRE should read.&lt;br&gt;
It details how they're architecting a new foundation for reliability with three core components: AI Operator (autonomous mitigation agents), Actus (strict execution guardrails), and IRM Analyzer (continuous evaluation pipelines grounded in human operational memory). The goal: safely govern high-velocity agentic software development at Google's scale. (Source: Rootly)&lt;br&gt;
I've been building toward the same architecture from the ground up for a couple of months, not inside Google but as an independent practitioner trying to solve the same problem for teams that don't have Google's infrastructure or runway.&lt;br&gt;
Reading the whitepaper, I found that every component Google named maps directly to something already in the agentsre library or this series. This post maps them side by side.&lt;br&gt;
Google's Actus → Pre-Action SRE Gate&lt;br&gt;
Actus is Google's physical execution control plane for safe autonomous mitigation: it bounds what an agent can do in production with strict policy enforcement before any action executes. (Source: Rootly)&lt;br&gt;
That's exactly what the Pre-Action SRE Gate does. It runs three checks before any autonomous action: error budget remaining (does the system have headroom?), AQDD state (can humans course-correct if this goes wrong?), and HER trend (is this agent already outside its reliable envelope?). If any check fails, the agent escalates and does not act.&lt;br&gt;
Google built Actus at the infrastructure level for internal systems. The Pre-Action Gate is the same pattern, implemented with Lambda, CloudWatch, and DynamoDB so that any AWS team can deploy it this week.&lt;br&gt;
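&lt;/p&gt;

&lt;p&gt;The three checks can be sketched in a few lines of plain Python. This is a minimal illustration, not the agentsre implementation: GateInputs, pre_action_gate, the boolean stand-in for AQDD state, and both thresholds are assumptions made for the example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    """Hypothetical snapshot of the three signals the gate checks."""
    error_budget_remaining_pct: float  # share of the SLO error budget still unspent
    humans_can_intervene: bool         # simplified stand-in for the AQDD state
    recent_her_pct: float              # agent's recent Human Escalation Rate


def pre_action_gate(inputs, min_budget_pct=20.0, max_her_pct=10.0):
    """Return ("act", []) only if every check passes, else ("escalate", reasons).

    Both thresholds are illustrative defaults, not recommended values.
    """
    reasons = []
    if min_budget_pct > inputs.error_budget_remaining_pct:
        reasons.append("error budget below threshold")
    if not inputs.humans_can_intervene:
        reasons.append("humans cannot course-correct right now")
    if inputs.recent_her_pct > max_her_pct:
        reasons.append("agent escalation rate above its reliable envelope")
    return ("escalate", reasons) if reasons else ("act", [])


healthy = GateInputs(error_budget_remaining_pct=60.0, humans_can_intervene=True, recent_her_pct=3.0)
burning = GateInputs(error_budget_remaining_pct=5.0, humans_can_intervene=True, recent_her_pct=3.0)

print(pre_action_gate(healthy))
print(pre_action_gate(burning))
```

&lt;p&gt;The gate collects every failed check rather than stopping at the first one, so the escalation message tells the on-call engineer everything that blocked the action.&lt;br&gt;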
Google's IRM Analyzer → DQR + RTD&lt;br&gt;
IRM Analyzer is Google's continuous evaluation pipeline that captures human operational memory and runs nightly evaluations to prove agent readiness before deployment and during operation. (Source: Rootly)&lt;br&gt;
Two metrics from this series do the same work:&lt;br&gt;
DQR (Decision Quality Rate) — is the agent's output correct? Measured continuously, not just at deployment.&lt;br&gt;
RTD (Reasoning Trace Depth) — is the agent's reasoning stable? Re-planning cycles per task. Rises before DQR falls.&lt;br&gt;
Google runs nightly evals against a corpus of human-validated incidents. For teams without that corpus, DQR and RTD measured in 30-day shadow mode are the approximation that's achievable without Google's internal incident database.&lt;br&gt;
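&lt;/p&gt;

&lt;p&gt;As a rough sketch of how the two metrics could be computed from shadow-mode records: the ShadowTask record and both function names below are hypothetical, and the sample data is invented to show RTD rising while DQR stays flat.&lt;/p&gt;

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ShadowTask:
    """Hypothetical record of one shadow-mode task."""
    decision_correct: bool   # did a human reviewer agree with the agent's decision?
    replanning_cycles: int   # how many times the agent changed its plan


def decision_quality_rate(tasks):
    """DQR as the percentage of tasks whose decision was judged correct."""
    return 100.0 * sum(t.decision_correct for t in tasks) / len(tasks)


def reasoning_trace_depth(tasks):
    """RTD as the mean number of re-planning cycles per task."""
    return mean(t.replanning_cycles for t in tasks)


# Invented sample data: same DQR in both weeks, but more re-planning in week 2.
week_one = [ShadowTask(True, 1), ShadowTask(True, 0), ShadowTask(True, 1), ShadowTask(False, 2)]
week_two = [ShadowTask(True, 3), ShadowTask(True, 2), ShadowTask(True, 4), ShadowTask(False, 3)]

for label, tasks in (("week 1", week_one), ("week 2", week_two)):
    print(label, "DQR:", decision_quality_rate(tasks), "RTD:", reasoning_trace_depth(tasks))
```

&lt;p&gt;Both functions assume a non-empty task list; a real collector would need to handle an empty window explicitly.&lt;br&gt;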
Google's AI Operator → The agent that needs ARO&lt;br&gt;
Google SRE has AI agents that continuously monitor and improve playbooks and production documentation based on their usage during incidents. AI agents can also generate new playbooks from incidents. (Source: Nova AI Ops)&lt;br&gt;
This is AI Operator in action. And it's exactly the class of agent that needs Agent Reliability Ownership (ARO) registration — a named owner, a defined blast radius, and an escalation path — before it starts writing to production documentation.&lt;br&gt;
An agent that can modify runbooks is an agent that can corrupt the guidance every human SRE relies on during an incident. Blast radius definition isn't optional for that class of agent. It's the most important governance artifact you have.&lt;br&gt;
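&lt;/p&gt;

&lt;p&gt;One way to picture an ARO registration is as a small record that is checked before every write. The sketch below is an assumption-laden illustration: AgentRegistration, REGISTRY, register, and may_write are hypothetical names, and the agent, owner, and target values are made up.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRegistration:
    """Hypothetical ARO record: who owns the agent and what it may touch."""
    agent_name: str
    owner: str                 # a named person, not a team alias
    escalation_path: tuple     # ordered contacts to page
    writable_targets: frozenset = field(default_factory=frozenset)  # the blast radius


REGISTRY = {}


def register(reg):
    """Refuse registrations that leave ownership or escalation undefined."""
    if not reg.owner.strip():
        raise ValueError("an agent needs a named owner")
    if not reg.escalation_path:
        raise ValueError("an agent needs an escalation path")
    REGISTRY[reg.agent_name] = reg


def may_write(agent_name, target):
    """Deny unregistered agents and any target outside the declared blast radius."""
    reg = REGISTRY.get(agent_name)
    return reg is not None and target in reg.writable_targets


register(AgentRegistration(
    agent_name="playbook-editor",
    owner="jane.doe",
    escalation_path=("jane.doe", "sre-oncall"),
    writable_targets=frozenset({"docs/playbooks"}),
))

print(may_write("playbook-editor", "docs/playbooks"))
print(may_write("playbook-editor", "prod/database"))
print(may_write("unregistered-agent", "docs/playbooks"))
```

&lt;p&gt;The default is deny: an agent that was never registered has no blast radius at all, which is the behavior you want for agents that appear outside the platform team's view.&lt;br&gt;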
The gap Google doesn't address — fleet governance&lt;br&gt;
Google's whitepaper covers individual agent governance well. What it doesn't cover — because at Google's scale it's a different problem — is fleet-level governance for teams where engineers are deploying their own agent workflows alongside platform-deployed agents.&lt;br&gt;
That's the Agent Sprawl problem from Post 6. The Sprawl Registry and Postmortem Readiness Rate (PRR) from Post 12 address the fleet-level governance gap that Google's architecture assumes away.&lt;br&gt;
What this means for your team&lt;br&gt;
AI SRE technology is arriving faster than the trust frameworks needed to deploy it safely. (Source: Sherlocks AI)&lt;br&gt;
Google just published the trust framework for their environment. The agentsre library is the open-source implementation of the same framework for everyone else.&lt;br&gt;
The three components that matter most to implement first, in order:&lt;br&gt;
Start with Pre-Action Gate (Actus equivalent) — because an ungated agent is a liability before it's an asset.&lt;br&gt;
Add DQR + RTD monitoring (IRM Analyzer equivalent) — because you can't evaluate what you don't measure.&lt;br&gt;
Register every agent in ARO + Sprawl Registry (AI Operator governance) — because you can't own what you haven't named.&lt;br&gt;
The whitepaper is at sre.google. The library is at github.com/Ajay150313/agentsre.&lt;br&gt;
What component is your team missing most right now?&lt;br&gt;
Ajay Devineni | AWS Community Builder | IEEE Senior Member | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Evaluate Any AI SRE Tool: A Practitioner's Framework Built From 15 Posts of Production SLIs</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Thu, 04 Jun 2026 01:17:39 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/how-to-evaluate-any-ai-sre-tool-a-practitioners-framework-built-from-15-posts-of-production-slis-32ml</link>
      <guid>https://dev.to/ajaydevineni/how-to-evaluate-any-ai-sre-tool-a-practitioners-framework-built-from-15-posts-of-production-slis-32ml</guid>
      <description>&lt;p&gt;Title: How to Evaluate Any AI SRE Tool — A Practitioner's Framework Built From 15 Posts of Production SLIs&lt;br&gt;
Your manager just forwarded you a Gartner report. Analyst recognition of the AI SRE category, sustained on-call pressure, immature trust and governance frameworks, and the need for orchestration rather than disconnected agent experiments all arrived together in 2026. The question landing in every SRE team's backlog right now is: should we buy something, build something, or wait? (Source: Sherlocks)&lt;br&gt;
I've spent four months building the measurement layer for AI agents from scratch — DQR, TIE, HER, AQDD, RTD, CUR, Pre-Action Gate, Semantic Gap detection. Fifteen posts, an open-source library, and a growing arXiv paper. This post is where that work becomes a vendor evaluation framework.&lt;br&gt;
Every claim in this framework maps to a metric I've already defined. You can verify these against any tool — commercial or open-source.&lt;br&gt;
The Problem With Vendor Benchmarks&lt;br&gt;
Datadog's Bits AI SRE decreases time to resolution by up to 95%. New Relic's users resolved incidents 25% faster than those without AI features. Both numbers are published. Both are real, in the environments they measured. (Sources: Nova AI Ops, InfoQ)&lt;br&gt;
The question is whether those environments match yours. A 95% MTTR improvement measured on a system with clean telemetry, well-structured runbooks, and narrow incident categories is a different number than what you'll see in a system with fragmented observability, complex dependency graphs, and novel failure modes.&lt;br&gt;
Vendor benchmarks measure the tool in optimal conditions. Your evaluation needs to measure the tool in your conditions. These five questions give you the framework.&lt;br&gt;
Question 1: Does it instrument the reasoning layer?&lt;br&gt;
The semantic gap — the space between what an agent intended and what it executed — is invisible to infrastructure APM. I wrote about this last week using Sherlocks.ai's research: existing tools observe high-level intent or low-level actions, not the correlation between them.&lt;br&gt;
Ask any vendor: do you track re-planning cycles per task? Can I see how many times the agent changed its approach before completing or escalating? Can I query that history after an incident?&lt;br&gt;
If the answer is "we log prompts and tool calls," that's Layer 1 observability. Useful, necessary, insufficient. You need Layer 3 — one structured record per agent task showing the full decision sequence.&lt;br&gt;
What to look for in a demo: ask them to show you a failed task trace. Does it show you the sequence of re-planning decisions, or just the final outcome and the spans?&lt;br&gt;
Question 2: What is the Human Escalation Rate in their benchmark?&lt;br&gt;
HER — the fraction of agent decisions that escalated to human judgment — is the most honest single metric for how autonomous a tool actually is. A low MTTR number paired with a high HER means humans were doing most of the resolution work, faster because the agent assembled context for them. That's valuable. It's not the same as autonomous remediation.&lt;br&gt;
Ask: in your benchmark environment, what percentage of incidents did the agent resolve without human action? What percentage required human approval before execution? What triggered escalation most often?&lt;br&gt;
These questions reveal whether the tool is an autonomous remediator or a very good assistant. Both are legitimate. Only one of them matches the vendor's headline claim.&lt;br&gt;
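&lt;/p&gt;

&lt;p&gt;To make the split concrete, here is a small sketch that turns a list of incident outcomes into the percentages those questions ask for. The three outcome labels and the sample data are invented for illustration, and the calculation is done per incident rather than per agent decision.&lt;/p&gt;

```python
from collections import Counter

# Hypothetical incident outcomes from a vendor benchmark or your own pilot.
# "autonomous": agent resolved the incident without human action
# "approved":   agent acted only after human approval
# "escalated":  agent handed the incident to a human
outcomes = ["autonomous", "approved", "escalated", "escalated", "approved",
            "autonomous", "escalated", "approved", "approved", "escalated"]


def resolution_split(outcomes):
    """Return each outcome's share of all incidents, as a percentage."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {kind: 100.0 * counts[kind] / total
            for kind in ("autonomous", "approved", "escalated")}


split = resolution_split(outcomes)
human_involved = split["approved"] + split["escalated"]

print(split)
print("human involved in", human_involved, "percent of incidents")
```

&lt;p&gt;A headline MTTR figure reads very differently once it sits next to the share of incidents where a human was still in the loop.&lt;br&gt;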
Question 3: Does it check SLO state before acting?&lt;br&gt;
An agent that remediates without checking your current error budget can compound a degraded situation. I formalized this in the Pre-Action SRE Gate (Post 13): three checks before any autonomous action — error budget remaining, AQDD state, and the agent's own HER trend.&lt;br&gt;
Ask any vendor: does your agent check SLO error budget before executing a remediation? What happens if the error budget is critically low — does it act anyway or escalate? Can I configure the pre-action gate thresholds?&lt;br&gt;
A tool that doesn't have an answer to this question is not safe for production systems where the error budget is already burning.&lt;br&gt;
Question 4: What is the defined blast radius per agent?&lt;br&gt;
Komodor's Klaudia is trained specifically on pod crashes, failed rollouts, autoscaler friction, misconfigurations, and cascading failures in Kubernetes environments. That specificity is its blast radius. 95% accuracy in that domain does not mean 95% accuracy outside it. (Source: Yisusvii)&lt;br&gt;
Every AI SRE tool has an implicit blast radius — the set of systems and failure modes it was trained and tested on. Good tools make this explicit. Ask: what systems can this agent modify autonomously? What systems are write-locked? What failure categories is the accuracy claim based on?&lt;br&gt;
If the vendor can't give you a concrete blast radius definition, the accuracy number is a marketing claim. If they can, you can evaluate whether that blast radius covers your actual failure distribution.&lt;br&gt;
Question 5: What is the ownership model when it's wrong?&lt;br&gt;
This is the question vendors like least. When the agent makes a bad remediation decision and compounds the incident, who is accountable? The vendor's SLA covers service availability, not the operational consequences of an agent action.&lt;br&gt;
In your environment, the answer should map to your ARO (Agent Reliability Ownership) registration — a named human owner, a defined escalation path, and an audit log of every gate check the agent ran before acting.&lt;br&gt;
Ask any vendor: does your tool generate an audit log of agent decision reasoning before each action? Is that log queryable during incident review? Who owns the agent's behavior in my environment?&lt;br&gt;
If the audit log doesn't exist, you cannot write a complete postmortem after an agent-involved incident. That's the accountability gap that makes autonomous agents unsafe in regulated production environments.&lt;br&gt;
The Build vs Buy Decision Matrix&lt;br&gt;
Given these five questions, here's how I'd frame the build-vs-buy decision:&lt;br&gt;
Buy if: Your failure distribution maps closely to the tool's blast radius, you don't need custom SLIs beyond what the vendor provides, and the vendor can answer all five questions with specifics.&lt;br&gt;
Build if: Your failure distribution is broad or novel, you need custom SLIs (DQR, RTD, HER, AQDD are all absent from commercial tools today), or you need to satisfy regulatory requirements that mandate audit trails the vendor doesn't generate.&lt;br&gt;
Hybrid (most realistic): Buy the investigation layer — vendor tools are genuinely good at assembling incident context faster than humans. Build the governance layer — Pre-Action Gates, ARO registration, Semantic Gap detection, Sprawl Registry. The agentsre library is designed for exactly this hybrid.&lt;br&gt;
The Evaluation Scorecard&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agentsre/tool_evaluation.py

import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class ToolEvaluationScore:
    """
    Five-question evaluation scorecard for AI SRE tooling.

    Use this to evaluate commercial tools or internal builds
    against the SLI framework from the agentsre series.

    Score each criterion (two per question) 0 (no), 1 (partial), 2 (yes).
    The maximum total is 20.
    Total score &amp;gt;= 8: consider for production.
    Total score 5-7: pilot with governance layer built separately.
    Total score &amp;lt; 5: not production-ready for autonomous operation.
    """
    tool_name: str
    evaluator: str
    environment_context: str  # Brief description of your stack

    # Q1: Reasoning layer instrumentation
    tracks_replanning_cycles: int = 0    # 0/1/2
    can_query_decision_sequence: int = 0
    q1_notes: str = ""

    # Q2: HER transparency
    her_in_benchmark_disclosed: int = 0
    autonomous_vs_assisted_split_disclosed: int = 0
    q2_notes: str = ""

    # Q3: Pre-action SLO gate
    checks_error_budget_before_acting: int = 0
    gate_thresholds_configurable: int = 0
    q3_notes: str = ""

    # Q4: Blast radius definition
    blast_radius_explicit: int = 0
    accuracy_claim_scoped_to_blast_radius: int = 0
    q4_notes: str = ""

    # Q5: Ownership and audit
    generates_decision_audit_log: int = 0
    audit_log_queryable_postmortem: int = 0
    q5_notes: str = ""

    # Timestamp of the evaluation, in UTC ISO 8601 format
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def total_score(self) -&amp;gt; int:
        """Sum of all ten criteria (0-20)."""
        return (
            self.tracks_replanning_cycles +
            self.can_query_decision_sequence +
            self.her_in_benchmark_disclosed +
            self.autonomous_vs_assisted_split_disclosed +
            self.checks_error_budget_before_acting +
            self.gate_thresholds_configurable +
            self.blast_radius_explicit +
            self.accuracy_claim_scoped_to_blast_radius +
            self.generates_decision_audit_log +
            self.audit_log_queryable_postmortem
        )

    @property
    def recommendation(self) -&amp;gt; str:
        if self.total_score &amp;gt;= 8:
            return "CONSIDER: meets production governance bar"
        elif self.total_score &amp;gt;= 5:
            return "PILOT: build governance layer separately before production"
        else:
            return "NOT READY: missing critical governance capabilities"

    def to_report(self) -&amp;gt; Dict:
        """Return the evaluation as a nested, JSON-serializable dict."""
        return {
            "tool": self.tool_name,
            "evaluator": self.evaluator,
            "environment": self.environment_context,
            "scores": {
                "q1_reasoning_layer": {
                    "tracks_replanning": self.tracks_replanning_cycles,
                    "queryable_decision_sequence": self.can_query_decision_sequence,
                    "notes": self.q1_notes
                },
                "q2_her_transparency": {
                    "her_disclosed": self.her_in_benchmark_disclosed,
                    "autonomous_split_disclosed": self.autonomous_vs_assisted_split_disclosed,
                    "notes": self.q2_notes
                },
                "q3_pre_action_gate": {
                    "checks_error_budget": self.checks_error_budget_before_acting,
                    "configurable_thresholds": self.gate_thresholds_configurable,
                    "notes": self.q3_notes
                },
                "q4_blast_radius": {
                    "explicit_definition": self.blast_radius_explicit,
                    "accuracy_scoped": self.accuracy_claim_scoped_to_blast_radius,
                    "notes": self.q4_notes
                },
                "q5_audit_ownership": {
                    "audit_log_generated": self.generates_decision_audit_log,
                    "queryable_in_postmortem": self.audit_log_queryable_postmortem,
                    "notes": self.q5_notes
                }
            },
            "total_score": f"{self.total_score}/20",
            "recommendation": self.recommendation,
            "evaluated_at": self.evaluated_at
        }

    def to_json(self) -&amp;gt; str:
        return json.dumps(self.to_report(), indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where This Fits in the Arc&lt;br&gt;
Posts 1–14 built the measurement framework: SLIs for agent output quality, control plane reliability, reasoning observability, context management, ownership governance, semantic gap detection.&lt;br&gt;
Post 15 is the practical payoff — you now have a five-question framework, grounded in production SLIs, to evaluate any AI SRE tool your manager asks you to assess. Whether the answer is buy, build, or hybrid, the framework gives you a defensible, technically grounded recommendation.&lt;br&gt;
The ToolEvaluationScore dataclass is in agentsre/tool_evaluation.py. Use it to document your evaluation and generate a report you can share with your team.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre | dev.to/ajaydevineni&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650n1vy2sob2h1wdqhlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650n1vy2sob2h1wdqhlb.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>The AI Pilot-to-Production Gap Is an SRE Problem, and We Already Know How to Close It</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Wed, 03 Jun 2026 02:01:55 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-ai-pilot-to-production-gap-is-an-sre-problem-and-we-already-know-how-to-close-it-50el</link>
      <guid>https://dev.to/ajaydevineni/the-ai-pilot-to-production-gap-is-an-sre-problem-and-we-already-know-how-to-close-it-50el</guid>
      <description>&lt;p&gt;A startup raised $50M this week to help companies move AI out of stalled pilots and into production. Investors called it "the defining gap of 2026." Salesforce published that "getting agents to run reliably in production" is the common thread behind every significant AI engineering breakthrough this year.&lt;/p&gt;

&lt;p&gt;Both are right about the problem. Neither named the solution.&lt;/p&gt;

&lt;p&gt;The AI pilot-to-production gap is not a new kind of problem. It is a very old kind of problem wearing a new coat. The SRE discipline has been closing this exact gap — for distributed systems, for microservices, for Kubernetes — for two decades. The tools exist. The frameworks are documented. What's missing is the organizational willingness to apply them to AI before the first production incident instead of after.&lt;/p&gt;

&lt;p&gt;This article is about what that actually looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Pilots Feel Production-Ready (And Aren't)
&lt;/h2&gt;

&lt;p&gt;An AI agent demo in a sandbox is a controlled environment. The data is clean. The tools respond predictably. The task volume is low. The team running the demo knows the system well enough to guide it toward success.&lt;/p&gt;

&lt;p&gt;Production is different in every way that matters:&lt;/p&gt;

&lt;p&gt;Real data has edge cases the sandbox never saw. Tools fail, return ambiguous responses, or change their APIs. Task volume spikes at the worst possible time. The team running the system during an incident at 2am is not the team that built the demo.&lt;/p&gt;

&lt;p&gt;The gap between those two environments is not an AI problem. It is a reliability engineering problem. And it has a well-known set of solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Things Most Teams Skip
&lt;/h2&gt;

&lt;p&gt;After studying numerous production AI agent deployments across regulated industries, I've identified three reliability discipline components that are almost universally absent when a pilot fails to survive contact with production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. An SLO defined before go-live
&lt;/h3&gt;

&lt;p&gt;The single most common failure mode in AI pilot-to-production transitions is deploying without defined success criteria.&lt;/p&gt;

&lt;p&gt;What does reliable operation look like for this agent? What is the acceptable escalation rate? The acceptable decision quality drift? The acceptable tool invocation efficiency?&lt;/p&gt;

&lt;p&gt;These are the agent's SLIs. Without defining them before deployment, there is no way to know whether the agent is performing within acceptable bounds — until a user reports a problem.&lt;/p&gt;

&lt;p&gt;In traditional SRE practice, you don't ship a service without an SLO. The agent is a service. The same rule applies.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskRecord&lt;/span&gt;

&lt;span class="c1"&gt;# Define these BEFORE go-live, not after the first incident
&lt;/span&gt;&lt;span class="n"&gt;SLO_TARGETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision_quality_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# DQR: % decisions within behavioral bounds
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_invocation_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# TIE: max drift from baseline (multiplier)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_escalation_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# HER: % tasks requiring human intervention
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# After each task:
&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TaskRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actual_tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;decision_confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_confidence_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;required_escalation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_needed_human&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Check breach against pre-defined SLO
&lt;/span&gt;&lt;span class="n"&gt;breaches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;breached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;breaches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;breaches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;alert_oncall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alert_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. A named owner assigned before go-live
&lt;/h3&gt;

&lt;p&gt;"The AI team owns it" is not an ownership model. It is a responsibility diffusion pattern. When an AI agent degrades at 2am, "the AI team" does not have a pager.&lt;/p&gt;

&lt;p&gt;Before any AI agent goes to production, one named person must be assigned as the agent's Service Reliability Owner. That person:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives the page when the agent's SLO breaches&lt;/li&gt;
&lt;li&gt;Owns the runbook for known failure modes&lt;/li&gt;
&lt;li&gt;Reviews the agent's SLI report weekly&lt;/li&gt;
&lt;li&gt;Approves any change to the agent's autonomous permission scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same accountability model that applies to every production microservice. The agent is not exempt because it's AI. The agent is not exempt because it's new. The exception is never justified in SRE practice, and it shouldn't be here.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A runbook written before go-live
&lt;/h3&gt;

&lt;p&gt;A runbook for an AI agent does not need to be long. It needs to answer four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection:&lt;/strong&gt; Which metric tells you the agent is degrading? (Answer: whichever of DQR, TIE, HER, or AQDD breaches first — not latency or error rate, which won't surface semantic failures)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attribution:&lt;/strong&gt; How do you determine whether the degradation is the agent's behavior, the tools it's calling, or a code change in the agent's environment? (Answer: compare against pre-deployment behavioral baselines)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment:&lt;/strong&gt; What is the fastest path to reducing blast radius while you investigate? (Answer: the progressive autonomy constraint ladder — reduce permissions level by level, don't binary-kill the agent)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery:&lt;/strong&gt; What does returning to normal operation look like, and how do you know you're there? (Answer: SLI metrics returning to within 10% of pre-incident baselines for 30 consecutive minutes)&lt;/p&gt;

&lt;p&gt;Two hours to write. Six hours saved on the first incident.&lt;/p&gt;
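
&lt;p&gt;The recovery criterion in the runbook (within 10% of baseline for 30 consecutive minutes) is simple enough to automate. The sketch below is one possible reading of it, assuming one SLI sample per minute; the function name, its parameters, and the sample values are all hypothetical.&lt;/p&gt;

```python
def recovered(samples, baseline, tolerance=0.10, required_minutes=30):
    """Return True when the latest `required_minutes` samples are all within
    `tolerance` (a fraction) of `baseline`.

    Assumes one sample per minute, ordered oldest to newest.
    """
    if required_minutes > len(samples):
        return False  # not enough history yet to declare recovery
    allowed_deviation = tolerance * abs(baseline)
    recent = samples[-required_minutes:]
    return all(allowed_deviation >= abs(value - baseline) for value in recent)


# Invented example: DQR baseline of 90, a dip to 70, then a return to 88.
baseline_dqr = 90.0
still_degraded = [70.0] * 11 + [88.0] * 29   # only 29 good minutes so far
back_to_normal = [70.0] * 10 + [88.0] * 30   # 30 consecutive good minutes

print(recovered(still_degraded, baseline_dqr))
print(recovered(back_to_normal, baseline_dqr))
```

&lt;p&gt;Requiring the whole window to be in bounds, rather than an average, keeps a single good spike from closing the incident early.&lt;/p&gt;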




&lt;h2&gt;
  
  
  What the $50M Is Actually Buying
&lt;/h2&gt;

&lt;p&gt;The startup that raised $50M to close the pilot-to-production gap is selling tooling that helps teams implement governance, monitoring, and reliability structures for AI deployments.&lt;/p&gt;

&lt;p&gt;The governance, monitoring, and reliability structures themselves are not new. They are SRE. They are documented. They are open-source.&lt;/p&gt;

&lt;p&gt;What the money buys is the product layer that makes it easier for teams without SRE expertise to apply them. That's a legitimate service. But for teams with SRE expertise, the foundations are already there.&lt;/p&gt;

&lt;p&gt;Instrument your agent's behavioral SLIs. Define targets before deployment. Assign a named owner. Write the runbook. Run a tabletop exercise for your top two failure scenarios before go-live.&lt;/p&gt;

&lt;p&gt;That is the pilot-to-production gap, closed. Not with $50M. With process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern That Keeps Repeating
&lt;/h2&gt;

&lt;p&gt;The SRE community has seen this pattern before.&lt;/p&gt;

&lt;p&gt;Microservices: teams deployed distributed services without SLOs or ownership models. Incidents happened. The SRE discipline developed the governance layer and production stabilized.&lt;/p&gt;

&lt;p&gt;Kubernetes: teams deployed container orchestration without runbooks or blast radius models. Incidents happened. The SRE discipline developed the governance layer and production stabilized.&lt;/p&gt;

&lt;p&gt;AI agents: teams are deploying autonomous systems without SLOs, owners, or runbooks. Incidents are happening. The SRE discipline has the governance layer ready.&lt;/p&gt;

&lt;p&gt;The question is whether teams apply it before or after the incidents.&lt;/p&gt;

&lt;p&gt;Salesforce is right that the biggest 2026 AI engineering breakthroughs revolve around production reliability. Every one of those breakthroughs will, on inspection, be a form of SRE discipline applied to a new layer of the stack.&lt;/p&gt;

&lt;p&gt;It was always this. It is this now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Before your next AI agent goes to production, answer these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are this agent's SLIs and SLO targets?&lt;/li&gt;
&lt;li&gt;Who is the named owner whose pager fires when the SLO breaches?&lt;/li&gt;
&lt;li&gt;What does the runbook say for the top two failure modes?&lt;/li&gt;
&lt;li&gt;What is the blast radius if the agent makes a wrong autonomous decision?&lt;/li&gt;
&lt;li&gt;Have you run a tabletop exercise for the 2am incident scenario?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any answer is "we haven't figured that out yet" — the agent is not production-ready. It is demo-ready.&lt;/p&gt;
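&lt;p&gt;The five questions above can be turned into a simple gate. The sketch below is illustrative only: the &lt;code&gt;ReadinessChecklist&lt;/code&gt; class, its field names, and the &lt;code&gt;readiness&lt;/code&gt; function are hypothetical names chosen for this example and are not part of the agentsre repository.&lt;/p&gt;

```python
from dataclasses import dataclass, fields
from typing import List, Tuple


@dataclass
class ReadinessChecklist:
    """Hypothetical record of the five pre-production answers."""
    slis_and_slo_targets_defined: bool
    named_owner_on_pager: bool
    runbook_covers_top_two_failure_modes: bool
    blast_radius_documented: bool
    tabletop_exercise_run: bool


def readiness(checklist: ReadinessChecklist) -> Tuple[str, List[str]]:
    """Return a verdict plus the names of any unanswered questions."""
    missing = [f.name for f in fields(checklist) if not getattr(checklist, f.name)]
    verdict = "demo-ready" if missing else "production-ready"
    return verdict, missing


if __name__ == "__main__":
    # Blast radius has not been worked out yet, so the agent stays demo-ready.
    answers = ReadinessChecklist(True, True, True, False, True)
    print(readiness(answers))
```

&lt;p&gt;A single missing answer is enough to return "demo-ready", which mirrors the rule stated above.&lt;/p&gt;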

&lt;p&gt;Open-source SLI framework: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LinkedIn discussion: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7467310392701198336-ckGw/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7467310392701198336-ckGw/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In your experience, which reliability discipline do most teams skip when moving AI agents to production?&lt;/p&gt;

</description>
      <category>sre</category>
      <category>aws</category>
      <category>devops</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>The Semantic Gap: Why Your APM Sees the Agent But Misses the Decision, and What RTD Does About It</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:22:52 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-semantic-gap-why-your-apm-sees-the-agent-but-misses-the-decision-and-what-rtd-does-about-it-3lc7</link>
      <guid>https://dev.to/ajaydevineni/the-semantic-gap-why-your-apm-sees-the-agent-but-misses-the-decision-and-what-rtd-does-about-it-3lc7</guid>
      <description>&lt;p&gt;Sherlocks.ai published something yesterday that names a problem precisely.&lt;br&gt;
The core problem: traditional APM was built for synchronous request-response. Agents break that model entirely, and most observability platforms are stitching together legacy APM rather than observing agents as a distinct thing. If your observability stack cannot correlate an agent's intended action with what actually happened at the system level, you are flying blind through the exact moments when cost and risk concentrate. (Source: Sherlocks AI)&lt;br&gt;
They call it the semantic gap. I've been building toward this from a different direction across this series, starting with RTD (Reasoning Trace Depth) in Post 11 and the Pre-Action SRE Gate in Post 13. This post is where those frameworks connect to the industry's emerging framing.&lt;br&gt;
What the Semantic Gap Actually Is&lt;br&gt;
Existing tools observe an agent's high-level intent — prompts, tool selections — or its low-level actions — system calls, API hits, latency. They do not correlate both views. You can see the LLM prompt and you can see the system call, but you cannot see whether the agent intended that exact action or reasoned its way to something unexpected. When failure happens, this gap becomes your investigation crater. (Source: Sherlocks AI)&lt;br&gt;
The gap lives in the decision sequence — what happened between the prompt and the system call. Every re-plan, every tool evaluation, every "this result doesn't match what I expected so I'll try differently" — all of that is invisible to APM because APM instruments execution, not reasoning.&lt;br&gt;
Five percent of AI model requests fail in production today. Roughly sixty percent of those are capacity-related, not model errors. Which means the majority of production failures aren't the model doing something wrong. They're the infrastructure around the model — tool availability, API response times, token budget, context state — creating conditions the agent can't navigate cleanly. And your observability stack is optimized to catch model errors. (Source: Sherlocks AI)&lt;br&gt;
You're instrumented for the minority failure mode.&lt;br&gt;
How RTD Closes the Semantic Gap&lt;br&gt;
Reasoning Trace Depth is a single structured log entry per agent task — not per tool call. It captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the agent planned to do initially&lt;/li&gt;
&lt;li&gt;Every re-plan event: why, which tool triggered it, what the new plan was&lt;/li&gt;
&lt;li&gt;How many cycles before completion or escalation&lt;/li&gt;
&lt;li&gt;Whether HER fired at the end&lt;/li&gt;
&lt;/ul&gt;
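&lt;p&gt;To make the shape of that record concrete, here is a minimal sketch of one RTD entry serialized as a single JSON line per task. The class names, field names, and the "rtd" trace type are assumptions made for illustration; they are not taken from the agentsre codebase.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ReplanEvent:
    reason: str          # why the agent re-planned
    trigger_tool: str    # which tool result triggered the re-plan
    new_plan: str        # what the agent decided to do instead


@dataclass
class RTDRecord:
    task_id: str
    initial_plan: str
    replans: List[ReplanEvent] = field(default_factory=list)
    her_fired: bool = False   # did the task end in a human escalation?

    @property
    def depth(self) -> int:
        """Number of re-plan cycles before completion or escalation."""
        return len(self.replans)

    def to_log_entry(self) -> str:
        """Serialize to one JSON line: one entry per task, not per tool call."""
        entry = asdict(self)
        entry["trace_type"] = "rtd"   # hypothetical label for this record type
        entry["rtd"] = self.depth
        return json.dumps(entry)


if __name__ == "__main__":
    record = RTDRecord(task_id="task-1", initial_plan="restart the failing worker")
    record.replans.append(
        ReplanEvent("tool returned stale data", "metrics_api", "re-query metrics")
    )
    print(record.to_log_entry())
```

&lt;p&gt;Because the whole decision sequence lives in one entry, a postmortem can read the re-plan reasons in order without stitching together individual spans.&lt;/p&gt;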

&lt;p&gt;That record is the intent-to-action correlation layer. It sits above your OTel spans (low-level execution) and below your business metrics (outcome). It's the semantic layer that connects "agent received this task" to "here's exactly how the decision sequence played out."&lt;br&gt;
Without RTD, your investigation after a production failure looks like this: agent ran, spans look clean, outcome was bad, no idea what the agent decided between the tool calls.&lt;br&gt;
With RTD, it looks like this: agent re-planned 4 times, tool 3 returned stale data on every attempt, HER fired at re-plan 5, here is the full decision sequence with timestamps.&lt;br&gt;
That second version is a postmortem. The first is a guess.&lt;br&gt;
What the Market Is Getting Right and Missing&lt;br&gt;
Fifteen tools actively compete on agent observability in 2026, most built on OpenTelemetry standards. The critical test for any of them: does it handle reasoning loops as a first-class concern? Can you see the decision tree — prompt, tool choice, outcome, next decision — as a continuous trace? Does it distinguish between a tool failure and an agent misunderstanding? Does it alert on semantic drift, where agent behavior changes but metrics look normal? (Source: Sherlocks AI)&lt;br&gt;
Those are the right questions. Most tools fail at least two of them because they were designed as APM add-ons, not as reasoning-native observability.&lt;br&gt;
The practical implication: even if you adopt a good commercial agent observability tool, you still need the reasoning trace layer. Commercial tools give you the infrastructure view. RTD gives you the decision view. You need both.&lt;br&gt;
The Three-Layer Stack, Restated&lt;br&gt;
I've been building this framing across the series. The Sherlocks piece clarifies why it matters:&lt;br&gt;
Layer 1 — Infrastructure (APM, OTel, CloudWatch)&lt;br&gt;
What executed. Tool call latency, error rates, span data. Answers: did the tools work? Misses: did the agent reason correctly?&lt;br&gt;
Layer 2 — Control Plane (RAR, RSI, DCS from Post 7)&lt;br&gt;
How the orchestration behaved. Routing accuracy, retry patterns, task decomposition. Answers: did the workflow hold up? Misses: what was the agent deciding inside each task?&lt;br&gt;
Layer 3 — Reasoning (RTD from Post 11)&lt;br&gt;
What the agent decided. Re-plan count, tool sequence, decision rationale, HER correlation. Answers: did the reasoning hold up? This is the semantic gap layer.&lt;br&gt;
If you are buying observability tooling, demand explicit agent loop tracking. Ask for examples. Do not accept "we can log prompts" as an answer. (Source: Sherlocks AI)&lt;br&gt;
Logging prompts is Layer 1. You need Layer 3.&lt;br&gt;
The Postmortem Template Addition&lt;br&gt;
Every postmortem for an agent-involved incident should now have a section that didn't exist before: Semantic Gap Analysis.&lt;br&gt;
Three fields:&lt;br&gt;
Intent vs. outcome delta — what did the agent plan to do vs. what did it actually do? If these match, the reasoning held. If they diverge, you have a semantic gap event.&lt;br&gt;
Re-plan sequence — RTD value, re-plan reasons, which tools triggered each re-plan. This is where you find the actual root cause in most agent failures.&lt;br&gt;
HER correlation — did HER spike during this task? At which re-plan decision? That's the moment the agent recognized it was outside its reliable envelope.&lt;br&gt;
Without these three fields, your postmortem explains what broke. It can't explain why the agent did what it did before the break.&lt;br&gt;
Where This Fits in the Arc&lt;br&gt;
Post 4: SLOs for agents (DQR, TIE, HER, AQDD) — what to measure.&lt;br&gt;
Post 7: Control plane SLIs (RAR, RSI, DCS) — where Layer 2 lives.&lt;br&gt;
Post 11: RTD — the Layer 3 reasoning primitive.&lt;br&gt;
Post 13: Pre-Action Gate — using SLIs as authorization signals.&lt;br&gt;
Post 14: The semantic gap — why all three layers are necessary and what happens without Layer 3.&lt;br&gt;
The industry is arriving at this independently. The frameworks were already here.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;/p&gt;

</description>
      <category>sre</category>
      <category>agentaichallenge</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 26 May 2026 17:34:07 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-agent-acts-without-checking-your-error-budget-thats-the-failure-mode-nobody-is-tracking-29n0</link>
      <guid>https://dev.to/ajaydevineni/your-agent-acts-without-checking-your-error-budget-thats-the-failure-mode-nobody-is-tracking-29n0</guid>
      <description>&lt;p&gt;Yesterday a piece came out that framed something I've been watching build across production environments for months.&lt;br&gt;
There is a category of production incident that engineering teams are not tracking yet — because it doesn't fit any existing postmortem template. The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. By the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure. (Source: Kore.ai)&lt;br&gt;
That argument happens because the two disciplines — SRE and autonomous agents — have never been formally connected at the decision layer.&lt;br&gt;
Here's the connection I want to make explicit.&lt;br&gt;
What Chaos Engineering Gets Right&lt;br&gt;
Mature chaos engineering programs have a property that's easy to overlook because it's invisible when it's working. Before a human engineer initiates any experiment — a fault injection, a latency spike, a dependency kill — they make a judgment call: does this system have capacity to absorb a perturbation right now?&lt;br&gt;
They check error budget burn rate. They look at whether upstream dependencies are stable. They assess whether the on-call team has bandwidth to respond if something goes wrong. They check whether there's a deploy in flight that makes this a bad time.&lt;br&gt;
That judgment call is informal, often intuitive, and sometimes wrong. But it exists. It's the human-in-the-loop that decides whether the system is in a state to safely absorb autonomous action.&lt;br&gt;
Agents don't make that call. They evaluate their task context, form a plan, and execute. The question "is right now a safe time for this action given the current reliability state of the system?" is not in their decision loop.&lt;br&gt;
The agents delivering production value in 2026 share one defining property: bounded scope. The agent handles one domain, with a defined tool set, and explicitly refuses tasks outside that boundary. The boundary is what makes autonomous deployment safe. (Source: GlobeNewswire)&lt;br&gt;
Boundary on task scope is necessary. It's not sufficient. You also need a boundary on timing — a gate that checks whether the system's current reliability state can absorb what the agent is about to do.&lt;br&gt;
The Pre-Action SRE Gate&lt;br&gt;
I want to introduce a concrete pattern here: the Pre-Action SRE Gate — a check an agent runs against your existing SRE signals before executing any state-changing action.&lt;br&gt;
The gate has three checks, all using metrics I've built out across this series:&lt;br&gt;
Check 1 — Error Budget Headroom&lt;br&gt;
Before acting, the agent queries current SLO error budget remaining for the services in its blast radius. If error budget is below threshold — the system is already burning faster than acceptable — the agent does not act autonomously. It escalates.&lt;br&gt;
This is the chaos engineering judgment call, formalized as a programmatic check.&lt;br&gt;
Check 2 — AQDD State&lt;br&gt;
Approval Queue Depth Drift tells you whether the human oversight layer is already backed up. If AQDD is elevated — meaning humans can't process approvals fast enough — autonomous action during that window means any mistake won't be caught in time. Agent holds.&lt;br&gt;
Check 3 — HER Trend&lt;br&gt;
If the agent's own Human Escalation Rate has been elevated in the recent window, it's operating outside its reliable envelope. Letting it take autonomous action in that state compounds the risk. Agent escalates.&lt;br&gt;
None of these metrics are new. They're from Post 4 and Post 10 of this series. What's new is using them as gates before action, not just as observability signals after the fact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agentsre/pre_action_gate.py

from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timezone
import json


@dataclass
class SREGateResult:
    """
    Result of a Pre-Action SRE Gate check.

    If approved is False, the agent must not proceed with
    autonomous action — escalate to human owner per ARO record.

    Attributes:
        approved: Whether autonomous action is cleared
        blocking_check: Which check blocked (if any)
        error_budget_pct: Current error budget remaining (0-100)
        aqdd_depth: Current approval queue depth
        her_trend: Recent HER rate (0-100)
        recommendation: What the agent should do
        checked_at: Timestamp of gate check
    """
    approved: bool
    blocking_check: Optional[str]
    error_budget_pct: float
    aqdd_depth: int
    her_trend: float
    recommendation: str
    checked_at: str


class PreActionSREGate:
    """
    Pre-Action SRE Gate — checks your SRE signal state before
    an agent executes any autonomous write, remediation, scale
    event, or config change.

    This is the chaos engineering judgment call, formalized.
    A human engineer checks these things before running an experiment.
    Your agent should check them before acting autonomously.

    Thresholds should be calibrated per agent and task class
    in shadow mode — same protocol as HER and RTD baselines.
    """

    def __init__(self,
                 error_budget_min_pct: float = 20.0,
                 aqdd_max_depth: int = 3,
                 her_max_trend_pct: float = 15.0):
        """
        Args:
            error_budget_min_pct: Minimum error budget % required
                for autonomous action. Below this = escalate.
                Default 20% — agent should not consume budget
                that's already critically low.
            aqdd_max_depth: Max approval queue depth before
                autonomous action is blocked. Above this,
                humans can't course-correct fast enough.
            her_max_trend_pct: Max recent HER rate before
                autonomous action is blocked. Elevated HER
                means agent is already outside reliable envelope.
        """
        self.error_budget_min_pct = error_budget_min_pct
        self.aqdd_max_depth = aqdd_max_depth
        self.her_max_trend_pct = her_max_trend_pct

    def check(self,
              agent_id: str,
              intended_action: str,
              error_budget_pct: float,
              aqdd_depth: int,
              her_trend_pct: float) -&amp;gt; SREGateResult:
        """
        Run pre-action SRE gate check.

        Call this before any autonomous state-changing action.
        If result.approved is False — escalate, do not act.

        Args:
            agent_id: Agent requesting action clearance
            intended_action: Description of what agent plans to do
            error_budget_pct: Current error budget remaining (0-100)
            aqdd_depth: Current approval queue depth
            her_trend_pct: Agent's recent HER rate (0-100)

        Returns:
            SREGateResult with approval decision and reasoning
        """
        # Check 1: Error budget headroom
        if error_budget_pct &amp;lt; self.error_budget_min_pct:
            return SREGateResult(
                approved=False,
                blocking_check="error_budget",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Error budget at {error_budget_pct:.1f}% — "
                    f"below {self.error_budget_min_pct}% minimum. "
                    "Escalate to human owner. Do not act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 2: Approval queue state
        if aqdd_depth &amp;gt; self.aqdd_max_depth:
            return SREGateResult(
                approved=False,
                blocking_check="aqdd",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Approval queue depth {aqdd_depth} exceeds "
                    f"maximum {self.aqdd_max_depth}. "
                    "Human oversight is backed up — autonomous action "
                    "cannot be safely course-corrected. Hold."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 3: Agent's own HER trend
        if her_trend_pct &amp;gt; self.her_max_trend_pct:
            return SREGateResult(
                approved=False,
                blocking_check="her_trend",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"HER at {her_trend_pct:.1f}% — "
                    f"above {self.her_max_trend_pct}% threshold. "
                    "Agent is operating outside reliable envelope. "
                    "Escalate rather than act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # All checks passed
        return SREGateResult(
            approved=True,
            blocking_check=None,
            error_budget_pct=error_budget_pct,
            aqdd_depth=aqdd_depth,
            her_trend=her_trend_pct,
            recommendation="Autonomous action cleared. Proceed within blast radius.",
            checked_at=datetime.now(timezone.utc).isoformat()
        )

    def to_audit_log(self, agent_id: str,
                     intended_action: str,
                     result: SREGateResult) -&amp;gt; dict:
        """
        Structured audit log entry for every gate check.
        Every autonomous action attempt — approved or blocked —
        should be logged. This is your agent action audit trail.
        """
        return {
            "trace_type": "pre_action_gate",
            "agent_id": agent_id,
            "intended_action": intended_action,
            "gate_approved": result.approved,
            "blocking_check": result.blocking_check,
            "sre_signals": {
                "error_budget_pct": result.error_budget_pct,
                "aqdd_depth": result.aqdd_depth,
                "her_trend_pct": result.her_trend,
            },
            "recommendation": result.recommendation,
            "checked_at": result.checked_at,
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
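&lt;p&gt;Check 1 takes the remaining error budget as a percentage, so something has to compute that number before the gate runs. The sketch below shows one common way to derive it for a request-based SLO. The function name and its parameters are hypothetical, and it assumes you can count total and bad events for the current SLO window.&lt;/p&gt;

```python
def error_budget_remaining_pct(slo_target: float,
                               total_events: int,
                               bad_events: int) -> float:
    """Percentage of the error budget still unspent in the current SLO window.

    slo_target is a fraction such as 0.999 (assumed request-based SLO).
    """
    if not (1.0 > slo_target > 0.0):
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 100.0  # no traffic yet, so nothing has been spent
    # The budget is the number of bad events the SLO tolerates in this window.
    allowed_bad = total_events * (1.0 - slo_target)
    spent_fraction = bad_events / allowed_bad
    # Clamp at zero: an overspent budget is reported as 0% remaining.
    return max(0.0, 100.0 * (1.0 - spent_fraction))


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests tolerates about 1,000 bad requests.
    print(error_budget_remaining_pct(0.999, 1_000_000, 850))
```

&lt;p&gt;With 850 bad requests against a budget of roughly 1,000, about 15% of the budget remains. That is under the gate's default 20% minimum, so the agent would escalate instead of acting.&lt;/p&gt;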

&lt;p&gt;How This Connects to the Full Arc&lt;br&gt;
Post 4 introduced DQR, TIE, HER, AQDD as observability SLIs — things you watch.&lt;br&gt;
Post 10 introduced ARO — who owns the agent when those SLIs breach.&lt;br&gt;
Post 11 introduced RTD — the reasoning observability layer.&lt;br&gt;
Post 12 introduced CUR — context budget as a reliability ceiling.&lt;br&gt;
This post introduces the Pre-Action SRE Gate — where all of those signals become decision inputs rather than observability outputs. The agent reads your SRE state before acting, not just after.&lt;br&gt;
Resilience requires explicit investment in circuit breakers, graceful degradation, and clear failure modes that preserve system integrity. Teams building agents must invest in resilience infrastructure before pushing to higher-criticality workloads. (Source: SourceForge)&lt;br&gt;
The Pre-Action Gate is that infrastructure. It's your agent's circuit breaker — not on retry loops or cost, but on system-level reliability state.&lt;br&gt;
The Postmortem Template Gap&lt;br&gt;
79% of organizations now have AI agents in production. Gartner warns 40% of those projects will be canceled due to poor risk controls. The incidents happening in that gap don't fit existing postmortem templates because current templates ask: what changed? who deployed? what failed? (Source: Kore.ai)&lt;br&gt;
They don't ask: what was the error budget state when the agent acted? Was AQDD elevated, meaning the approval layer was already overwhelmed? Had the agent's HER been trending up, meaning it was already in unreliable territory?&lt;br&gt;
Those questions need to be in your postmortem template. Add a section: Agent Pre-Action State — error budget at time of action, AQDD depth, HER trend. If your postmortem can't answer those three questions, you don't have the data to prevent the same incident from happening again.&lt;br&gt;
The code is in agentsre/pre_action_gate.py on GitHub. MIT licensed, zero external dependencies.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1uu32oi9dqse8tf02a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1uu32oi9dqse8tf02a.jpeg" alt=" " width="427" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cursor</category>
    </item>
    <item>
      <title>Why Your AI Agent Monitoring is Wrong (And How to Fix It)</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Mon, 25 May 2026 11:35:24 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/why-your-ai-agent-monitoring-is-wrong-and-how-to-fix-it-1b25</link>
      <guid>https://dev.to/ajaydevineni/why-your-ai-agent-monitoring-is-wrong-and-how-to-fix-it-1b25</guid>
      <description>&lt;p&gt;As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement semantic monitoring in production.&lt;br&gt;
Your AI agent is running in production.&lt;br&gt;
HTTP 200. Uptime 99.9%. All dashboards are green.&lt;br&gt;
But it's making the wrong decision 30% of the time.&lt;br&gt;
Your monitoring won't tell you.&lt;br&gt;
The Gap&lt;br&gt;
I spent six months figuring this out the hard way. Traditional SRE monitoring measures infrastructure. Network latency. Error rates. Uptime. It's designed for services that crash when they break. But agents don't crash. They degrade. Slowly. Silently.&lt;br&gt;
An agent can be:&lt;/p&gt;

&lt;p&gt;94% accurate (still 94%) &lt;br&gt;
But losing confidence (0.92 to 0.41)&lt;br&gt;
Compensating by calling tools 3x more (1.1x to 3.1x)&lt;br&gt;
While humans reject more of its output (1% to 19%)&lt;br&gt;
As work piles up waiting for approval (8 to 340 items)&lt;/p&gt;

&lt;p&gt;Your monitoring sees "everything is fine."&lt;br&gt;
You see $2M impact by the time you notice.&lt;br&gt;
What We Actually Need to Measure&lt;br&gt;
Not infrastructure metrics. Semantic metrics.&lt;br&gt;
Four things:&lt;br&gt;
Decision Quality Rate &lt;strong&gt;(DQR)&lt;/strong&gt;&lt;br&gt;
Is the agent picking the right tool?&lt;br&gt;
Healthy: 92%+&lt;br&gt;
Threshold for action: &amp;lt;85%&lt;br&gt;
Tool Invocation Efficiency &lt;strong&gt;(TIE)&lt;/strong&gt;&lt;br&gt;
Is it over-compensating by calling tools more than normal?&lt;br&gt;
Healthy: 1.0-1.2x baseline&lt;br&gt;
Threshold for action: &amp;gt;1.5x&lt;br&gt;
Human Escalation Rate &lt;strong&gt;(HER)&lt;/strong&gt;&lt;br&gt;
Are humans rejecting its decisions?&lt;br&gt;
Healthy: &amp;lt;2%&lt;br&gt;
Threshold for action: &amp;gt;5%&lt;br&gt;
Approval Queue Depth Drift &lt;strong&gt;(AQDD)&lt;/strong&gt;&lt;br&gt;
Is work piling up waiting for approval?&lt;br&gt;
Healthy: &amp;lt;20 pending&lt;br&gt;
Threshold for action: &amp;gt;50 pending&lt;br&gt;
When any of these drift, semantic failure is 48 hours away.&lt;br&gt;
Real Scenario&lt;br&gt;
Tuesday 2pm: Agent starts degrading. DQR drops from 94% to 88%. TIE increases from 1.1x to 1.4x. Nothing alarming yet by traditional metrics.&lt;br&gt;
Your infrastructure monitoring stays green.&lt;br&gt;
Thursday 10am: DQR at 62%. TIE at 3.1x. Queue at 340 items.&lt;br&gt;
Your first alert finally fires - from your infrastructure monitoring noticing error rates creeping up.&lt;br&gt;
You've just lost 40+ hours of bad decisions.&lt;br&gt;
With semantic SLIs, you would have known Tuesday at 2:15pm.&lt;br&gt;
How We Built It&lt;br&gt;
We built a semantic SLI monitoring system that:&lt;/p&gt;

&lt;p&gt;Tracks what matters - DQR, TIE, HER, AQDD (not uptime)&lt;br&gt;
Detects degradation early - 48 hours before traditional SLIs&lt;br&gt;
Suggests remediation - Not just "something's wrong"&lt;br&gt;
Automates response - Progressive autonomy constraints&lt;/p&gt;

&lt;p&gt;When degradation is detected:&lt;/p&gt;

&lt;p&gt;Agent autonomy automatically constrained (FULL → GUIDED → SUPERVISED → BLOCKED)&lt;br&gt;
Slack notification sent with context&lt;br&gt;
Remediation steps suggested (prioritized by success rate)&lt;br&gt;
Everything tracked for audit and learning&lt;/p&gt;
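&lt;p&gt;One way to picture the progressive constraint is as a mapping from breached SLIs to an autonomy level. The sketch below uses the "threshold for action" values listed earlier in this post. The escalation policy itself (one level removed per breached SLI) is an assumption made for illustration, and the names are hypothetical; they do not describe how the agentsre orchestrator decides.&lt;/p&gt;

```python
from enum import IntEnum
from typing import List


class Autonomy(IntEnum):
    FULL = 0
    GUIDED = 1
    SUPERVISED = 2
    BLOCKED = 3


def breached_slis(dqr: float, tie: float, her: float, aqdd: int) -> List[str]:
    """Return the SLIs that are past their 'threshold for action' values."""
    breached = []
    if 85.0 > dqr:      # DQR below 85%
        breached.append("DQR")
    if tie > 1.5:       # tool calls above 1.5x baseline
        breached.append("TIE")
    if her > 5.0:       # HER above 5%
        breached.append("HER")
    if aqdd > 50:       # more than 50 items pending approval
        breached.append("AQDD")
    return breached


def autonomy_for(dqr: float, tie: float, her: float, aqdd: int) -> Autonomy:
    """Assumed policy: each breached SLI removes one level of autonomy."""
    breaches = len(breached_slis(dqr, tie, her, aqdd))
    return Autonomy(min(breaches, int(Autonomy.BLOCKED)))


if __name__ == "__main__":
    # Values from the degraded payment agent in the code example of this post.
    print(autonomy_for(dqr=62.0, tie=2.8, her=15.0, aqdd=180).name)
```

&lt;p&gt;A real policy would likely weight the SLIs differently per agent and task class, so treat the one-level-per-breach rule as a placeholder.&lt;/p&gt;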

&lt;p&gt;Code Example&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agentsre.orchestration import FintechSREOrchestrator, AgentRole, AlertManager

# Initialize orchestrator
orch = FintechSREOrchestrator()
orch.register_agent("payment-1", AgentRole.PAYMENT_PROCESSOR)

# Initialize alerts
alerts = AlertManager()

def on_critical_alert(alert_dict):
    # send_to_slack is your own notification helper (not shown here)
    send_to_slack(alert_dict)

alerts.slack_handler = on_critical_alert

# Update metrics as agent runs
orch.update_metrics(
    agent_id="payment-1",
    dqr=62.0,      # Decision quality degraded
    tie=2.8,       # Tool calls increased
    her=15.0,      # Escalations up
    aqd=180,       # Queue growing
    confidence=0.42,
    cost=0.0003
)

# Create alert with remediation suggestions
alert = alerts.create_alert(
    agent_id="payment-1",
    reason="Semantic degradation detected",
    triggered_metrics=["DQR", "TIE", "HER"],
    current_values={
        "dqr": 62.0,
        "tie": 2.8,
        "her": 15.0,
        "aqd": 180
    }
)

# Get remediation steps
for step in alert.suggested_remediations[:3]:
    print(f"→ {step.action} ({step.estimated_time_minutes}min)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ Review latest 10 agent decisions - identify pattern (15min)
→ Check upstream service - likely returning bad data (10min)
→ Agent over-compensating - check confidence scores (10min)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What This Means for SRE&lt;br&gt;
You're not just detecting problems. You're understanding them.&lt;br&gt;
Instead of:&lt;/p&gt;

&lt;p&gt;"Error rate is high"&lt;br&gt;
"Latency is up"&lt;br&gt;
"Something's wrong"&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;p&gt;"Agent decision quality dropped 15%, tool calls increased 2.8x, humans rejecting 15% of output, 180 items pending approval"&lt;br&gt;
Suggested fix: Check upstream service (likely corrupting data)&lt;br&gt;
Severity: CRITICAL&lt;/p&gt;

&lt;p&gt;That's the difference between reactive and proactive reliability.&lt;br&gt;
Open Source&lt;br&gt;
Built all this open source. MIT licensed.&lt;br&gt;
Tested in production at scale. Works with LangChain, CrewAI, Bedrock.&lt;br&gt;
GitHub: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt;&lt;br&gt;
For Your Team&lt;br&gt;
If you're running agents in production, you probably have this problem too. You just don't know it yet.&lt;br&gt;
Try semantic SLIs. If you catch something you didn't know was degrading (most teams do), you'll know it was worth it.&lt;br&gt;
The cost of not knowing? Sometimes it's $2M.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>fintech</category>
    </item>
    <item>
      <title>The Context Window Is RAM — Why Your Agent's SLIs Are Telling You It's Full</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Fri, 22 May 2026 02:18:21 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-context-window-is-ram-why-your-agents-slis-are-telling-you-its-full-4ejb</link>
      <guid>https://dev.to/ajaydevineni/the-context-window-is-ram-why-your-agents-slis-are-telling-you-its-full-4ejb</guid>
      <description>&lt;p&gt;The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to.&lt;br&gt;
Six months into building it, they realized they weren't building an SRE agent. They were building a context engineering system that happens to do site reliability engineering. Better models were table stakes, but what moved the needle was what they controlled: disciplined context management. (Source: Kore.ai)&lt;br&gt;
That framing is exactly right. And it has a reliability implication that I haven't seen anyone write about directly.&lt;br&gt;
The Problem&lt;br&gt;
Your agent's context window is volatile working memory. Fast, expensive, and non-persistent. It's RAM, not storage. When the session ends, it's gone. When it fills up, quality degrades — not linearly, but in ways that are hard to predict and easy to miss.&lt;br&gt;
As you fill the context window, model quality drops non-linearly. "Lost in the middle," "not adhering to my instructions," and long-context degradation show up well before the advertised token limits. More tokens don't just cost latency — they quietly erode accuracy. (Source: Kore.ai)&lt;br&gt;
That quiet erosion is the reliability failure mode. It doesn't throw an exception. It doesn't spike your error rate. Your agent keeps running. It just makes progressively worse decisions as the context fills.&lt;br&gt;
And here's the part I want to be specific about: you already have the SLIs to catch this. You just haven't connected them to context state yet.&lt;br&gt;
What Context Overflow Looks Like in Your SLIs&lt;br&gt;
When an agent's context fills beyond its effective working range, three things happen in order:&lt;br&gt;
DQR (Decision Quality Rate) drops first. The agent's decisions get worse because early instructions are now competing with thousands of tokens of recent tool output. An instruction from turn 3 gets buried under content that arrived after it — the agent isn't ignoring it, it's attending more reliably to recent content as the session grows. This is a passive decay process, not a model bug. (Source: incident.io)&lt;br&gt;
RTD (Reasoning Trace Depth) climbs next. The agent re-plans more because its earlier context — what it already established about the problem — is partially decayed. It's not re-planning because something changed. It's re-planning because it partially forgot what it already figured out.&lt;br&gt;
TIE (Tool Invocation Efficiency) degrades last. The agent starts calling tools to reconstruct context it already had. It queries the same data sources again. It re-fetches runbooks it already read. Tool call count per task climbs above baseline while task quality continues to fall.&lt;br&gt;
By the time TIE is visibly elevated, you're already well into the degradation window. DQR was the earlier signal. And DQR dropping in a long-running session, without any external trigger, is your context overflow signature.&lt;br&gt;
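&lt;/p&gt;

&lt;p&gt;One way to check for that ordering in practice is to record the turn at which each signal first crosses its threshold within a session. The sketch below is a minimal illustration and not part of agentsre: the TurnSample fields, the degradation_order helper, and every threshold value are hypothetical, and tool calls per task stands in for TIE.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TurnSample:
    """One observation per turn of a long-running session (hypothetical shape)."""
    turn: int
    dqr: float        # decision quality rate, 0..1, higher is better
    rtd: int          # re-planning cycles on the current task
    tool_calls: int   # tool calls per task, used here as the TIE input


def first_breach(samples: List[TurnSample],
                 breached: Callable[[TurnSample], bool]) -> Optional[int]:
    """Return the first turn at which the predicate holds, or None."""
    for sample in samples:
        if breached(sample):
            return sample.turn
    return None


def degradation_order(samples: List[TurnSample], dqr_floor: float,
                      rtd_limit: int, tool_call_limit: int) -> List[str]:
    """List the signals that breached, ordered by the turn they first breached."""
    breaches = {
        "DQR": first_breach(samples, lambda s: dqr_floor > s.dqr),
        "RTD": first_breach(samples, lambda s: s.rtd > rtd_limit),
        "TIE": first_breach(samples, lambda s: s.tool_calls > tool_call_limit),
    }
    hit = [(turn, name) for name, turn in breaches.items() if turn is not None]
    return [name for _, name in sorted(hit)]


# Illustrative session: samples are assumed to arrive in turn order.
session = [
    TurnSample(turn=1, dqr=0.95, rtd=0, tool_calls=2),
    TurnSample(turn=4, dqr=0.84, rtd=1, tool_calls=2),
    TurnSample(turn=6, dqr=0.80, rtd=3, tool_calls=3),
    TurnSample(turn=8, dqr=0.70, rtd=4, tool_calls=6),
]
order = degradation_order(session, dqr_floor=0.90, rtd_limit=2, tool_call_limit=4)
print(order)
```

&lt;p&gt;If DQR comes first in the returned list for a long-running session with no external trigger, that matches the context overflow signature described above.&lt;/p&gt;

&lt;p&gt;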
The Architecture Fix&lt;br&gt;
Mem0's 2026 benchmarks quantify the difference clearly: a full-context baseline (everything packed into the window) scored 72.9% accuracy using 26,000 tokens per query at 17-second p95 latency. A two-layer memory architecture scored 91.6% accuracy using under 7,000 tokens at 1.4-second p95 latency. That's an 18.7-point accuracy improvement while using roughly 3.7x fewer tokens and cutting p95 latency by about 92%. (Source: Yahoo Finance)&lt;br&gt;
The two-layer architecture is straightforward:&lt;br&gt;
Working memory (context window): Only what's needed for the current decision. Active task state, recent tool results, current instructions. Managed actively — compressed, summarized, or paged out as the session grows.&lt;br&gt;
Persistent memory (external store): Facts that persist across decisions and sessions. User preferences, established system state, prior investigation findings, runbook contents. Fetched into context when relevant, not kept resident the whole time.&lt;br&gt;
The discipline is knowing what belongs in each layer and managing the boundary actively.&lt;br&gt;
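&lt;/p&gt;

&lt;p&gt;To make the boundary concrete, here is a minimal sketch of the two layers, assuming a plain dictionary as the stand-in for the external store and caller-supplied token counts instead of a real tokenizer. The TwoLayerMemory class and its method names are hypothetical and are not part of agentsre or Mem0.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


class TwoLayerMemory:
    """Working memory with a token budget, backed by a persistent key-value store."""

    def __init__(self, working_budget_tokens: int):
        self.working_budget_tokens = working_budget_tokens
        # (key, text, tokens), oldest first: what is resident in the context window
        self.working: List[Tuple[str, str, int]] = []
        # Stand-in for an external store (database, vector index, file, ...)
        self.persistent: Dict[str, str] = {}

    def working_tokens(self) -> int:
        return sum(tokens for _, _, tokens in self.working)

    def remember(self, key: str, text: str, tokens: int) -> None:
        """Add an item to working memory, paging out the oldest items if over budget."""
        self.working.append((key, text, tokens))
        while (self.working_tokens() > self.working_budget_tokens
               and len(self.working) > 1):
            old_key, old_text, _ = self.working.pop(0)
            self.persistent[old_key] = old_text

    def recall(self, key: str, tokens: int) -> str:
        """Return an item, fetching it back into working memory if it was paged out."""
        for resident_key, text, _ in self.working:
            if resident_key == key:
                return text
        text = self.persistent[key]
        self.remember(key, text, tokens)
        return text


memory = TwoLayerMemory(working_budget_tokens=100)
memory.remember("runbook", "restart steps", tokens=60)
memory.remember("tool_result", "cpu normal", tokens=30)
memory.remember("alert", "p95 latency high", tokens=40)  # pages out the runbook
```

&lt;p&gt;Oldest-first eviction is the simplest possible paging policy and is only an assumption here. A real agent would more likely summarize before paging out, or evict by relevance to the current task.&lt;/p&gt;

&lt;p&gt;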
Connecting This to Your Production Readiness Checklist&lt;br&gt;
Before a long-running agent goes to production, two questions need answers:&lt;br&gt;
What is the expected context budget for a typical session? Not the model's maximum. The budget at which you've measured DQR starting to degrade for this specific agent on this specific task class. That number is your operational ceiling, not the advertised token limit.&lt;br&gt;
What happens when the agent approaches that ceiling? Does it compress? Summarize and page out? Escalate to human? Or does it silently continue with degrading accuracy until something downstream notices?&lt;br&gt;
If the answer to the second question is "it keeps going," that's your reliability gap. The context ceiling needs the same circuit breaker thinking as your token budget ceiling from the cost post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agentsre/context_budget.py

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ContextBudgetTracker:
    """
    Track context utilization against operational DQR ceiling.

    The model's advertised token limit is not your operational limit.
    Your operational limit is the token count at which DQR starts
    to degrade for this agent on this task class. Establish that
    baseline in shadow mode. Set your ceiling below it.

    Attributes:
        agent_id: Agent being tracked
        task_class: Task type (DQR ceiling varies by task complexity)
        operational_ceiling_tokens: Tokens at which DQR degrades
            for this agent/task combination. NOT the model's max.
        warning_threshold_pct: Fraction of ceiling triggering warning
        current_tokens: Current context utilization
    """
    agent_id: str
    task_class: str
    operational_ceiling_tokens: int
    warning_threshold_pct: float = 0.75
    current_tokens: int = 0
    session_id: str = ""
    compression_events: int = 0

    @property
    def utilization_pct(self) -&amp;gt; float:
        """Current context utilization as fraction of operational ceiling."""
        return self.current_tokens / self.operational_ceiling_tokens

    @property
    def budget_status(self) -&amp;gt; str:
        """
        OK: within safe operating range
        WARNING: approaching DQR degradation ceiling
        CRITICAL: at or above operational ceiling, DQR degrading
        """
        u = self.utilization_pct
        if u &amp;lt; self.warning_threshold_pct:
            return "OK"
        elif u &amp;lt; 1.0:
            return "WARNING"
        return "CRITICAL"

    def update(self, current_tokens: int) -&amp;gt; dict:
        """
        Update current context utilization and return status record.
        Call this after each tool call or model response.

        Returns status record for logging to CloudWatch / Datadog.
        """
        self.current_tokens = current_tokens
        record = {
            "agent_id": self.agent_id,
            "session_id": self.session_id,
            "task_class": self.task_class,
            "current_tokens": self.current_tokens,
            "operational_ceiling": self.operational_ceiling_tokens,
            "utilization_pct": round(self.utilization_pct, 3),
            "budget_status": self.budget_status,
            "compression_events": self.compression_events,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return record

    def record_compression(self) -&amp;gt; None:
        """Call when context compression or summarization fires."""
        self.compression_events += 1

    def should_compress(self) -&amp;gt; bool:
        """True when context is approaching DQR degradation ceiling."""
        return self.utilization_pct &amp;gt;= self.warning_threshold_pct

    def should_escalate(self) -&amp;gt; bool:
        """
        True when context is at or above operational ceiling.
        At this point DQR is actively degrading.
        Escalate to human or terminate session cleanly.
        """
        return self.utilization_pct &amp;gt;= 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
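
&lt;p&gt;To show how the compress and escalate decisions might sit inside an agent loop, here is a self-contained sketch that applies the same two thresholds to a simulated session. The run_session function is hypothetical, and the compression step is a placeholder that simply halves the token count; a real implementation would summarize or page out content.&lt;/p&gt;

```python
from typing import List


def run_session(turn_deltas: List[int], ceiling_tokens: int,
                warning_pct: float = 0.75,
                compress_ratio: float = 0.5) -> List[str]:
    """Simulate a session: each turn adds tokens to the context.

    Compress once utilization reaches the warning threshold, and stop with an
    escalation once it reaches the operational ceiling.
    """
    tokens = 0
    actions = []
    for delta in turn_deltas:
        tokens += delta
        utilization = tokens / ceiling_tokens
        if utilization >= 1.0:
            # Circuit breaker: do not continue with degrading accuracy.
            actions.append("escalate")
            break
        if utilization >= warning_pct:
            # Placeholder for real compression or summarization.
            tokens = int(tokens * compress_ratio)
            actions.append("compress")
        else:
            actions.append("continue")
    return actions


actions = run_session([3000, 3000, 3000, 9000], ceiling_tokens=10000)
print(actions)
```

&lt;p&gt;The point of the sketch is the ordering of the checks: the ceiling is tested first, so a single oversized tool result cannot slip past the breaker by triggering compression instead.&lt;/p&gt;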

&lt;p&gt;The Practical Baseline Protocol&lt;br&gt;
Before you can set an operational context ceiling, you need to know where DQR actually starts to degrade for your specific agent on your specific task class. The steps:&lt;br&gt;
Run the agent in shadow mode on a representative sample of tasks. Record DQR at 25%, 50%, 75%, and 100% of the model's advertised context limit. Find the inflection point, where DQR starts dropping. Set your operational ceiling at 80% of that inflection point, and set your warning threshold as a fraction of that ceiling (the tracker above defaults to 75%). Trigger compression at the warning threshold and escalation at the ceiling, not continuation.&lt;br&gt;
This is the same baseline protocol as HER and RTD. Thirty days of shadow mode, measure the metric, set the threshold. The only difference is that context budget degradation is session-scoped rather than task-scoped.&lt;br&gt;
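&lt;/p&gt;

&lt;p&gt;The protocol above can be sketched in a few lines. This is a rough illustration under stated assumptions: the function names are hypothetical, the 0.02 tolerated DQR drop is an arbitrary example value, and the measurements are made-up numbers for a model with a 200,000-token advertised limit.&lt;/p&gt;

```python
from typing import List, Optional, Tuple


def find_inflection_tokens(measurements: List[Tuple[int, float]],
                           baseline_dqr: float,
                           max_drop: float = 0.02) -> Optional[int]:
    """Return the smallest measured token count at which DQR has fallen
    more than max_drop below the baseline, or None if it never does."""
    for tokens, dqr in sorted(measurements):
        if baseline_dqr - dqr > max_drop:
            return tokens
    return None


def operational_ceiling(inflection_tokens: int, safety_pct: int = 80) -> int:
    """Ceiling as a percentage of the inflection point (80% in the text above)."""
    return inflection_tokens * safety_pct // 100


# (context tokens, DQR) at 25%, 50%, 75%, and 100% of the advertised limit
shadow_measurements = [
    (50_000, 0.94),
    (100_000, 0.93),
    (150_000, 0.86),
    (200_000, 0.78),
]
inflection = find_inflection_tokens(shadow_measurements, baseline_dqr=0.94)
ceiling = operational_ceiling(inflection)
print(inflection, ceiling)
```

&lt;p&gt;With only four measurement points the inflection is coarse: the real drop starts somewhere between the last healthy point and the first degraded one. Sampling more densely in that interval, or taking the last healthy point instead, would give a more conservative ceiling.&lt;/p&gt;

&lt;p&gt;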
Why This Post Belongs in This Series&lt;br&gt;
Post 4 established DQR as your output quality SLI. Post 9 established token budget as a cost circuit breaker. Post 11 introduced RTD as your reasoning observability layer.&lt;br&gt;
This post connects all three: context window mismanagement is the common cause that degrades DQR, elevates RTD, and burns through your token budget. Fix the memory architecture and you see improvement across all three signals. That's not a coincidence. They're measuring the same failure from different angles.&lt;br&gt;
The code is in agentsre/context_budget.py on GitHub. MIT licensed, zero external dependencies.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6pnufmeze84rbn9kur.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6pnufmeze84rbn9kur.jpeg" alt=" " width="427" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>azure</category>
    </item>
    <item>
      <title>Your OTel Traces Are Lying to You: Observability for the Reasoning Layer</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 19 May 2026 02:12:12 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-otel-traces-are-lying-to-you-observability-for-the-reasoning-layer-2f7p</link>
      <guid>https://dev.to/ajaydevineni/your-otel-traces-are-lying-to-you-observability-for-the-reasoning-layer-2f7p</guid>
      <description>&lt;p&gt;Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU normal. Memory stable. Latency within SLO. Zero error rate in CloudWatch.&lt;br&gt;
The agent was re-planning on every single task. One tool kept returning stale data. The agent recognized it, switched tools, got a different failure, re-planned again. It completed tasks — slowly, expensively, with degrading output quality. Nothing in the dashboard moved.&lt;br&gt;
This is not an edge case. This is the default failure mode of agentic AI in production, and your current observability stack cannot see it.&lt;br&gt;
Why OTel Misses the Problem&lt;br&gt;
OpenTelemetry is the best thing that's happened to observability in a decade. Traces, metrics, logs — stable across all three signal types as of the 2026 CNCF milestone. Auto-instrumentation is production-grade. The ecosystem is mature.&lt;br&gt;
And for agent reasoning behavior, it is the wrong level of abstraction.&lt;br&gt;
OTel traces infrastructure execution. A trace shows you: this request arrived, it called this service, that service called this database, the database returned in 42ms, the response went back. Perfect for distributed systems.&lt;br&gt;
An agent doesn't execute a fixed call graph. An agent reasons. It evaluates state, picks a tool, observes the result, decides whether to continue or re-plan, picks another tool. The reasoning path is dynamic. The same input can produce different call graphs on different runs depending on what the tools return.&lt;br&gt;
The key shift is that once agent reasoning is exported into your observability stack, traces stop showing infrastructure execution and start showing reasoning behavior, but only if you're emitting the right data. (Source: Kore.ai)&lt;br&gt;
Most teams aren't. They're emitting infrastructure spans. The reasoning is invisible.&lt;/p&gt;

&lt;p&gt;The Pattern: Silent Degradation via Re-Planning Loops&lt;br&gt;
Here's what silent agent degradation looks like in a trace when you're not capturing reasoning:&lt;br&gt;
span: agent-task-processor  duration: 4.2s  status: OK&lt;br&gt;
  span: tool-call-cloudwatch  duration: 0.8s  status: OK&lt;br&gt;
  span: tool-call-dynamodb     duration: 0.4s  status: OK&lt;br&gt;
  span: tool-call-cloudwatch  duration: 0.8s  status: OK&lt;br&gt;
Looks fine. Three tool calls, all successful, every span closed with status OK.&lt;br&gt;
Here's what's actually happening:&lt;br&gt;
agent receives task&lt;br&gt;
→ plans: use CloudWatch metric X&lt;br&gt;
→ calls CloudWatch: returns stale data (tool succeeds, data is wrong)&lt;br&gt;
→ agent evaluates result: doesn't match expected state&lt;br&gt;
→ RE-PLANS: try DynamoDB instead&lt;br&gt;
→ calls DynamoDB: schema mismatch (tool succeeds, data wrong format)&lt;br&gt;
→ RE-PLANS: back to CloudWatch, different metric&lt;br&gt;
→ calls CloudWatch: stale again&lt;br&gt;
→ RE-PLANS: escalate to human&lt;br&gt;
Three successful tool spans. Two re-planning cycles, then one HER escalation. Zero errors in your monitoring.&lt;br&gt;
This is your RSI (Retry Storm Index) in action — not at the HTTP retry level, but at the reasoning level.&lt;/p&gt;

&lt;p&gt;Introducing Reasoning Trace Depth&lt;br&gt;
I want to introduce a new observable to pair with RSI: Reasoning Trace Depth (RTD).&lt;br&gt;
RTD = the number of re-planning cycles an agent goes through before either completing a task or escalating.&lt;br&gt;
Baseline for a healthy agent on routine tasks: 0–1 re-planning cycles.&lt;br&gt;
Warning threshold: 3+ re-planning cycles.&lt;br&gt;
Critical threshold: 5+ re-planning cycles (agent is effectively stuck).&lt;br&gt;
RTD is your earliest signal. It rises before HER (because the agent is still trying before escalating), before latency becomes visible to users, and before cost metrics show anomalous spend.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentDecisionTrace:
    """
    Structured reasoning trace for a single agent task execution.
    Emitted once per task, NOT once per tool call.
    This is your reasoning observability layer.
    """
    agent_id: str
    session_id: str
    task_id: str
    timestamp: str

    # Reasoning behavior
    initial_plan: str
    tools_called: List[str] = field(default_factory=list)
    replan_count: int = 0           # RTD (Reasoning Trace Depth)
    replan_reasons: List[str] = field(default_factory=list)

    # Outcome
    task_completed: bool = False
    human_escalated: bool = False   # HER signal

    # Cost signals
    total_tool_calls: int = 0
    latency_ms: int = 0

    # Quality proxy (if available)
    confidence_proxy: Optional[float] = None


def emit_decision_trace(trace: AgentDecisionTrace) -&amp;gt; dict:
    """
    Emit structured decision trace to your log aggregator.
    This sits ABOVE your OTel infrastructure spans.
    One entry per agent task: your reasoning observability layer.
    """
    record = {
        "trace_type": "agent_decision",
        "agent_id": trace.agent_id,
        "session_id": trace.session_id,
        "task_id": trace.task_id,
        "timestamp": trace.timestamp,
        "reasoning": {
            "initial_plan": trace.initial_plan,
            "replan_count": trace.replan_count,        # RTD
            "replan_reasons": trace.replan_reasons,
            "tools_sequence": trace.tools_called
        },
        "outcome": {
            "completed": trace.task_completed,
            "human_escalated": trace.human_escalated,  # HER
        },
        "cost": {
            "tool_calls_total": trace.total_tool_calls,
            "latency_ms": trace.latency_ms
        }
    }

    # Flag for immediate attention (CRITICAL overrides WARNING)
    if trace.replan_count &amp;gt;= 3:
        record["alert"] = "RTD_WARNING"
    if trace.replan_count &amp;gt;= 5:
        record["alert"] = "RTD_CRITICAL"

    return record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Three-Layer Observability Model for Agents&lt;br&gt;
Your current stack has two layers. You need three.&lt;br&gt;
Layer 1 — Infrastructure (you already have this)&lt;br&gt;
OTel traces, Prometheus metrics, structured logs. Tool call latency, error rates, resource utilization. This is what Datadog, Grafana, and CloudWatch show you. It's correct and necessary. It just doesn't see reasoning.&lt;br&gt;
Layer 2 — Control Plane (from Post 7 — RAR, RSI, DCS)&lt;br&gt;
Routing accuracy, retry patterns at the orchestration level, decomposition quality. This is your agent behavior at the workflow level — are tasks being routed correctly? Is the orchestrator stable?&lt;br&gt;
Layer 3 — Reasoning (what's missing)&lt;br&gt;
RTD (Reasoning Trace Depth), re-plan reasons, plan-to-execution delta, decision confidence proxies. One structured log entry per agent task. This is the layer your dashboards don't have.&lt;br&gt;
The diagnostic flow when something feels wrong but dashboards are green:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check Layer 1: Is infrastructure healthy?&lt;br&gt;
→ Yes → move to Layer 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check Layer 2: Is RSI elevated? Is RAR degraded?&lt;br&gt;
→ RSI elevated → move to Layer 3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check Layer 3: Is RTD above baseline?&lt;br&gt;
→ RTD &amp;gt; 3 → agent is re-planning, find the tool/data source causing it&lt;br&gt;
→ RTD normal, HER elevated → agent is escalating cleanly, check decision envelope&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
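
&lt;p&gt;The same flow can be written as a small decision function, which is handy for a runbook or a dashboard annotation. This is a sketch only: the diagnose function and its boolean inputs are hypothetical, and the branches the list above leaves open (RSI normal, or nothing abnormal at Layer 3) are filled with placeholder results that are assumptions of this example.&lt;/p&gt;

```python
def diagnose(infra_healthy: bool, rsi_elevated: bool, rtd: float,
             her_elevated: bool, rtd_threshold: float = 3) -> str:
    """Walk the three observability layers top-down and return the first finding."""
    # Layer 1: infrastructure
    if not infra_healthy:
        return "layer1: infrastructure problem"
    # Layer 2: control plane (assumption: with RSI normal, look at routing / RAR)
    if not rsi_elevated:
        return "layer2: RSI normal, review RAR and routing"
    # Layer 3: reasoning
    if rtd > rtd_threshold:
        return "layer3: re-planning loop, find the tool or data source causing it"
    if her_elevated:
        return "layer3: clean escalation, check decision envelope"
    return "layer3: no reasoning-layer finding"


finding = diagnose(infra_healthy=True, rsi_elevated=True, rtd=4, her_elevated=False)
print(finding)
```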

&lt;p&gt;What This Looks Like in CloudWatch&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')


def publish_rtd_metric(agent_id: str, rtd_value: int) -&amp;gt; None:
    """
    Publish Reasoning Trace Depth to CloudWatch.
    Alert when RTD exceeds 3: agent is re-planning excessively.
    """
    cw.put_metric_data(
        Namespace='AgentSRE/Reasoning',
        MetricData=[{
            'MetricName': 'ReasoningTraceDepth',
            'Dimensions': [{'Name': 'AgentId', 'Value': agent_id}],
            'Value': float(rtd_value),
            'Unit': 'Count'
        }]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;
Set your alarm at RTD &amp;gt; 3 sustained over a 5-minute window. That's your early warning before HER spikes, before users feel latency, before cost anomalies appear in your billing dashboard.&lt;/p&gt;
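
&lt;p&gt;As a sketch of what that alarm could look like, the function below builds the keyword arguments for the CloudWatch put_metric_alarm call against the metric published above. The alarm name, the choice of the Average statistic, and the missing-data handling are assumptions of this example. The dictionary is built separately from the API call so it can be inspected without AWS credentials.&lt;/p&gt;

```python
def build_rtd_alarm(agent_id: str, threshold: float = 3.0) -> dict:
    """Keyword arguments for CloudWatch put_metric_alarm on the RTD metric."""
    return {
        "AlarmName": f"RTD-High-{agent_id}",        # hypothetical naming scheme
        "Namespace": "AgentSRE/Reasoning",
        "MetricName": "ReasoningTraceDepth",
        "Dimensions": [{"Name": "AgentId", "Value": agent_id}],
        "Statistic": "Average",                     # assumption; Maximum is stricter
        "Period": 300,                              # one 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",         # idle agents should not alarm
    }


alarm = build_rtd_alarm("incident-agent")

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

&lt;p&gt;Averaging over the window smooths out a single bad task but can also hide one agent that is badly stuck. Whether Average or Maximum fits better depends on how many tasks land in each five-minute window.&lt;/p&gt;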

&lt;p&gt;The Connection to Your Existing SLI Framework&lt;br&gt;
If you've been following this series:&lt;/p&gt;

&lt;p&gt;Post 4 introduced HER — your human escalation signal. HER is what happens after the agent gives up re-planning.&lt;br&gt;
Post 7 introduced RSI — your retry storm signal at the control plane level.&lt;br&gt;
This post introduces RTD — the earlier, reasoning-level signal that predicts both RSI and HER before they breach.&lt;/p&gt;

&lt;p&gt;RTD → feeds → RSI → feeds → HER&lt;br&gt;
The three form a causal chain. If you're only watching HER, you're watching the end of the chain. RTD gives you the front.&lt;/p&gt;

&lt;p&gt;The Practical Checklist&lt;br&gt;
Before your next agent ships, add to your production-readiness checklist:&lt;br&gt;
☐ Decision trace structured logging configured (one JSON entry per task, not per span)&lt;br&gt;
☐ RTD metric emitting to CloudWatch / Prometheus&lt;br&gt;
☐ RTD baseline established (30-day shadow mode — same as HER baseline protocol)&lt;br&gt;
☐ RTD alarm set at threshold &amp;gt; 3&lt;br&gt;
☐ RTD correlated to HER in your dashboards — rising RTD without rising HER means the agent is struggling but not yet escalating&lt;br&gt;
Your OTel traces are correct. They're just answering the wrong question.&lt;br&gt;
&lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z9lhkgk1fhuawt4t8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z9lhkgk1fhuawt4t8zg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformeng</category>
    </item>
    <item>
      <title>The AI Agent Cost Ceiling Problem: Why Your AWS Bill Is Your Reliability Alert</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Mon, 11 May 2026 21:16:09 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-ai-agent-cost-ceiling-problem-why-your-aws-bill-is-your-reliability-alert-3kn5</link>
      <guid>https://dev.to/ajaydevineni/the-ai-agent-cost-ceiling-problem-why-your-aws-bill-is-your-reliability-alert-3kn5</guid>
      <description>&lt;p&gt;Production AI agents fail on tool calls 3–15% of the time. That's not a failure rate you fix — it's a reality you design around.&lt;/p&gt;

&lt;p&gt;The teams that have designed around it have circuit breakers: token budgets, retry limits, cost anomaly alerts wired to incident response.&lt;/p&gt;

&lt;p&gt;The teams that haven't find out from their AWS bill.&lt;/p&gt;

&lt;p&gt;This article is about the reliability infrastructure between those two outcomes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Retry Loop Failure Mode
&lt;/h2&gt;

&lt;p&gt;When an AI agent calls a tool and gets an ambiguous response — not an error, not a success, just something unexpected — most agents do what they're designed to do: they try again. And again. And again.&lt;/p&gt;

&lt;p&gt;Without a hard retry limit, this becomes a loop. Without a token budget cap, the loop has no ceiling. Without observability instrumentation specific to retry signatures, your standard dashboards show nothing unusual until the cost spike appears.&lt;/p&gt;

&lt;p&gt;In documented production deployments, the cost spike is the first operational signal that something has gone wrong. By that point, if the agent has write permissions and has queued remediation actions, the incident may have worsened before anyone noticed the loop.&lt;/p&gt;

&lt;p&gt;This is the reliability problem behind the cost problem. The bill is the symptom. The missing circuit breaker is the cause.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Standard SLIs Don't Catch It
&lt;/h2&gt;

&lt;p&gt;Request latency: normal. The agent is responding within SLO. Error rate: zero. Every call returns something — just not what the agent expected. Availability: 100%. The agent is up and running.&lt;/p&gt;

&lt;p&gt;The retry loop produces none of the infrastructure-layer signals your existing alerts are watching.&lt;/p&gt;

&lt;p&gt;What it does produce is a Tool Invocation Efficiency (TIE) anomaly — your agent is making 4, 6, 8 tool calls per task when its baseline is 2. That ratio climbing is your early warning. It fires before the billing cycle closes. It fires before the incident escalates.&lt;/p&gt;

&lt;p&gt;This is why TIE is a first-class SLI in the agentsre library. It catches what latency and error rate miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;Every production AI agent needs three reliability controls specifically for the retry loop failure mode:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Hard Token Budget Per Session
&lt;/h3&gt;

&lt;p&gt;Set a maximum token count per agent session. Not a soft recommendation in the system prompt — a hard limit enforced at the infrastructure layer. When the agent hits the limit, it stops executing and routes to your escalation path.&lt;/p&gt;

&lt;p&gt;The budget should be sized at 3x your P95 task token usage. A task that normally uses 2,000 tokens gets a 6,000-token ceiling. Anything above that is a signal, not normal operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskRecord&lt;/span&gt;

&lt;span class="c1"&gt;# Track token usage as part of your task record
&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TaskRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident-analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# elevated — baseline is 2.3
&lt;/span&gt;    &lt;span class="n"&gt;decision_confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.71&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# TIE will catch the retry signature before the bill does
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident-analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;breached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;trigger_circuit_breaker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
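
&lt;p&gt;The budget itself can be enforced with very little code, as long as it lives in the wrapper around your model and tool calls and not in the prompt. The sketch below is a minimal illustration, assuming the 3x P95 sizing rule from above. The SessionTokenBudget class and TokenBudgetExceeded exception are hypothetical names and are not part of the agentsre API.&lt;/p&gt;

```python
class TokenBudgetExceeded(Exception):
    """Raised when a session crosses its hard token ceiling."""


class SessionTokenBudget:
    """Hard per-session token ceiling, sized as a multiple of P95 task usage."""

    def __init__(self, p95_task_tokens: int, multiplier: int = 3):
        self.ceiling = p95_task_tokens * multiplier
        self.used = 0

    def charge(self, tokens: int) -> int:
        """Record usage after each model call and return the remaining budget.

        Raises TokenBudgetExceeded once the ceiling is crossed, so the caller
        must stop executing and route to its escalation path.
        """
        self.used += tokens
        if self.used > self.ceiling:
            raise TokenBudgetExceeded(
                f"used {self.used} of {self.ceiling} tokens"
            )
        return self.ceiling - self.used


budget = SessionTokenBudget(p95_task_tokens=2000)   # 6,000-token ceiling
remaining = budget.charge(2500)
remaining = budget.charge(2500)

try:
    budget.charge(2500)          # a third large call crosses the ceiling
    escalated = False
except TokenBudgetExceeded:
    escalated = True             # hand off to the escalation path here
```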



&lt;h3&gt;
  
  
  2. Retry Loop Signature in Observability
&lt;/h3&gt;

&lt;p&gt;A retry loop has a distinctive signature: tool call count per task climbing above baseline, task completion time extending beyond P99, and decision confidence declining across sequential attempts.&lt;/p&gt;

&lt;p&gt;Configure a CloudWatch alarm on TIE drift: when tool calls per task exceed 2x baseline for 10 consecutive minutes, fire an alert. This is your early warning before the cost spike and before the incident escalates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch alarm for retry loop detection
&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt; &lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;alarm&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;alarm&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentRetryLoopDetected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ToolInvocationEfficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentReliability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;statistic&lt;/span&gt; &lt;span class="n"&gt;Average&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;period&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;comparison&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;operator&lt;/span&gt; &lt;span class="n"&gt;GreaterThanThreshold&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;periods&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; \
  &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;alarm&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt; &lt;span class="n"&gt;arn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;AgentAlerts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
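
&lt;p&gt;The alarm above assumes the metric is already published as a ratio to baseline. A rough sketch of that computation, and of the "two consecutive 5-minute periods" rule the alarm encodes, might look like this. The function names are hypothetical, and the baseline value is whatever you measured for the task class.&lt;/p&gt;

```python
from typing import List


def tie_ratio(tool_calls_per_task: List[int], baseline: float) -> float:
    """Average tool calls per task in one period, as a multiple of baseline."""
    average = sum(tool_calls_per_task) / len(tool_calls_per_task)
    return average / baseline


def sustained_breach(period_ratios: List[float], threshold: float = 2.0,
                     periods: int = 2) -> bool:
    """True when the most recent `periods` ratios all exceed the threshold."""
    recent = period_ratios[-periods:]
    return len(recent) == periods and all(r > threshold for r in recent)


# Three tasks in the latest 5-minute period, against a baseline of 2 calls per task
latest = tie_ratio([8, 6, 4], baseline=2.0)
looping = sustained_breach([1.1, 2.4, latest])
print(latest, looping)
```

&lt;p&gt;Requiring consecutive breaches trades a few minutes of detection delay for fewer false alarms from a single unusually hard task.&lt;/p&gt;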



&lt;h3&gt;
  
  
  3. Cost Anomaly as Incident Trigger
&lt;/h3&gt;

&lt;p&gt;Wire your AWS Cost Anomaly Detection to your incident management system. An AI agent whose cost per hour doubles is experiencing a reliability event — treat it as one.&lt;/p&gt;

&lt;p&gt;Set a cost anomaly threshold at 150% of your rolling 7-day average for the relevant Lambda functions and Bedrock invocations. When it fires, it routes to the same on-call channel as your availability alerts — because it is an availability signal.&lt;/p&gt;
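
&lt;p&gt;AWS Cost Anomaly Detection is a managed service with its own detection logic, so the snippet below is not how the service works internally. It is only a sketch of the arithmetic behind the 150% rule, useful for a quick check against exported cost data. The function names and the daily cost figures are hypothetical.&lt;/p&gt;

```python
from typing import List


def rolling_average(daily_costs: List[float]) -> float:
    """Mean of the trailing window (seven daily totals in the text above)."""
    return sum(daily_costs) / len(daily_costs)


def is_cost_anomaly(daily_costs: List[float], today_cost: float,
                    threshold_pct: int = 150) -> bool:
    """True when today's cost exceeds threshold_pct percent of the rolling average."""
    baseline = rolling_average(daily_costs)
    return today_cost * 100 > baseline * threshold_pct


last_week = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]   # USD per day, illustrative
anomaly = is_cost_anomaly(last_week, today_cost=6.5)
print(anomaly)
```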




&lt;h2&gt;
  
  
  The Numbers Behind This
&lt;/h2&gt;

&lt;p&gt;40% of agentic AI projects are expected to be cancelled by 2027. Cost overruns and inadequate risk controls rank in the top three reasons. These are not independent failure modes — they're the same failure mode at different stages of the same incident.&lt;/p&gt;

&lt;p&gt;The retry loop causes the cost overrun. The missing circuit breaker causes the retry loop. The missing circuit breaker exists because teams treat AI agent reliability as an application problem rather than an infrastructure problem requiring SRE governance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What To Do Before Your Next Agent Goes Live
&lt;/h2&gt;

&lt;p&gt;Three checks before any AI agent touches production:&lt;/p&gt;

&lt;p&gt;Check 1: Does this agent have a hard token budget enforced at the infrastructure layer? Not a prompt instruction — a hard limit.&lt;/p&gt;

&lt;p&gt;Check 2: Is TIE instrumented per task class with a 2x-baseline breach alert configured?&lt;/p&gt;

&lt;p&gt;Check 3: Is cost anomaly detection wired to your incident management system for this agent's associated AWS resources?&lt;/p&gt;

&lt;p&gt;If any answer is no — the agent is not production-ready. It is demo-ready.&lt;/p&gt;

&lt;p&gt;The circuit breaker for the retry loop costs an afternoon to build. The absence of it costs the project.&lt;/p&gt;

&lt;p&gt;Open-source implementation: github.com/Ajay150313/agentsre — the agentsre library instruments TIE, DQR, HER, and AQDD out of the box with AWS CloudWatch integration.&lt;/p&gt;

&lt;p&gt;LinkedIn discussion: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7459711021738307584-x6cv?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7459711021738307584-x6cv?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What ceiling do you have today when an agent starts looping?&lt;/p&gt;

</description>
      <category>sre</category>
      <category>agenticai</category>
      <category>reliability</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Double-Exposure Problem: When AI Agents and AI-Generated Code Fail Together</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Fri, 08 May 2026 02:02:45 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-double-exposure-problem-when-ai-agents-and-ai-generated-code-fail-together-1amk</link>
      <guid>https://dev.to/ajaydevineni/the-double-exposure-problem-when-ai-agents-and-ai-generated-code-fail-together-1amk</guid>
      <description>&lt;p&gt;Amazon's March 2026 AI outages — two separate incidents within three days, totaling more than 6 million lost orders — have done something unusual for the SRE community: they've made a failure mode visible that most teams have been quietly carrying in their production systems without acknowledging.&lt;/p&gt;

&lt;p&gt;The incidents were traced to AI-generated code changes deployed without adequate approval gates. Amazon's response was a 90-day code safety reset across 335 critical systems, with a new requirement that AI-assisted code changes be reviewed by senior engineers before deployment.&lt;/p&gt;

&lt;p&gt;That response is SRE discipline. Applied reactively. This article is about applying it proactively — and about a compounding failure mode most teams haven't modeled yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Double-Exposure Problem
&lt;/h2&gt;

&lt;p&gt;The SRE concept of blast radius asks: when a component fails, what is the maximum scope of impact? Most blast radius models assume that the failing component is one thing — a service, a database, a network partition.&lt;/p&gt;

&lt;p&gt;In 2026 production environments, a new blast radius scenario is emerging that most models don't account for:&lt;/p&gt;

&lt;p&gt;What happens when your AI agent and the AI-generated code it runs on fail simultaneously?&lt;/p&gt;

&lt;p&gt;This is the double-exposure problem. It has three components:&lt;/p&gt;

&lt;p&gt;Exposure 1: AI runtime behavior. Your AI agent operates non-deterministically. Its decisions, tool selections, and reasoning paths vary across invocations. Standard observability (latency, error rate, availability) does not instrument this layer. The semantic failure modes (wrong decisions, context drift, tool compensation) are invisible to your dashboards.&lt;/p&gt;

&lt;p&gt;Exposure 2: AI-generated code changes. Your CI/CD pipeline uses AI assistance to generate infrastructure changes, configuration updates, or application code. According to Lightrun's 2026 survey of 200 senior SRE and DevOps leaders, 43% of these changes require manual debugging in production even after passing QA. Not a single respondent said they were "very confident" that AI-generated code would behave correctly in production.&lt;/p&gt;

&lt;p&gt;Exposure 3: The interaction. When an AI-generated code change deploys to the same environment your agent is operating in, you have two non-deterministic systems interacting. The code change may alter the agent's tool environment, context window, or available action space in ways that manifest as behavioral drift, which your current instrumentation will miss because it measures infrastructure, not agent behavior.&lt;/p&gt;

&lt;p&gt;The result: a production incident that looks like agent degradation. The root cause is a code change. The RCA takes hours because the investigation starts at the wrong layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Standard Observability Misses This
&lt;/h2&gt;

&lt;p&gt;IEEE Spectrum described this failure class in their recent article on quiet AI failures: every monitoring dashboard reads healthy while users report that system decisions are becoming wrong.&lt;/p&gt;

&lt;p&gt;This is structurally identical to what happens in the double-exposure scenario. A code change that subtly alters an agent's tool environment produces no infrastructure-layer signal. The agent's HTTP responses stay at 200. Latency stays within SLO. Error budget stays unburned.&lt;/p&gt;

&lt;p&gt;What changes is the agent's Decision Quality Rate (the percentage of decisions falling within expected behavioral bounds), then Tool Invocation Efficiency (tool calls per completed task), and eventually Human Escalation Rate (the percentage of tasks requiring human intervention).&lt;/p&gt;

&lt;p&gt;None of these are instrumented in a standard observability stack. All of them detect the double-exposure failure mode before it reaches user impact.&lt;/p&gt;
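
&lt;p&gt;To make the three definitions concrete, here is a rough sketch that computes them over a window of completed tasks. The &lt;code&gt;TaskOutcome&lt;/code&gt; record and its field names are assumptions for illustration, not the agentsre data model.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    # Hypothetical record; field names are illustrative.
    tool_calls: int
    decisions: int
    decisions_in_bounds: int
    escalated: bool


def behavioral_slis(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Compute DQR (%), TIE (tool calls per task), and HER (%) for a window."""
    if not outcomes:
        raise ValueError("need at least one completed task")
    tasks = len(outcomes)
    decisions = sum(o.decisions for o in outcomes)
    in_bounds = sum(o.decisions_in_bounds for o in outcomes)
    return {
        # With no decisions recorded, nothing fell out of bounds.
        "dqr": 100.0 * in_bounds / decisions if decisions else 100.0,
        "tie": sum(o.tool_calls for o in outcomes) / tasks,
        "her": 100.0 * sum(o.escalated for o in outcomes) / tasks,
    }
```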




&lt;h2&gt;
  
  
  The Governance Framework
&lt;/h2&gt;

&lt;p&gt;Amazon's 90-day reset is a retroactive version of what proactive SRE governance looks like. Here are the four components that matter, drawn from first principles rather than post-incident response:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The AI Code Change Approval Gate
&lt;/h3&gt;

&lt;p&gt;Every code change touching an AI agent's runtime environment — its tools, configuration, action space, or infrastructure — should require explicit approval before deployment. Not because AI code generation is untrustworthy, but because non-deterministic code changes interacting with non-deterministic runtime systems have a compounding failure surface that standard CI/CD testing cannot fully cover.&lt;/p&gt;

&lt;p&gt;This is not a new concept. Amazon has now required it. The cost of implementing it proactively is hours. The cost of discovering it's missing is incidents.&lt;/p&gt;

&lt;p&gt;Implementation: A dedicated approval stage in your deployment pipeline for changes flagged as AI-generated or agent-environment-adjacent. This is distinct from your standard peer review — it specifically evaluates: does this change touch any agent's tool environment, context configuration, or action space?&lt;/p&gt;
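
&lt;p&gt;One way the flagging step could look, as a sketch only. The path patterns and the &lt;code&gt;needs_agent_env_approval&lt;/code&gt; function are hypothetical; adapt the patterns to wherever your agents' tools, configuration, and infrastructure actually live.&lt;/p&gt;

```python
from fnmatch import fnmatch

# Hypothetical path patterns for files that define an agent's runtime environment.
AGENT_ENV_PATTERNS = (
    "agents/*/tools/*",
    "agents/*/config/*",
    "infra/agents/*",
)


def needs_agent_env_approval(changed_files: list[str], ai_generated: bool) -> bool:
    """Return True when a change must pass the dedicated approval stage."""
    touches_agent_env = any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in AGENT_ENV_PATTERNS
    )
    # Gate on either condition: AI-generated or agent-environment-adjacent.
    return ai_generated or touches_agent_env
```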

&lt;h3&gt;
  
  
  2. Behavioral Baseline Snapshots Around Code Deployments
&lt;/h3&gt;

&lt;p&gt;Apply the governance pattern used for framework version upgrades to AI code changes: snapshot your agent's behavioral baselines before the change deploys, and compare post-deployment behavior against them.&lt;/p&gt;

&lt;p&gt;Specifically, capture per-task-class TIE and DQR baselines immediately before any deployment that touches your agent's environment. Run the deployment in a shadow environment for a minimum review period. If TIE drifts more than 15% or DQR drops more than 15%, flag for human review before promoting to production.&lt;/p&gt;

&lt;p&gt;This is the kind of instrumentation that could have surfaced Amazon's failure earlier in the pipeline: not at the infrastructure layer, but at the behavioral layer where the impact manifested.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskRecord&lt;/span&gt;
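&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre.sprawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FrameworkVersionGovernance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpgradeDecision&lt;/span&gt;  &lt;span class="c1"&gt;# assumed import path for UpgradeDecision&lt;/span&gt;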

&lt;span class="c1"&gt;# Capture baseline before deployment
&lt;/span&gt;&lt;span class="n"&gt;gov&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FrameworkVersionGovernance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tie_drift_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dqr_drift_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;min_shadow_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gov&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;snapshot_baseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-task-class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pre-ai-code-change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tie_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_tie_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dqr_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;current_dqr_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After shadow deployment — evaluate before promoting
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gov&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate_upgrade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-task-class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;production_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pre-ai-code-change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;shadow_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post-ai-code-change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;UpgradeDecision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;block_deployment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;block_reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. A Blast Radius Model for Double-Exposure
&lt;/h3&gt;

&lt;p&gt;Most blast radius models assume one failing component. Run the double-exposure calculation explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which of your production services depend on AI agents?&lt;/li&gt;
&lt;li&gt;Which code paths in those services are AI-generated?&lt;/li&gt;
&lt;li&gt;If both the agent's semantic behavior and the underlying code fail simultaneously, what is the maximum scope of user impact?&lt;/li&gt;
&lt;li&gt;What is the safe degradation sequence — which agent capabilities can you reduce autonomously, and in what order?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This calculation should exist as a named document, owned by a named person, reviewed quarterly. It is the blast radius equivalent of a fire drill — done in advance so the answer is known before the incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. A Proactive Runbook — Not Amazon's Retroactive Reset
&lt;/h3&gt;

&lt;p&gt;Amazon's 90-day reset is a retroactive runbook. Write yours proactively. A minimum viable AI code reliability runbook covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection: Which metrics signal that an AI code change has degraded agent behavior? (Answer: TIE drift, DQR drop, HER increase — not latency or error rate)&lt;/li&gt;
&lt;li&gt;Attribution: How do you determine whether the degradation is agent behavior, code change, or model drift? (Answer: compare against behavioral baseline snapshots captured pre-deployment)&lt;/li&gt;
&lt;li&gt;Containment: What is the fastest path to reverting the code change while maintaining partial agent operation? (Answer: the progressive autonomy constraint ladder — not a binary kill switch)&lt;/li&gt;
&lt;li&gt;Recovery criteria: When is it safe to redeploy? (Answer: shadow behavioral baselines within ±15% of production baseline for 30 consecutive minutes)&lt;/li&gt;
&lt;/ul&gt;
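
&lt;p&gt;The recovery criterion above can be sketched as a small check. This assumes one shadow sample per minute for a single metric (TIE or DQR); the function name and parameters are illustrative.&lt;/p&gt;

```python
def safe_to_redeploy(
    shadow_per_minute: list[float],
    production_baseline: float,
    tolerance: float = 0.15,
    required_minutes: int = 30,
) -> bool:
    """True if the most recent `required_minutes` samples all sit within tolerance."""
    recent = shadow_per_minute[-required_minutes:]
    if required_minutes > len(recent):
        return False  # not enough observation time yet
    allowed = tolerance * abs(production_baseline)
    return all(allowed >= abs(value - production_baseline) for value in recent)
```

&lt;p&gt;A single out-of-band minute inside the window fails the check, which is what "consecutive" requires.&lt;/p&gt;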




&lt;h2&gt;
  
  
  The SRE Perspective on AI Code Generation
&lt;/h2&gt;

&lt;p&gt;The Lightrun finding that 88% of SRE leaders need two to three redeploy cycles to verify an AI-generated fix suggests something straightforward: the testing and verification frameworks for AI-generated code have not kept pace with the adoption of AI code generation.&lt;/p&gt;

&lt;p&gt;This is the same lag that produced Amazon's incidents. And it's the same lag that the SRE community has closed before — with microservices, with Kubernetes, with cloud-native architectures. Each time, capability arrived before governance. The SRE discipline developed the governance.&lt;/p&gt;

&lt;p&gt;The governance for AI-generated code in agent environments exists. Error budgets, blast radius models, approval gates, behavioral baseline comparison — these are standard SRE tools. They need to be applied to a new layer of the stack.&lt;/p&gt;

&lt;p&gt;The open-source implementation is at github.com/Ajay150313/agentsre. The FrameworkVersionGovernance module handles behavioral baseline capture and comparison. The progressive constraint ladder handles safe degradation. Both work for AI code change governance as directly as they do for framework upgrades.&lt;/p&gt;

&lt;p&gt;Amazon spent 6.3 million lost orders learning this lesson. Most teams can learn it for the cost of an afternoon.&lt;/p&gt;




&lt;h2&gt;
  
  
  What To Do This Week
&lt;/h2&gt;

&lt;p&gt;If you're running AI agents in production and using AI-assisted code generation in the same environment:&lt;/p&gt;

&lt;p&gt;Today: Identify which code changes in your last 30 days touched your agent's tool environment, configuration, or action space. Determine whether any were AI-generated. If yes — were they reviewed specifically for agent-environment impact?&lt;/p&gt;

&lt;p&gt;This week: Add an AI code change flag to your deployment pipeline. Start capturing TIE and DQR baselines around any deployment flagged as agent-environment-adjacent.&lt;/p&gt;

&lt;p&gt;This month: Run the double-exposure blast radius calculation. Document the result. Assign an owner. Review it with your team.&lt;/p&gt;

&lt;p&gt;The Amazon incidents happened in March. The Lightrun survey data was collected in January. IEEE Spectrum is calling quiet failure one of the defining challenges of the year.&lt;/p&gt;

&lt;p&gt;The signal is clear. The governance frameworks exist.&lt;/p&gt;




&lt;p&gt;Open-source implementation: github.com/Ajay150313/agentsre&lt;br&gt;
LinkedIn discussion: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-activity-7458330530212835328-36__?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-activity-7458330530212835328-36__?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does your current approval gate for AI-generated code look like? Or is this the first time you've run the double-exposure calculation?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>The Agent Control Plane is an SRE Problem: Governing the Orchestration Layer Nobody is Watching</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Mon, 04 May 2026 23:57:35 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-agent-control-plane-is-an-sre-problem-governing-the-orchestration-layer-nobody-is-watching-4i72</link>
      <guid>https://dev.to/ajaydevineni/the-agent-control-plane-is-an-sre-problem-governing-the-orchestration-layer-nobody-is-watching-4i72</guid>
      <description>&lt;p&gt;IBM's Distinguished Engineer Chris Hay declared this week that "agent control planes and multi-agent dashboards become real in 2026." Gartner projects that 40% of enterprise applications will use task-specific AI agents by 2026. The orchestration infrastructure to manage all of those agents — the control plane — is becoming the most critical and least governed layer in production AI.&lt;/p&gt;

&lt;p&gt;This article applies SRE discipline to the agent control plane: what it is, what failure modes it introduces, and what instrumentation it requires before it goes to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an Agent Control Plane?
&lt;/h2&gt;

&lt;p&gt;In 2026, an agent control plane is the orchestration layer that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives tasks from humans or upstream systems&lt;/li&gt;
&lt;li&gt;Decomposes them into subtasks&lt;/li&gt;
&lt;li&gt;Routes subtasks to specialist agents&lt;/li&gt;
&lt;li&gt;Manages retry, rescheduling, and priority queues across the agent fleet&lt;/li&gt;
&lt;li&gt;Makes autonomous decisions about resource allocation when demand spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The control plane is distinct from the agents it manages. It is infrastructure — the same way a Kubernetes control plane is distinct from the pods it schedules.&lt;/p&gt;

&lt;p&gt;This distinction matters for reliability: when the control plane degrades, it does not degrade one agent. It degrades the entire fleet simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Control Plane Failure Taxonomy
&lt;/h2&gt;

&lt;p&gt;Control plane failures are uniquely difficult to detect because they do not look like single-agent failures. They look like correlated degradation across multiple agents — which standard observability interprets as coincidence or noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Class 1: Routing Drift
&lt;/h3&gt;

&lt;p&gt;The control plane misassigns tasks to suboptimal agents — sending high-complexity reasoning tasks to agents specialized for retrieval, or routing compliance-sensitive tasks through agents without the required tool access. Each individual agent appears healthy. The control plane's routing logic is the failure.&lt;/p&gt;

&lt;p&gt;Observable signal: fleet-wide DQR drops across unrelated task classes simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Class 2: Retry Storms
&lt;/h3&gt;

&lt;p&gt;When multiple downstream agents fail simultaneously, the control plane retries across its full routing table. Each retry generates additional tool calls. If the control plane does not implement backoff and circuit breaking at the routing layer, a partial agent outage generates a retry storm that saturates the entire MCP tool layer.&lt;/p&gt;

&lt;p&gt;Observable signal: fleet-wide TIE spike not attributable to any single agent or task class.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Class 3: Priority Queue Starvation
&lt;/h3&gt;

&lt;p&gt;Under load, control planes must prioritize. If the priority algorithm fails — or if it was never set — low-priority tasks consume resources that high-priority tasks need. Users of business-critical workflows experience silent slowdown while batch jobs consume capacity.&lt;/p&gt;

&lt;p&gt;Observable signal: AQDD breaches across multiple task classes with no corresponding error rate increase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Class 4: Decomposition Accuracy Degradation
&lt;/h3&gt;

&lt;p&gt;As task complexity increases, the control plane's decomposition logic produces subtask sets that are incomplete, redundant, or contradictory. Individual agents execute their subtasks correctly. The composed result is wrong because the decomposition was wrong.&lt;/p&gt;

&lt;p&gt;Observable signal: HER climbs fleet-wide — humans are intervening not because agents failed, but because the task decomposition produced nonsensical results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three SLIs Your Control Plane Needs
&lt;/h2&gt;

&lt;p&gt;I extend the agentsre SLI framework with three control plane-specific measurements:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Routing Accuracy Rate (RAR)
&lt;/h3&gt;

&lt;p&gt;The percentage of task assignments that match the optimal agent for the task class, measured against a labeled evaluation set.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RAR(t, w) = (correct_assignments / total_assignments) × 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Establish the baseline over a 30-day calibration window. Alert when RAR drops &amp;gt;15% from baseline. This is the signal that routing logic has drifted, usually because a new task class was added without updating routing rules.&lt;/p&gt;
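
&lt;p&gt;A sketch of the RAR calculation and its alert, under one stated assumption: the 15% drop is read as relative to the baseline rather than as percentage points. Function names are illustrative.&lt;/p&gt;

```python
def routing_accuracy_rate(assigned: list[str], optimal: list[str]) -> float:
    """RAR as a percentage over a labeled evaluation set."""
    if not assigned or len(assigned) != len(optimal):
        raise ValueError("assigned and optimal must be non-empty and equal length")
    correct = sum(a == o for a, o in zip(assigned, optimal))
    return 100.0 * correct / len(assigned)


def rar_alert(current_rar: float, baseline_rar: float, max_drop: float = 0.15) -> bool:
    # Assumption: "drops more than 15%" is a relative drop from baseline.
    return (baseline_rar - current_rar) > max_drop * baseline_rar
```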

&lt;h3&gt;
  
  
  2. Retry Storm Index (RSI)
&lt;/h3&gt;

&lt;p&gt;The ratio of retry-generated tool calls to primary-invocation tool calls across the fleet in a rolling window.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RSI(t, w) = retry_tool_calls / primary_tool_calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Normal RSI baseline is typically 0.05–0.15 (5–15% of tool calls are retries). RSI &amp;gt; 0.50 indicates retry storm conditions. RSI &amp;gt; 1.0 means more retry traffic than primary traffic — the control plane is in a positive feedback loop.&lt;/p&gt;
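
&lt;p&gt;The thresholds above map onto a simple classifier. The "elevated" band between the normal range and the storm threshold is my own label for illustration; the article only names the other three states.&lt;/p&gt;

```python
def retry_storm_index(retry_tool_calls: int, primary_tool_calls: int) -> float:
    """RSI for one rolling window."""
    if primary_tool_calls == 0:
        raise ValueError("no primary tool calls in window")
    return retry_tool_calls / primary_tool_calls


def classify_rsi(rsi: float) -> str:
    if rsi > 1.0:
        return "feedback_loop"  # more retry traffic than primary traffic
    if rsi > 0.50:
        return "retry_storm"
    if rsi > 0.15:
        return "elevated"  # assumed band, above the typical baseline
    return "normal"
```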

&lt;h3&gt;
  
  
  3. Decomposition Completeness Score (DCS)
&lt;/h3&gt;

&lt;p&gt;The percentage of decomposed subtask sets that, when executed, produce outputs covering all requirements of the original task.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DCS requires a completeness validator per task class.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is the hardest to instrument — it requires semantic understanding of task requirements. Start with a rule-based validator for your highest-volume task classes before attempting ML-based validation.&lt;/p&gt;
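
&lt;p&gt;A rule-based validator can start as plain set coverage. In this sketch each task declares the requirement labels it must satisfy and each subtask output declares the labels it covers; both the labeling scheme and the function names are assumptions.&lt;/p&gt;

```python
def is_complete(required: set[str], subtask_outputs: list[set[str]]) -> bool:
    """A decomposition is complete if its outputs jointly cover every requirement."""
    covered = set().union(*subtask_outputs)
    return required.issubset(covered)


def decomposition_completeness_score(
    tasks: list[tuple[set[str], list[set[str]]]],
) -> float:
    """DCS as the percentage of tasks whose decomposition is complete."""
    if not tasks:
        raise ValueError("need at least one task")
    complete = sum(is_complete(req, outs) for req, outs in tasks)
    return 100.0 * complete / len(tasks)
```

&lt;p&gt;This catches incomplete decompositions only. Redundant or contradictory subtasks need separate rules.&lt;/p&gt;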

&lt;h2&gt;
  
  
  The Control Plane Governance Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Separate SLO Ownership
&lt;/h3&gt;

&lt;p&gt;The control plane is not owned by the same person who owns the agents. It is a separate system with a separate error budget. The control plane SLO owner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is paged when RAR drops &amp;gt;15% from baseline&lt;/li&gt;
&lt;li&gt;Is paged when RSI exceeds 0.50 for 10+ minutes&lt;/li&gt;
&lt;li&gt;Owns the retry storm runbook&lt;/li&gt;
&lt;li&gt;Reviews control plane decomposition logic on every new task class addition&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Retry Storm Runbook (minimum viable version)
&lt;/h3&gt;

&lt;p&gt;Every production control plane needs this runbook before launch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detection&lt;/strong&gt;: RSI &amp;gt; 0.50 sustained 10 minutes → page control plane owner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate action&lt;/strong&gt;: Reduce control plane retry limit from default (3) to 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit breaking&lt;/strong&gt;: Identify failing agents via fleet-wide TIE spike attribution. Apply a circuit breaker to each one (open when its semantic validation rate falls below 85%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery&lt;/strong&gt;: Restore retry limit only after RSI returns to &amp;lt; 0.20 for 15 consecutive minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postmortem trigger&lt;/strong&gt;: Any RSI &amp;gt; 1.0 event requires a postmortem within 48 hours&lt;/li&gt;
&lt;/ol&gt;
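
&lt;p&gt;Steps 1, 2, and 4 of the runbook can be sketched as a small state holder. This assumes one RSI sample per minute; &lt;code&gt;RetryStormGuard&lt;/code&gt; is a hypothetical class, and paging and circuit breaking are left to the caller.&lt;/p&gt;

```python
class RetryStormGuard:
    """Hypothetical controller for runbook steps 1, 2, and 4."""

    def __init__(self, default_retry_limit: int = 3) -> None:
        self.default_retry_limit = default_retry_limit
        self.retry_limit = default_retry_limit
        self._storm_minutes = 0
        self._calm_minutes = 0

    def observe(self, rsi: float) -> int:
        # Track consecutive minutes above the storm threshold
        # and consecutive minutes below the recovery threshold.
        self._storm_minutes = self._storm_minutes + 1 if rsi > 0.50 else 0
        self._calm_minutes = self._calm_minutes + 1 if 0.20 > rsi else 0
        if self._storm_minutes >= 10:
            self.retry_limit = 1  # step 2: clamp retries
        elif self._calm_minutes >= 15:
            self.retry_limit = self.default_retry_limit  # step 4: restore
        return self.retry_limit
```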

&lt;h3&gt;
  
  
  Control Plane Version Governance
&lt;/h3&gt;

&lt;p&gt;Apply the same framework upgrade governance to control plane versions as to agent framework versions: snapshot RAR, RSI, and DCS baselines before any control plane update. Run shadow traffic. Block promotion if any metric drifts beyond threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation on AWS
&lt;/h2&gt;

&lt;p&gt;The three control plane SLIs instrument naturally on Bedrock's orchestration layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAR&lt;/strong&gt;: Evaluate routing decisions by comparing &lt;code&gt;agentId&lt;/code&gt; in Bedrock orchestration traces against a task-class-to-optimal-agent mapping in DynamoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RSI&lt;/strong&gt;: Count &lt;code&gt;RETRY&lt;/code&gt; events vs &lt;code&gt;INVOKE&lt;/code&gt; events in Bedrock CloudWatch logs, published as a ratio metric per 5-minute window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DCS&lt;/strong&gt;: Lambda validator comparing subtask outputs against original task requirements, triggered by task completion events via EventBridge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full implementation is in the agentsre library: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting the Arc
&lt;/h2&gt;

&lt;p&gt;This is the fourth layer of the AI-SRE reliability framework:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Single-agent SLIs (DQR, TIE, HER, AQDD)&lt;/li&gt;
&lt;li&gt;A2A semantic boundary validation + circuit breaker&lt;/li&gt;
&lt;li&gt;Agent Sprawl governance (fleet inventory, framework canary, deprecation alerting)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent Control Plane SLIs (RAR, RSI, DCS) — this article&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer adds governance to the next abstraction level of the same infrastructure problem: autonomous AI operating in production without adequate reliability discipline.&lt;/p&gt;

&lt;p&gt;LinkedIn discussion: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feko995q2bjjulk4yb93z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feko995q2bjjulk4yb93z.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;a href="https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-controlplane-share-7457213748500475904-yi9g?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-controlplane-share-7457213748500475904-yi9g?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the biggest control plane reliability gap in your environment?&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>agentaichallenge</category>
      <category>python</category>
    </item>
    <item>
      <title>Agent Sprawl is Your Next Production Incident: An SRE Response to Datadog's State of AI Engineering 2026</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Fri, 01 May 2026 01:20:45 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/agent-sprawl-is-your-next-production-incident-an-sre-response-to-datadogs-state-of-ai-engineering-3k83</link>
      <guid>https://dev.to/ajaydevineni/agent-sprawl-is-your-next-production-incident-an-sre-response-to-datadogs-state-of-ai-engineering-3k83</guid>
      <description>&lt;p&gt;Datadog published the State of AI Engineering 2026 report this week — real telemetry from over a thousand production environments. Read it. It is the most comprehensive look at AI in production available right now.&lt;/p&gt;

&lt;p&gt;I want to respond from the reliability engineering perspective, because the data reveals a problem the report names but doesn't fully resolve: agent sprawl is now a production reliability crisis, and the SRE discipline does not yet have governance frameworks for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Shows
&lt;/h2&gt;

&lt;p&gt;Three findings stand out from an SRE perspective:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework adoption doubled year over year.&lt;/strong&gt; LangChain, LangGraph, Pydantic AI, Vercel AI SDK — up from 9% of organizations in early 2025 to nearly 18% by 2026. Services using agentic frameworks: more than doubled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;70%+ of organizations run three or more models.&lt;/strong&gt; The share running more than six models nearly doubled. Teams are building model portfolios rather than committing to a single provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams add models faster than they retire them.&lt;/strong&gt; Datadog calls this "LLM tech debt." Each overlapping model introduces its own quality, latency, and cost profile. The report is explicit: this becomes a governance problem.&lt;/p&gt;

&lt;p&gt;These three findings combine to describe an environment growing faster than it can be governed. I call this &lt;strong&gt;Agent Sprawl&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Defining Agent Sprawl
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent Sprawl&lt;/strong&gt; — the condition where AI agent infrastructure complexity (frameworks, models, tool layers, orchestration patterns) grows faster than your ability to measure and govern its reliability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is structurally identical to the microservices sprawl problem SRE teams faced between 2015 and 2020. Teams added services faster than they added SLOs. The result: production incidents nobody could attribute because the dependency graph was too complex to observe.&lt;/p&gt;

&lt;p&gt;Agent Sprawl has three specific manifestations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Framework-Invisible Call Complexity
&lt;/h3&gt;

&lt;p&gt;When you add LangChain, LangGraph, or any orchestration framework, it adds steps and paths you did not write — retry logic, fallback handlers, context window management, tool routing. All of this happens between your application code and your observability layer.&lt;/p&gt;

&lt;p&gt;Your SLIs measure at the application boundary. Framework-added calls are invisible.&lt;/p&gt;

&lt;p&gt;This means your Tool Invocation Efficiency (TIE) baseline — tool calls per task completion — is measuring a mix of your agent's behavior and your framework's behavior. When you upgrade the framework, both change simultaneously. You cannot separate them.&lt;/p&gt;

&lt;p&gt;In practice, across regulated production environments I've studied: TIE baselines can drift 30–40% after a framework major version upgrade with no corresponding change in the agent's task logic. The baseline shift looks like agent degradation. It's actually framework overhead. Teams spend hours on a false RCA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Instrument at the framework output layer, not the application layer. Capture tool invocations after framework processing. Then freeze your TIE baseline before any upgrade and compare shadow traffic before promoting.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Multi-Model SLO Orphaning
&lt;/h3&gt;

&lt;p&gt;If 70% of organizations run three or more models, and SLOs were defined only for the first one, then 70% of organizations have at least two models with an SLO ownership gap they haven't acknowledged.&lt;/p&gt;

&lt;p&gt;SLOs are set once — typically when the first model is deployed. As models 2, 3, 4, 5, 6 are added for specific task classes, latency profiles, or cost tiers, nobody revisits the SLO ownership model. Models run in production with no named owner, no baseline, no error budget.&lt;/p&gt;

&lt;p&gt;When model 3 degrades, there is no owner to page, no baseline to compare against, no runbook to execute. The degradation surfaces as a customer complaint, not an alert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Treat every model in your fleet like a microservice. Each model gets: a named owner (not a team — a person), a task-class-specific SLO, and a 30-day observation baseline before the SLO is enforced.&lt;/p&gt;
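&lt;p&gt;One way to make the "owner, task-class SLO, 30-day observation window" rule concrete is a small record per model. This is a sketch only: &lt;code&gt;ModelSLO&lt;/code&gt;, its fields, and the example values are assumptions made for illustration, not the agentsre data model.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: the 30-day observation baseline from the text, as a fixed window.
OBSERVATION_WINDOW = timedelta(days=30)


@dataclass(frozen=True)
class ModelSLO:
    """Hypothetical per-model SLO record."""
    model_id: str
    task_class: str
    owner: str                 # a named person, not a team alias
    target_dqr: float          # task-class-specific objective
    observation_started: date  # first day of baseline collection

    def enforced_on(self):
        """First day on which the SLO may page its owner."""
        return self.observation_started + OBSERVATION_WINDOW

    def is_enforced(self, today):
        """True once the observation window has fully elapsed."""
        return today >= self.enforced_on()


slo = ModelSLO(
    model_id="model-3",
    task_class="fraud-detection",
    owner="owner@team.com",
    target_dqr=90.0,
    observation_started=date(2026, 4, 1),
)

print(slo.enforced_on())                    # 2026-05-01
print(slo.is_enforced(date(2026, 4, 15)))   # False
```

&lt;p&gt;Keeping the enforcement date derived from the observation start, rather than stored separately, avoids a model being paged on before its baseline exists.&lt;/p&gt;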

&lt;h3&gt;
  
  
  3. LLM Tech Debt as a Reliability Liability
&lt;/h3&gt;

&lt;p&gt;Deprecated models running in agent chains create silent compatibility risks. When a provider announces deprecation, teams with models buried inside multi-step chains often miss the migration window. The model ages. Safety training falls behind. Decision Quality Rate declines slowly — too slowly to trigger a threshold alert — until accumulated drift surfaces as a production incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Treat model deprecation notices the same way you treat dependency CVEs. Automate alerts at 60, 30, and 7 days before end-of-life. Build the migration ticket at announcement time, not at expiry.&lt;/p&gt;
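&lt;p&gt;The 60/30/7-day alert schedule is simple date arithmetic. The sketch below uses only the Python standard library. The function names are hypothetical, and the end-of-life date is an example value rather than a real provider announcement.&lt;/p&gt;

```python
from datetime import date, timedelta

# Alert offsets from the text: 60, 30, and 7 days before end-of-life.
ALERT_OFFSETS_DAYS = (60, 30, 7)


def deprecation_alert_dates(end_of_life, offsets=ALERT_OFFSETS_DAYS):
    """Return the calendar dates on which each alert should fire."""
    return [end_of_life - timedelta(days=d) for d in offsets]


def alerts_due(end_of_life, today, offsets=ALERT_OFFSETS_DAYS):
    """Return the offsets whose alert date is today or already past."""
    return [d for d in offsets if today >= end_of_life - timedelta(days=d)]


eol = date(2027, 6, 1)  # example end-of-life date
schedule = deprecation_alert_dates(eol)
print([d.isoformat() for d in schedule])
# ['2027-04-02', '2027-05-02', '2027-05-25']

print(alerts_due(eol, today=date(2027, 5, 10)))
# [60, 30]
```

&lt;p&gt;Running &lt;code&gt;alerts_due&lt;/code&gt; from a daily scheduled job is one option. Creating the migration ticket at announcement time remains a separate step that this sketch does not cover.&lt;/p&gt;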




&lt;h2&gt;
  
  
  The Governance Framework Agent Sprawl Needs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Agent Fleet Inventory
&lt;/h3&gt;

&lt;p&gt;Before you can govern sprawl, you need to know what you're governing. Maintain a living inventory with, for each component: framework and version, model(s) used, task classes handled, named SLO owner, current TIE/DQR baselines, and deprecation dates.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre.sprawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentFleetInventory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FleetComponent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ComponentType&lt;/span&gt;

&lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentFleetInventory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FleetComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;component_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;component_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ComponentType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fraud-detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;slo_owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner@team.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;# named human — not a team
&lt;/span&gt;    &lt;span class="n"&gt;baseline_established_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;deprecation_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2027-06-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_slo_review&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-04-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_tie_baseline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_dqr_baseline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;91.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quarterly_review_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fleet governance score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fleet_governance_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Framework Version Governance — Canary Before Promotion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
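&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# UpgradeDecision is used below; importing it from agentsre.sprawl is an assumption
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre.sprawl&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FrameworkVersionGovernance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UpgradeDecision&lt;/span&gt;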

&lt;span class="n"&gt;gov&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FrameworkVersionGovernance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tie_drift_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# block if TIE drifts &amp;gt;15%
&lt;/span&gt;    &lt;span class="n"&gt;dqr_drift_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# block if DQR drops &amp;gt;15%
&lt;/span&gt;    &lt;span class="n"&gt;min_shadow_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Before upgrade: snapshot production baseline
&lt;/span&gt;&lt;span class="n"&gt;gov&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;snapshot_baseline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;framework_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain-0.2.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tie_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;production_tie_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dqr_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;production_dqr_samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After 48hrs shadow traffic:
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gov&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate_upgrade&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;production_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain-0.2.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;shadow_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain-0.3.x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;UpgradeDecision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# framework added hidden overhead — don't promote
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
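&lt;p&gt;To show what thresholds such as 1.15 and 0.85 can mean in practice, here is a standalone sketch that compares mean shadow-traffic values against the frozen baseline as ratios. It is one plausible reading of the parameters above, not the agentsre implementation. The &lt;code&gt;evaluate_drift&lt;/code&gt; function, its return strings, and the sample values are all hypothetical.&lt;/p&gt;

```python
from statistics import mean


def evaluate_drift(baseline_tie, shadow_tie, baseline_dqr, shadow_dqr,
                   tie_drift_threshold=1.15, dqr_drift_threshold=0.85,
                   min_shadow_samples=50):
    """Hypothetical gate: compare shadow means to baseline means as ratios."""
    # Refuse to decide on too little shadow traffic.
    if min_shadow_samples > min(len(shadow_tie), len(shadow_dqr)):
        return "INSUFFICIENT_DATA"

    tie_ratio = mean(shadow_tie) / mean(baseline_tie)   # above 1.0 means more tool calls
    dqr_ratio = mean(shadow_dqr) / mean(baseline_dqr)   # below 1.0 means lower quality

    # Block when TIE rose past its threshold or DQR fell past its threshold.
    if tie_ratio > tie_drift_threshold or dqr_drift_threshold > dqr_ratio:
        return "BLOCK"
    return "PROMOTE"


baseline_tie = [2.4] * 60
baseline_dqr = [91.2] * 60

# Shadow traffic with 25% more tool calls per task: blocked.
print(evaluate_drift(baseline_tie, [3.0] * 60, baseline_dqr, [91.0] * 60))  # BLOCK

# Shadow traffic within both thresholds: promoted.
print(evaluate_drift(baseline_tie, [2.5] * 60, baseline_dqr, [90.0] * 60))  # PROMOTE
```

&lt;p&gt;Comparing means is the simplest choice. A production gate might prefer percentiles or a statistical test, since a few expensive tasks can dominate an average.&lt;/p&gt;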



&lt;h3&gt;
  
  
  The Quarterly Multi-Model SLO Review
&lt;/h3&gt;

&lt;p&gt;The review should take 30–60 minutes per quarter. For every model in the fleet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify named owner exists&lt;/li&gt;
&lt;li&gt;Verify baseline is current (&amp;lt; 90 days old)&lt;/li&gt;
&lt;li&gt;Check deprecation schedule against provider announcements&lt;/li&gt;
&lt;li&gt;Review TIE per-model — models with rising TIE relative to task class baseline are drifting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Models scoring below 70 on the governance health score are flagged as governance debt requiring a 30-day remediation window.&lt;/p&gt;
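&lt;p&gt;The article does not spell out how the governance health score is calculated, so the following is purely an illustrative sketch. It assumes four equally weighted checks that mirror the review list, worth 25 points each. The &lt;code&gt;ModelRecord&lt;/code&gt; fields, the 60-day deprecation margin, and the 15% TIE tolerance are assumptions borrowed from other parts of this post, not the scoring used by agentsre.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

FLAG_BELOW = 70  # threshold from the text; the weights below are assumed


@dataclass
class ModelRecord:
    """Hypothetical inventory row used only for this scoring sketch."""
    model_id: str
    slo_owner: Optional[str]
    baseline_established_at: date
    deprecation_date: Optional[date]
    tie_baseline: float
    tie_current: float


def governance_score(m, today):
    """Score 0-100: four assumed checks at 25 points each."""
    checks = [
        bool(m.slo_owner),                                    # named owner exists
        90 > (today - m.baseline_established_at).days,        # baseline under 90 days old
        m.deprecation_date is None
        or (m.deprecation_date - today).days > 60,            # not close to end-of-life
        1.15 >= m.tie_current / m.tie_baseline,               # TIE within 15% of baseline
    ]
    return 25 * sum(checks)


today = date(2026, 6, 9)
healthy = ModelRecord("model-a", "owner@team.com", date(2026, 4, 1),
                      date(2027, 6, 1), 2.4, 2.4)
orphaned = ModelRecord("model-b", None, date(2025, 12, 1), None, 2.4, 3.0)

for record in (healthy, orphaned):
    score = governance_score(record, today)
    status = "governance debt" if FLAG_BELOW > score else "ok"
    print(record.model_id, score, status)
```

&lt;p&gt;With these example records, the first model passes every check, while the second fails on ownership, baseline age, and TIE drift and would be flagged for the 30-day remediation window.&lt;/p&gt;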




&lt;h2&gt;
  
  
  The Datadog Report's Implicit Challenge
&lt;/h2&gt;

&lt;p&gt;The State of AI Engineering 2026 describes an industry in rapid expansion. What it does not fully resolve is the SRE question: who governs all of this, and what does that look like in practice?&lt;/p&gt;

&lt;p&gt;The SRE community has solved exactly this class of problem before — in distributed systems, in microservices, in cloud infrastructure. The discipline already exists. It needs to be applied to the AI agent layer now, before agent sprawl becomes agent chaos.&lt;/p&gt;

&lt;p&gt;The Datadog data tells us the window is closing. Framework adoption doubles in a year. Multi-model fleets become the norm. Model debt accumulates.&lt;/p&gt;

&lt;p&gt;Build the governance layer before the production incidents start.&lt;/p&gt;




&lt;p&gt;Open-source implementation: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt;&lt;br&gt;
LinkedIn discussion: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-reliability-ugcPost-7455786901673902080-BCRM?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_agenticai-sre-reliability-ugcPost-7455786901673902080-BCRM?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's your biggest agent sprawl challenge right now?&lt;/p&gt;

</description>
      <category>sre</category>
      <category>agentaichallenge</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
