<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajay Devineni</title>
    <description>The latest articles on DEV Community by Ajay Devineni (@ajaydevineni).</description>
    <link>https://dev.to/ajaydevineni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862822%2Fddbc52cd-519d-4344-bea2-effb2a513786.png</url>
      <title>DEV Community: Ajay Devineni</title>
      <link>https://dev.to/ajaydevineni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajaydevineni"/>
    <language>en</language>
    <item>
      <title>SOC2 Is a Report, Not a Security Program</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Sat, 27 Jun 2026 01:02:50 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/soc2-is-a-report-not-a-security-programpublished-truetags-security-devops-compliance-cloud-3nm3</link>
      <guid>https://dev.to/ajaydevineni/soc2-is-a-report-not-a-security-programpublished-truetags-security-devops-compliance-cloud-3nm3</guid>
      <description>&lt;p&gt;SOC2 measures whether you have a process, not whether the process works. Here's what real security looks like and why the audit doesn't capture it.&lt;/p&gt;

&lt;p&gt;SOC2 has done more to harm security than help it.&lt;/p&gt;

&lt;p&gt;Not the concept. The theater around it.&lt;/p&gt;

&lt;p&gt;I've watched companies pass SOC2 with MFA "enforced" through a policy document nobody enforces. Access reviews where managers approve 200 entitlements in three minutes. Encryption "in transit and at rest" that stops at the load balancer.&lt;/p&gt;

&lt;p&gt;The auditor signs off. The Type II report goes in the sales deck. Everyone moves on.&lt;/p&gt;

&lt;p&gt;Meanwhile the shared admin credential is still in a Slack DM from 2021.&lt;/p&gt;

&lt;p&gt;The actual problem with the framework&lt;/p&gt;

&lt;p&gt;SOC2 measures whether you have a process, not whether the process works.&lt;/p&gt;

&lt;p&gt;You can document a terrible control consistently and pass. You can run an excellent informal practice and fail.&lt;/p&gt;

&lt;p&gt;The framework has no opinion on outcomes — only on documentation. That distinction matters enormously when you're trying to decide how much weight to give a vendor's compliance report.&lt;/p&gt;

&lt;p&gt;What real security looks like&lt;/p&gt;

&lt;p&gt;Real security looks like:&lt;/p&gt;

&lt;p&gt;Blast radius limits on IAM — not policies that exist, but policies that are scoped, reviewed, and enforced at the boundary&lt;br&gt;
Short-lived credentials everywhere — assume-role with time bounds, not long-lived keys sitting in CI secrets&lt;br&gt;
Peer-reviewed infrastructure changes — the same code review culture you apply to application code&lt;br&gt;
Alerting on identity anomalies — not just "did login succeed" but "is this login pattern normal for this principal at this time"&lt;br&gt;
On-call engineers who can actually contain an incident at 3am — not runbooks that assume the reader has six hours and a working Slack&lt;/p&gt;

&lt;p&gt;None of that is uniquely a SOC2 control. Most of it isn't measured by the audit at all.&lt;/p&gt;

&lt;p&gt;The diagnostic question&lt;/p&gt;

&lt;p&gt;If your security program would collapse the day after the auditor leaves, you don't have a security program. You have a report.&lt;/p&gt;

&lt;p&gt;Compliance is the floor. We keep treating it like the ceiling.&lt;/p&gt;

&lt;p&gt;Four questions worth pressure-testing&lt;/p&gt;

&lt;p&gt;When did someone last actually test the incident response runbook end-to-end — live, with a timer?&lt;br&gt;
How long does it take to rotate every secret in production if one leaks today?&lt;br&gt;
How many engineers have production IAM permissions they haven't used in 90 days?&lt;br&gt;
Can you enumerate every service account and what it can do?&lt;/p&gt;

&lt;p&gt;If any of those answers is "unclear", the SOC2 Type II report won't change that. The report just means you documented the gap consistently.&lt;/p&gt;
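&lt;p&gt;The third question can be answered mechanically once you have last-used data for each grant. The following is a minimal sketch, assuming a hypothetical export of (engineer, permission, last used) tuples; the names and records are illustrative only, and real data would come from your cloud provider's access-activity reports.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Hypothetical export of IAM grants: (engineer, permission, last_used or None).
# These records are made up for illustration.
GRANTS = [
    ("alice", "prod:deploy", datetime(2026, 6, 20, tzinfo=timezone.utc)),
    ("bob", "prod:db-admin", datetime(2026, 1, 5, tzinfo=timezone.utc)),
    ("carol", "prod:iam-write", None),  # granted but never used
]


def stale_grants(grants, now, max_idle_days=90):
    """Return (engineer, permission) pairs idle for more than max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for engineer, permission, last_used in grants:
        # A grant that was never used counts as stale.
        if last_used is None or cutoff > last_used:
            stale.append((engineer, permission))
    return stale


now = datetime(2026, 6, 27, tzinfo=timezone.utc)
for engineer, permission in stale_grants(GRANTS, now):
    print(f"{engineer}: {permission} unused for 90+ days")
```

&lt;p&gt;If producing that input list takes more than a few minutes, that is itself the answer to the question.&lt;/p&gt;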

&lt;p&gt;What controls look good on paper but fail in practice in your environment? Genuinely curious what patterns others are seeing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>iam</category>
      <category>sre</category>
    </item>
    <item>
      <title>If You're a Great Developer, You're Probably a Terrible SRE Here's How to Encode SRE Paranoia Into Your Pipeline So Engineers Don't Have to Think Like One</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Wed, 24 Jun 2026 00:42:34 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/if-youre-a-great-developer-youre-probably-a-terrible-sre-heres-how-to-encode-sre-paranoia-into-4i08</link>
      <guid>https://dev.to/ajaydevineni/if-youre-a-great-developer-youre-probably-a-terrible-sre-heres-how-to-encode-sre-paranoia-into-4i08</guid>
      <description>&lt;p&gt;The LinkedIn post I published three days ago triggered something I didn't expect.&lt;br&gt;
Not disagreement — validation. Senior engineers from payments companies, healthcare platforms, and financial infrastructure all said the same thing in different words: they've watched brilliant developers ship changes that worked perfectly in staging and detonated in production, not because the engineers were careless, but because their entire training rewards forward motion while reliability work rewards the opposite.&lt;br&gt;
This post is the practical follow-up. The opinion was: developers and SREs think differently, and you can't fix that by handing developers a Terraform module and calling it ownership. The practice is: here's how you encode SRE paranoia into automated gates so the system enforces what culture can't.&lt;br&gt;
The Core Problem With "Shift Left"&lt;br&gt;
A shift-left mindset means SREs can embed reliability principles from Dev to Ops, baking reliability and resiliency into each process, app, and code change. That's the theory. The practice is that "baking reliability in" almost always means asking developers to think like SREs, internalizing failure-mode thinking as a natural habit. (Source: Dynatrace)&lt;br&gt;
That doesn't work at scale. It works for the senior engineer who has been paged at 3 AM enough times. It doesn't work for the engineer who has never been on-call for a service they built.&lt;br&gt;
In a traditional DevOps environment, the developer who wrote the code is often the one paged, focusing on a hotfix to restore the pipeline. In an SRE-driven environment, the SRE team manages the incident using a pre-defined playbook, focusing on automated remediation to bring the system back within its SLO parameters while the developers continue their sprint. (Source: Full-Stack Techies)&lt;br&gt;
The SRE-driven model works because the SRE's paranoia is encoded into the playbook, the SLO, and the error budget — not because the developer has learned to think differently. The developer doesn't need to internalize the mindset. The system has already internalized it for them.&lt;br&gt;
That's the design principle. Encode the paranoia. Don't outsource it to culture.&lt;br&gt;
What SRE Paranoia Actually Looks Like in Code&lt;br&gt;
An SRE reviewing a production change asks five questions that a developer typically doesn't:&lt;br&gt;
What's the rollback path, and have we tested it recently? Not "does a rollback exist" — does the team have a recent drill showing it takes under 5 minutes?&lt;br&gt;
What happens during deployment, not just after? Connections in flight. Transactions mid-way through. Cache state that doesn't match the new schema.&lt;br&gt;
What does this change look like at 10x load with one dependency degraded? Not at nominal load with everything healthy.&lt;br&gt;
What's the blast radius if this goes wrong at 2 AM with one engineer on-call? Can one person contain it, or does it need three?&lt;br&gt;
What's the SLO burn rate impact of a 1% error rate for 10 minutes? Does the team have that calculation, or will they be running it during the incident?&lt;br&gt;
None of these questions require the developer to become an SRE. They require the CI/CD pipeline to refuse to proceed until these questions are answered — in code, not in conversation.&lt;br&gt;
Introducing SRE Paranoia Gates&lt;br&gt;
An SRE Paranoia Gate is an automated production readiness check that encodes one specific failure-mode question as a machine-enforceable constraint. The gate runs in CI/CD. It produces a pass/fail signal. It has a named owner — an SRE who wrote it and is accountable for its accuracy.&lt;br&gt;
Each gate maps to one of the five questions above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agentsre/sre_paranoia_gates.py

from dataclasses import dataclass, field
from typing import List, Optional, Callable
from datetime import datetime, timezone, timedelta
from enum import Enum
import json


class GateResult(Enum):
    PASS   = "pass"
    FAIL   = "fail"
    WARN   = "warn"   # Won't block but logs for SRE review


@dataclass
class ParanoiaGateCheck:
    """
    Result of one SRE Paranoia Gate evaluation.

    Every gate check is logged, pass or fail.
    Passes build confidence. Failures block deployment.
    Warns surface to SRE review without blocking.
    """
    gate_id: str
    gate_name: str
    result: GateResult
    reason: str
    sre_owner: str
    evidence: Optional[dict] = None
    checked_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_dict(self) -&amp;gt; dict:
        return {
            "gate_id": self.gate_id,
            "gate_name": self.gate_name,
            "result": self.result.value,
            "reason": self.reason,
            "sre_owner": self.sre_owner,
            "evidence": self.evidence,
            "checked_at": self.checked_at,
        }


class SREParanoiaGateRunner:
    """
    Run all registered SRE Paranoia Gates before a production deployment.

    Gates encode the five questions a seasoned SRE asks before any change:
        Gate 1: Is the rollback path tested and under 5 minutes?
        Gate 2: Is in-flight traffic handled during deployment?
        Gate 3: Has this been tested at degraded-dependency load?
        Gate 4: Is the blast radius survivable by one on-call engineer?
        Gate 5: Is the SLO error budget healthy enough to absorb a 1% error rate?

    One FAIL blocks the entire deployment.
    WARN results surface to SRE review but don't block.
    """

    def __init__(self, service_name: str, sre_owner: str):
        self.service_name = service_name
        self.sre_owner = sre_owner
        self._gates: List[Callable] = []

    def register(self, gate_fn: Callable) -&amp;gt; None:
        """Register a gate function. Each gate returns ParanoiaGateCheck."""
        self._gates.append(gate_fn)

    def run_all(self, context: dict) -&amp;gt; dict:
        """
        Run all registered gates against deployment context.

        Args:
            context: Deployment metadata including service config,
                     rollback info, blast radius, SLO state, load test results.

        Returns:
            Summary with all gate results and deployment decision.
        """
        results = []
        blocked = False

        for gate_fn in self._gates:
            try:
                check = gate_fn(context, self.sre_owner)
                results.append(check)
                if check.result == GateResult.FAIL:
                    blocked = True
            except Exception as e:
                # Gate evaluation failure = FAIL, not skip
                results.append(ParanoiaGateCheck(
                    gate_id="gate_error",
                    gate_name=gate_fn.__name__,
                    result=GateResult.FAIL,
                    reason=f"Gate evaluation raised exception: {str(e)}",
                    sre_owner=self.sre_owner,
                    evidence={"exception": str(e)}
                ))
                blocked = True

        return {
            "service": self.service_name,
            "deployment_approved": not blocked,
            "gates_run": len(results),
            "gates_passed": sum(1 for r in results if r.result == GateResult.PASS),
            "gates_failed": sum(1 for r in results if r.result == GateResult.FAIL),
            "gates_warned": sum(1 for r in results if r.result == GateResult.WARN),
            "results": [r.to_dict() for r in results],
            "evaluated_at": datetime.now(timezone.utc).isoformat(),
            # WARN results do not block, so "approved" means "no gate failed".
            "decision": (
                "APPROVED: no SRE paranoia gates failed."
                if not blocked
                else "BLOCKED: one or more SRE paranoia gates failed. "
                     "Fix the flagged conditions before proceeding."
            )
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Five Gates, Implemented&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gate 1: Rollback tested and under 5 minutes
def gate_rollback_readiness(context: dict, owner: str) -&amp;gt; ParanoiaGateCheck:
    """
    Developers assume rollback exists. SREs verify it works in under 5 minutes.
    The question isn't 'do we have a rollback' but 'did we test it recently?'
    """
    rollback = context.get("rollback", {})
    last_drill_date = rollback.get("last_tested_at")
    p95_minutes = rollback.get("p95_duration_minutes")

    if not last_drill_date:
        return ParanoiaGateCheck(
            gate_id="SPG-001",
            gate_name="Rollback Readiness",
            result=GateResult.FAIL,
            reason="No rollback drill date recorded. Rollbacks untested are rollbacks that fail at 3 AM.",
            sre_owner=owner
        )

    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
    last_drill = datetime.fromisoformat(last_drill_date.replace("Z", "+00:00"))
    if last_drill.tzinfo is None:
        # Treat naive timestamps as UTC so the subtraction below cannot raise.
        last_drill = last_drill.replace(tzinfo=timezone.utc)
    days_since = (datetime.now(timezone.utc) - last_drill).days

    if days_since &amp;gt; 30:
        return ParanoiaGateCheck(
            gate_id="SPG-001",
            gate_name="Rollback Readiness",
            result=GateResult.FAIL,
            reason=f"Rollback last tested {days_since} days ago. Require a drill within the last 30 days.",
            sre_owner=owner,
            evidence={"last_tested_at": last_drill_date, "days_since": days_since}
        )

    # A missing measurement fails the gate instead of silently passing.
    if p95_minutes is None:
        return ParanoiaGateCheck(
            gate_id="SPG-001",
            gate_name="Rollback Readiness",
            result=GateResult.FAIL,
            reason="No rollback p95 duration recorded. SRE target is &amp;lt; 5 minutes.",
            sre_owner=owner
        )

    if p95_minutes &amp;gt; 5:
        return ParanoiaGateCheck(
            gate_id="SPG-001",
            gate_name="Rollback Readiness",
            result=GateResult.FAIL,
            reason=f"Rollback p95 is {p95_minutes}m. SRE target is &amp;lt; 5 minutes.",
            sre_owner=owner,
            evidence={"p95_minutes": p95_minutes}
        )

    return ParanoiaGateCheck(
        gate_id="SPG-001",
        gate_name="Rollback Readiness",
        result=GateResult.PASS,
        reason=f"Rollback tested {days_since} days ago, p95 {p95_minutes}m.",
        sre_owner=owner
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gate 2: In-flight traffic handling
def gate_in_flight_traffic(context: dict, owner: str) -&amp;gt; ParanoiaGateCheck:
    """
    Developers test what happens after deployment. SREs test what happens during.
    Connections in flight, transactions mid-way, cache state mismatches.
    """
    deploy = context.get("deployment", {})
    graceful_shutdown = deploy.get("graceful_shutdown_seconds", 0)
    drain_configured = deploy.get("connection_draining_enabled", False)
    schema_backward_compat = deploy.get("schema_backward_compatible", None)

    failures = []

    if graceful_shutdown &amp;lt; 30:
        failures.append(
            f"Graceful shutdown is {graceful_shutdown}s; "
            "recommend minimum 30s for in-flight requests to complete."
        )

    if not drain_configured:
        failures.append(
            "Connection draining not enabled: "
            "load balancer will drop in-flight requests on pod termination."
        )

    if schema_backward_compat is False:
        failures.append(
            "Schema change is NOT backward compatible: "
            "both old and new code will be running simultaneously during rollout. "
            "Migration plan required."
        )

    if failures:
        return ParanoiaGateCheck(
            gate_id="SPG-002",
            gate_name="In-Flight Traffic Safety",
            result=GateResult.FAIL,
            reason=" | ".join(failures),
            sre_owner=owner,
            evidence={"deployment_config": deploy}
        )

    return ParanoiaGateCheck(
        gate_id="SPG-002",
        gate_name="In-Flight Traffic Safety",
        result=GateResult.PASS,
        reason="Graceful shutdown, connection draining, schema compatibility verified.",
        sre_owner=owner
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gate 3: Degraded-dependency load testing
def gate_degraded_load_test(context: dict, owner: str) -&amp;gt; ParanoiaGateCheck:
    """
    Developers test at nominal load with all dependencies healthy.
    SREs test at peak load with one critical dependency degraded.
    The real question: what happens when the upstream rate-limits you
    during a retry storm you caused?
    """
    load_test = context.get("load_testing", {})
    tested_at_peak = load_test.get("tested_at_peak_load", False)
    tested_with_degraded_dep = load_test.get("tested_with_degraded_dependency", False)
    retry_storm_tested = load_test.get("retry_storm_simulation", False)

    if not tested_at_peak:
        return ParanoiaGateCheck(
            gate_id="SPG-003",
            gate_name="Degraded Load Testing",
            result=GateResult.FAIL,
            reason="No peak load test recorded. Staging at 10% traffic ≠ production at 100%.",
            sre_owner=owner
        )

    if not tested_with_degraded_dep:
        return ParanoiaGateCheck(
            gate_id="SPG-003",
            gate_name="Degraded Load Testing",
            result=GateResult.WARN,
            reason=(
                "Peak load tested but not with a degraded dependency. "
                "Add fault injection to your load test suite. "
                "This is a warning, but it's where most production incidents live."
            ),
            sre_owner=owner
        )

    return ParanoiaGateCheck(
        gate_id="SPG-003",
        gate_name="Degraded Load Testing",
        result=GateResult.PASS,
        reason="Peak load and degraded-dependency scenarios both tested.",
        sre_owner=owner,
        evidence={"load_test_config": load_test}
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gate 4: Blast radius survivability
def gate_blast_radius_survivability(context: dict, owner: str) -&amp;gt; ParanoiaGateCheck:
    """
    The SRE question: if this goes wrong at 2 AM with one engineer on-call,
    can they contain it alone? Or does it need three engineers and a war room?
    """
    blast = context.get("blast_radius", {})
    downstream_count = blast.get("downstream_service_count", 0)
    contains_payment_path = blast.get("contains_payment_path", False)
    single_engineer_containable = blast.get("single_engineer_containable", None)
    has_circuit_breaker = blast.get("circuit_breaker_configured", False)

    if not has_circuit_breaker and downstream_count &amp;gt; 3:
        return ParanoiaGateCheck(
            gate_id="SPG-004",
            gate_name="Blast Radius Survivability",
            result=GateResult.FAIL,
            reason=(
                f"Service has {downstream_count} downstream dependencies "
                "and no circuit breaker. "
                "Failure will cascade. One engineer cannot contain this at 2 AM."
            ),
            sre_owner=owner,
            evidence={"downstream_count": downstream_count}
        )

    if contains_payment_path and single_engineer_containable is False:
        return ParanoiaGateCheck(
            gate_id="SPG-004",
            gate_name="Blast Radius Survivability",
            result=GateResult.FAIL,
            reason=(
                "Change touches payment path and requires multiple engineers to contain. "
                "Require change window with full team available."
            ),
            sre_owner=owner
        )

    # This path is also reached without a circuit breaker when there are
    # 3 or fewer downstream services, so report the breaker state as evidence.
    return ParanoiaGateCheck(
        gate_id="SPG-004",
        gate_name="Blast Radius Survivability",
        result=GateResult.PASS,
        reason="Blast radius survivable by on-call rotation.",
        sre_owner=owner,
        evidence={"circuit_breaker_configured": has_circuit_breaker}
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gate 5: SLO error budget check
def gate_error_budget(context: dict, owner: str) -&amp;gt; ParanoiaGateCheck:
    """
    The question developers don't ask: how much error budget remains?
    If budget is &amp;lt; 20%, shipping anything non-trivial is a reliability bet
    the team hasn't explicitly made.
    """
    slo = context.get("slo", {})
    budget_remaining_pct = slo.get("error_budget_remaining_pct", 100.0)
    is_critical_change = context.get("is_critical_change", False)

    if budget_remaining_pct &amp;lt; 5.0:
        return ParanoiaGateCheck(
            gate_id="SPG-005",
            gate_name="SLO Error Budget",
            result=GateResult.FAIL,
            reason=(
                f"Error budget at {budget_remaining_pct:.1f}%: critically low. "
                "No changes permitted until budget recovers. "
                "This is not optional. The SLO contract with your users says so."
            ),
            sre_owner=owner,
            evidence={"budget_remaining_pct": budget_remaining_pct}
        )

    if budget_remaining_pct &amp;lt; 20.0 and is_critical_change:
        return ParanoiaGateCheck(
            gate_id="SPG-005",
            gate_name="SLO Error Budget",
            result=GateResult.FAIL,
            reason=(
                f"Error budget at {budget_remaining_pct:.1f}% and change is marked critical. "
                "Require SRE lead approval before proceeding."
            ),
            sre_owner=owner
        )

    if budget_remaining_pct &amp;lt; 20.0:
        return ParanoiaGateCheck(
            gate_id="SPG-005",
            gate_name="SLO Error Budget",
            result=GateResult.WARN,
            reason=(
                f"Error budget at {budget_remaining_pct:.1f}%. "
                "Below 20%: proceed with caution. "
                "SRE team should be aware this change is consuming headroom."
            ),
            sre_owner=owner
        )

    return ParanoiaGateCheck(
        gate_id="SPG-005",
        gate_name="SLO Error Budget",
        result=GateResult.PASS,
        reason=f"Error budget at {budget_remaining_pct:.1f}%, sufficient headroom.",
        sre_owner=owner
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
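&lt;p&gt;The fifth question asks for the burn-rate impact of a 1% error rate for 10 minutes. That arithmetic can be done ahead of time. The sketch below is not part of agentsre; the function names, the 30-day window, and the two SLO targets are assumptions chosen for illustration, and it assumes roughly constant traffic across the window.&lt;/p&gt;

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def budget_consumed_pct(error_rate, duration_minutes, slo_target, window_days=30):
    """Percent of the window's error budget consumed by one incident.

    Assumes roughly constant traffic across the window (a simplification).
    """
    window_minutes = window_days * 24 * 60
    consumed = burn_rate(error_rate, slo_target) * duration_minutes / window_minutes
    return consumed * 100.0


# A 1% error rate for 10 minutes against two hypothetical SLO targets.
for target in (0.999, 0.9999):
    rate = burn_rate(0.01, target)
    pct = budget_consumed_pct(0.01, 10, target)
    print(f"SLO {target:.2%}: burn rate {rate:.0f}x, {pct:.2f}% of 30-day budget")
```

&lt;p&gt;The same incident is ten times more expensive against a 99.99% target than against 99.9%, which is why the number is worth having before the incident, not during it.&lt;/p&gt;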

&lt;p&gt;Running the Full Gate Suite&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Usage in your CI/CD pipeline

runner = SREParanoiaGateRunner(
    service_name="payments-service",
    sre_owner="platform-sre-team"
)

# Register all five gates
runner.register(gate_rollback_readiness)
runner.register(gate_in_flight_traffic)
runner.register(gate_degraded_load_test)
runner.register(gate_blast_radius_survivability)
runner.register(gate_error_budget)

# Build context from deployment metadata
context = {
    "rollback": {
        "last_tested_at": "2026-06-01T09:00:00Z",
        "p95_duration_minutes": 3.5
    },
    "deployment": {
        "graceful_shutdown_seconds": 60,
        "connection_draining_enabled": True,
        "schema_backward_compatible": True
    },
    "load_testing": {
        "tested_at_peak_load": True,
        "tested_with_degraded_dependency": False,  # WARN
        "retry_storm_simulation": False
    },
    "blast_radius": {
        "downstream_service_count": 4,
        "contains_payment_path": True,
        "single_engineer_containable": True,
        "circuit_breaker_configured": True
    },
    "slo": {
        "error_budget_remaining_pct": 42.0
    },
    "is_critical_change": False
}

result = runner.run_all(context)
print(json.dumps(result, indent=2))

# In CI/CD: fail the pipeline if not approved
if not result["deployment_approved"]:
    raise SystemExit("SRE Paranoia Gates blocked deployment. See gate results above.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Why This Works Better Than Culture&lt;br&gt;
The teams winning on reliability in 2026 are not the ones with the most sophisticated AI stack. They are the ones that paired intelligent tooling with genuine engineering culture and did the hard work of changing how ownership flows, not just how alerts fire. (Source: Sherlocks AI)&lt;br&gt;
SRE Paranoia Gates are how ownership flows change in practice. The SRE doesn't have to be in every deployment review meeting. The SRE's questions are already in the pipeline. A developer shipping a change either answers them — by filling in the deployment context — or the gate blocks the deployment and the developer learns why those questions matter.&lt;br&gt;
That's shift left done correctly. Not "teach developers to think like SREs." Encode what SREs think into the system itself.&lt;br&gt;
The code is in agentsre/sre_paranoia_gates.py. MIT licensed.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;/p&gt;

&lt;p&gt;github.com/Ajay150313/agentsre | dev.to/ajaydevineni&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmetnnmm1dchj1ho24e2p.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fmetnnmm1dchj1ho24e2p.jpeg" alt=" " width="800" height="1055"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Your 99.99% Uptime SLO Is Probably a Lie — Here's How to Fix It</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Fri, 19 Jun 2026 03:09:38 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-9999-uptime-slo-is-probably-a-lie-heres-how-to-fix-it-2i6o</link>
      <guid>https://dev.to/ajaydevineni/your-9999-uptime-slo-is-probably-a-lie-heres-how-to-fix-it-2i6o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzi8g2epksk9mzzwhcn15.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzi8g2epksk9mzzwhcn15.jpeg" alt=" " width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've been in enough postmortems to know the pattern.&lt;/p&gt;

&lt;p&gt;The executive slide says 99.99% uptime. The on-call engineer who&lt;br&gt;
lived through the last three months knows it's closer to 99.7%.&lt;br&gt;
Neither number is wrong exactly. They're measuring completely&lt;br&gt;
different things — and that gap is where trust goes to die.&lt;/p&gt;

&lt;p&gt;This is not a theoretical problem. Let me show you the math,&lt;br&gt;
the code that catches it, and the cultural shift that fixes it.&lt;/p&gt;

&lt;p&gt;The math nobody does in public&lt;/p&gt;

&lt;p&gt;Four nines sounds impressive until you look at what it actually allows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.99% uptime over 365 days =
  Total minutes in a year:     525,600
  Allowed downtime (0.01%):       52.6 minutes per year
  Per month:                       4.4 minutes
  Per week:                        1.0 minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
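&lt;p&gt;Those figures can be reproduced in a few lines. This is a minimal sketch; the function name is an assumption, and a "month" is approximated as 365/12 days.&lt;/p&gt;

```python
def allowed_downtime_minutes(slo_target, period_days):
    """Minutes of downtime an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


# Periods are approximations: a "month" here is 365/12 days.
for label, days in (("year", 365), ("month", 365 / 12), ("week", 7)):
    minutes = allowed_downtime_minutes(0.9999, days)
    print(f"99.99% per {label}: {minutes:.1f} minutes")
```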

&lt;p&gt;Now count what your team actually experienced last year:&lt;/p&gt;

&lt;p&gt;The "elevated error rates" incident that lasted 40 minutes&lt;br&gt;
but got logged as "partial degradation, not full outage"&lt;br&gt;
The auth provider hiccup that dropped regional logins for 22 minutes&lt;br&gt;
but was excluded because "that's a third-party dependency"&lt;br&gt;
The Friday night patch that broke a non-critical service&lt;br&gt;
that everyone actually depends on — 90 minutes, excluded as&lt;br&gt;
"planned maintenance window"&lt;/p&gt;

&lt;p&gt;That's 152 minutes. You've already spent 2.9× your entire annual budget&lt;br&gt;
by March, and your dashboard still shows four nines.&lt;/p&gt;
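&lt;p&gt;The tally is simple enough to script. The sketch below uses the three incidents described above; the tuple layout and function name are assumptions for illustration.&lt;/p&gt;

```python
# The three incidents from the text: (description, minutes, reason excluded).
INCIDENTS = [
    ("elevated error rates", 40, "partial degradation"),
    ("auth provider hiccup", 22, "third-party dependency"),
    ("Friday night patch", 90, "planned maintenance window"),
]

ANNUAL_BUDGET_MINUTES = 525_600 * 0.0001  # 99.99% over 365 days


def budget_report(incidents, budget_minutes):
    """Total user-facing downtime and how many annual budgets it represents."""
    total = sum(minutes for _, minutes, _ in incidents)
    return total, total / budget_minutes


total, multiple = budget_report(INCIDENTS, ANNUAL_BUDGET_MINUTES)
print(f"Counted downtime: {total} min = {multiple:.1f}x the annual budget")
```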

&lt;p&gt;This is not fraud exactly. It's a measurement convention that happens&lt;br&gt;
to always favor the measurer.&lt;/p&gt;

&lt;p&gt;The three ways SLOs get gamed (usually unintentionally)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Excluding third-party dependencies&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"The CDN was down, not us" is technically true and practically useless&lt;br&gt;
to your customer who couldn't load your application.&lt;/p&gt;

&lt;p&gt;User-experienced availability includes everything in the request path.&lt;br&gt;
If your SLO excludes your auth provider, your DNS, and your CDN,&lt;br&gt;
you are measuring something your users have never experienced.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Averaging across regions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;"We maintained 99.99% globally" while US-East was down for 35 minutes&lt;br&gt;
is one of the most common forms of SLO theater in distributed systems.&lt;/p&gt;

&lt;p&gt;A customer in us-east-1 experienced a complete outage.&lt;br&gt;
Your global aggregate made it disappear.&lt;/p&gt;
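&lt;p&gt;A small calculation shows how the averaging hides the outage. This is an illustrative sketch only: the ten equally weighted regions, the region names, and the 30-day window are assumptions, with a single 35-minute outage in us-east-1.&lt;/p&gt;

```python
WINDOW_MINUTES = 30 * 24 * 60  # one 30-day window

# Hypothetical downtime per region, in minutes, for that window.
# Ten equally weighted regions is an assumption for illustration.
DOWNTIME = {f"region-{i}": 0 for i in range(9)}
DOWNTIME["us-east-1"] = 35


def availability(down_minutes, window_minutes=WINDOW_MINUTES):
    """Availability as a percentage of the window."""
    return 100.0 * (1.0 - down_minutes / window_minutes)


per_region = {name: availability(mins) for name, mins in DOWNTIME.items()}
global_avg = sum(per_region.values()) / len(per_region)
worst_name, worst = min(per_region.items(), key=lambda item: item[1])

print(f"Global average: {global_avg:.3f}%")
print(f"Worst region ({worst_name}): {worst:.3f}%")
```

&lt;p&gt;Reporting the worst region alongside the global figure, or instead of it, keeps the outage visible.&lt;/p&gt;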

&lt;ol start="3"&gt;
&lt;li&gt;The planned maintenance carve-out&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This one is the most honest-sounding and the most problematic.&lt;/p&gt;

&lt;p&gt;If you need a maintenance window to deploy safely,&lt;br&gt;
your deployment process is part of your reliability problem.&lt;br&gt;
Excluding it from your SLO means you never fix it.&lt;br&gt;
You just keep scheduling windows.&lt;/p&gt;

&lt;p&gt;What honest SLO tracking looks like in code&lt;/p&gt;

&lt;p&gt;Here's the implementation I use. It measures what users experience,&lt;br&gt;
not what the infrastructure team finds convenient to measure.&lt;/p&gt;

&lt;p&gt;python# honest_slo.py&lt;/p&gt;

&lt;h1&gt;
  
  
  MIT License — Ajay Devineni (github.com/Ajay150313)
&lt;/h1&gt;

&lt;p&gt;from dataclasses import dataclass, field&lt;br&gt;
from datetime import datetime, timedelta, timezone&lt;br&gt;
from enum import Enum&lt;br&gt;
from typing import Optional&lt;br&gt;
import json&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class ExclusionPolicy(Enum):
    """
    Controls whether incidents can be excluded from SLO calculation.

    STRICT:      Nothing excluded. User experience only.
    STANDARD:    Excludes pre-announced maintenance with customer notification,
                 plus incidents that carry a written excluded_reason.
    PERMISSIVE:  Excludes third-party, maintenance, and regional partial impact.
                 (This is how most teams get to four nines on paper.)
    """
    STRICT = "strict"
    STANDARD = "standard"
    PERMISSIVE = "permissive"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@dataclass
class Incident:
    """
    A single reliability incident, recorded at the moment it is detected.

    The 'excluded_reason' field is the audit trail.
    Every exclusion must be justified in writing at the time it happens,
    not retroactively cleaned up before the monthly review.
    """
    incident_id: str
    started_at: datetime             # must be timezone-aware (UTC)
    resolved_at: Optional[datetime]  # None while the incident is open
    severity: str                    # P1, P2, P3
    impact_regions: list[str]        # actual affected regions, not "global"
    impact_scope: str                # "full_outage" | "partial_degradation" | "elevated_errors"
    root_cause_category: str         # "internal" | "third_party" | "infrastructure"
    was_planned: bool = False
    customer_notified_before: bool = False  # for planned maintenance
    excluded_reason: Optional[str] = None   # must be filled at time of exclusion
    excluded_by: Optional[str] = None       # who approved the exclusion

    @property
    def duration_minutes(self) -&amp;gt; float:
        # Open incidents are measured up to the current time.
        if self.resolved_at is None:
            return (datetime.now(timezone.utc) - self.started_at).total_seconds() / 60
        return (self.resolved_at - self.started_at).total_seconds() / 60

    @property
    def is_open(self) -&amp;gt; bool:
        return self.resolved_at is None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@dataclass(frozen=True)
class SLOCalculator:
    """
    Calculates SLO compliance from a list of incidents.

    The key design decision: the exclusion policy is set at calculator
    construction and applied consistently. The dataclass is frozen, so
    you cannot change the policy after the fact to make your numbers
    look better.

    Usage:
        calc = SLOCalculator(
            target_pct=99.9,
            window_days=30,
            policy=ExclusionPolicy.STANDARD,
        )
        result = calc.calculate(incidents)
    """
    target_pct: float           # e.g. 99.9
    window_days: int            # rolling window
    policy: ExclusionPolicy = ExclusionPolicy.STANDARD
    service_regions: list[str] = field(default_factory=lambda: ["global"])

    @property
    def window_minutes(self) -&amp;gt; float:
        return self.window_days * 24 * 60

    @property
    def error_budget_minutes(self) -&amp;gt; float:
        return self.window_minutes * (1 - self.target_pct / 100)

    def _should_exclude(self, incident: Incident) -&amp;gt; tuple[bool, str]:
        """
        Returns (should_exclude, reason).

        The reason is written to the exclusion log whenever an incident
        is excluded, so every exclusion can be audited later.
        """
        if incident.excluded_reason:
            # Explicit manual exclusion: honored under STANDARD and PERMISSIVE
            if self.policy == ExclusionPolicy.STRICT:
                return False, "strict policy: no exclusions allowed"
            return True, incident.excluded_reason

        if self.policy == ExclusionPolicy.STRICT:
            return False, ""

        # STANDARD: only exclude pre-announced maintenance
        # with documented customer notification
        if (self.policy == ExclusionPolicy.STANDARD
                and incident.was_planned
                and incident.customer_notified_before):
            return True, "pre-announced maintenance with customer notification"

        if self.policy == ExclusionPolicy.PERMISSIVE:
            if incident.root_cause_category == "third_party":
                return True, "third-party dependency (permissive policy)"
            if incident.was_planned:
                return True, "planned maintenance (permissive policy)"
            # Partial regional impact excluded under permissive
            if (incident.impact_scope == "partial_degradation"
                    and len(incident.impact_regions) &amp;lt; len(self.service_regions)):
                return True, "partial regional impact (permissive policy)"

        return False, ""

    def calculate(self, incidents: list[Incident]) -&amp;gt; dict:
        """
        Returns a complete SLO report including:
        - Compliance under the configured policy
        - What it would be under STRICT policy (no exclusions)
        - The gap between them (the "honesty gap")
        - Full audit trail of all exclusions
        """
        # Only incidents that started inside the rolling window are counted.
        cutoff = datetime.now(timezone.utc) - timedelta(days=self.window_days)
        window_incidents = [
            i for i in incidents
            if i.started_at &amp;gt;= cutoff
        ]

        included_minutes = 0.0
        excluded_minutes = 0.0
        exclusion_log = []
        strict_downtime = 0.0

        for incident in window_incidents:
            duration = incident.duration_minutes
            strict_downtime += duration  # always count for strict calculation

            should_exclude, reason = self._should_exclude(incident)
            if should_exclude:
                excluded_minutes += duration
                exclusion_log.append({
                    "incident_id": incident.incident_id,
                    "duration_minutes": round(duration, 1),
                    "reason": reason,
                    "excluded_by": incident.excluded_by,
                })
            else:
                included_minutes += duration

        # Compliance under configured policy
        downtime_ratio = included_minutes / self.window_minutes
        achieved_pct = (1 - downtime_ratio) * 100
        if self.error_budget_minutes &amp;gt; 0:
            budget_consumed_pct = (included_minutes / self.error_budget_minutes) * 100
        else:
            # A 100% target leaves no budget: any downtime is an unbounded overrun.
            budget_consumed_pct = float("inf") if included_minutes &amp;gt; 0 else 0.0

        # What it would look like with zero exclusions (honest number)
        strict_ratio = strict_downtime / self.window_minutes
        strict_pct = (1 - strict_ratio) * 100
        honesty_gap = achieved_pct - strict_pct

        # Budget status
        budget_remaining = max(0.0, self.error_budget_minutes - included_minutes)
        is_breached = included_minutes &amp;gt; self.error_budget_minutes

        return {
            "slo_target_pct": self.target_pct,
            "window_days": self.window_days,
            "policy": self.policy.value,
            "achieved_pct": round(achieved_pct, 4),
            "strict_achieved_pct": round(strict_pct, 4),
            "honesty_gap_pct": round(honesty_gap, 4),
            "error_budget_minutes_total": round(self.error_budget_minutes, 1),
            "error_budget_minutes_consumed": round(included_minutes, 1),
            "error_budget_minutes_remaining": round(budget_remaining, 1),
            "error_budget_consumed_pct": round(budget_consumed_pct, 1),
            "is_breached": is_breached,
            "total_incidents": len(window_incidents),
            "excluded_incidents": len(exclusion_log),
            "excluded_minutes": round(excluded_minutes, 1),
            "exclusion_log": exclusion_log,
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@dataclass
class ErrorBudgetBurnTracker:
    """
    Tracks error budget consumption rate over time.

    The key insight: it's not just how much budget you've used,
    it's how fast you're burning it. A team that burns 80% of their
    budget in the first week of the month has a very different problem
    than a team that burns it steadily.

    Burn rate &amp;gt; 1.0 means you will exhaust the budget before
    the window closes at the current rate.
    """
    slo_target_pct: float
    window_days: int

    @property
    def _budget_minutes(self) -&amp;gt; float:
        return self.window_days * 24 * 60 * (1 - self.slo_target_pct / 100)

    def current_burn_rate(
        self,
        consumed_minutes: float,
        elapsed_days: float,
    ) -&amp;gt; float:
        """
        Burn rate of 1.0 = consuming budget at exactly the sustainable pace.
        Burn rate of 3.0 = exhausts the full budget in 1/3 of the window.
        Burn rate of 14.4 = exhausts 30-day budget in ~50 hours (page-worthy).
        """
        if elapsed_days &amp;lt;= 0:
            return 0.0
        elapsed_minutes = elapsed_days * 24 * 60
        actual_rate = consumed_minutes / elapsed_minutes
        sustainable_rate = self._budget_minutes / (self.window_days * 24 * 60)
        if sustainable_rate &amp;lt;= 0:
            return 0.0
        return actual_rate / sustainable_rate

    def time_to_exhaustion_hours(
        self,
        remaining_budget_minutes: float,
        current_burn_rate: float,
    ) -&amp;gt; Optional[float]:
        """
        At the current burn rate, how many hours until the budget is gone?
        Returns None if burn rate &amp;lt;= 1.0 (on track or ahead of pace)
        or if no budget remains.
        """
        if current_burn_rate &amp;lt;= 1.0 or remaining_budget_minutes &amp;lt;= 0:
            return None
        sustainable_minutes_per_hour = (
            self._budget_minutes / (self.window_days * 24)
        )
        actual_minutes_per_hour = sustainable_minutes_per_hour * current_burn_rate
        if actual_minutes_per_hour &amp;lt;= 0:
            return None
        return remaining_budget_minutes / actual_minutes_per_hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def format_report(report: dict) -&amp;gt; str:
    """Build a human-readable SLO compliance report and return it as a string."""
    breach_str = "🔴 BREACHED" if report["is_breached"] else "🟢 Within budget"
    gap_str = (
        f"⚠️  +{report['honesty_gap_pct']:.3f}% vs strict calculation"
        if report["honesty_gap_pct"] &amp;gt; 0.001
        else "✅ No exclusions applied"
    )

    lines = [
        f"\n{'═'*56}",
        f"  SLO COMPLIANCE REPORT ({report['window_days']}-day window)",
        f"{'═'*56}",
        f"  Policy:              {report['policy'].upper()}",
        f"  Target:              {report['slo_target_pct']}%",
        f"  Achieved:            {report['achieved_pct']}%  {breach_str}",
        f"  Strict (no excl.):   {report['strict_achieved_pct']}%  {gap_str}",
        f"{'─'*56}",
        f"  Error budget total:  {report['error_budget_minutes_total']} min",
        f"  Consumed:            {report['error_budget_minutes_consumed']} min  "
        f"({report['error_budget_consumed_pct']}%)",
        f"  Remaining:           {report['error_budget_minutes_remaining']} min",
        f"{'─'*56}",
        f"  Incidents (window):  {report['total_incidents']}",
        f"  Excluded:            {report['excluded_incidents']} "
        f"({report['excluded_minutes']} min)",
    ]

    if report["exclusion_log"]:
        lines.append("\n  Exclusion audit trail:")
        for excl in report["exclusion_log"]:
            lines.append(
                f"    • {excl['incident_id']}: {excl['duration_minutes']} min - "
                f"{excl['reason']}"
            )
    lines.append(f"{'═'*56}\n")
    return "\n".join(lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# ── Example usage ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
    now = datetime.now(timezone.utc)

    incidents = [
        Incident(
            incident_id="INC-2026-001",
            started_at=now - timedelta(days=25, hours=2),
            resolved_at=now - timedelta(days=25, hours=1, minutes=20),
            severity="P1",
            impact_regions=["us-east-1"],
            impact_scope="full_outage",
            root_cause_category="internal",
        ),
        Incident(
            incident_id="INC-2026-002",
            started_at=now - timedelta(days=18),
            resolved_at=now - timedelta(days=17, hours=23, minutes=37),
            severity="P2",
            impact_regions=["us-east-1", "eu-west-1"],
            impact_scope="elevated_errors",
            root_cause_category="third_party",
            excluded_reason="Auth provider outage - not our infrastructure",
            excluded_by="oncall-lead@company.com",
        ),
        Incident(
            incident_id="INC-2026-003",
            started_at=now - timedelta(days=5, hours=22),
            resolved_at=now - timedelta(days=5, hours=20, minutes=30),
            severity="P2",
            impact_regions=["us-east-1"],
            impact_scope="partial_degradation",
            root_cause_category="internal",
            was_planned=True,
            customer_notified_before=True,
        ),
    ]

    print("\n--- PERMISSIVE (how most teams report) ---")
    calc_permissive = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.PERMISSIVE,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_permissive.calculate(incidents)))

    print("--- STANDARD (honest, defensible) ---")
    calc_standard = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.STANDARD,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_standard.calculate(incidents)))

    print("--- STRICT (pure user experience) ---")
    calc_strict = SLOCalculator(
        target_pct=99.9,
        window_days=30,
        policy=ExclusionPolicy.STRICT,
        service_regions=["us-east-1", "eu-west-1"],
    )
    print(format_report(calc_strict.calculate(incidents)))

    # Burn rate check
    tracker = ErrorBudgetBurnTracker(slo_target_pct=99.9, window_days=30)
    burn = tracker.current_burn_rate(consumed_minutes=85.0, elapsed_days=15)
    hours_left = tracker.time_to_exhaustion_hours(
        remaining_budget_minutes=max(0, 43.2 - 85.0), current_burn_rate=burn
    )
    print(f"  Current burn rate: {burn:.2f}×")
    if hours_left is not None:
        print(f"  Budget exhaustion: {hours_left:.1f} hours at current rate")
    else:
        print("  Budget already exhausted (consumed 85.0 of 43.2 min budget)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>aiops</category>
      <category>sre</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS DevOps Agent Is GA And the Hardest Problem Isn't the Agent. It's What Happens to Your Team Six Months Later.</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 16 Jun 2026 03:20:51 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/aws-devops-agent-is-ga-and-the-hardest-problem-isnt-the-agent-its-what-happens-to-your-team-six-3218</link>
      <guid>https://dev.to/ajaydevineni/aws-devops-agent-is-ga-and-the-hardest-problem-isnt-the-agent-its-what-happens-to-your-team-six-3218</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw5ffoqk3x69ngea574w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsw5ffoqk3x69ngea574w.jpeg" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;br&gt;
AWS DevOps Agent went generally available this week. It's a frontier agent: autonomous, massively scalable, designed to work for hours or days without constant intervention. It analyzes data across monitoring tools, reviews recent deployments, and coordinates incident response (source: incident.io).&lt;br&gt;
I've been building the SRE governance framework for exactly this class of agent for five months. Eighteen posts, an open-source library, an arXiv paper, a DynamoDB-backed ARO registry, a Pre-Action SRE Gate, an EvalPipeline that runs nightly DQR checks.&lt;br&gt;
All of that is the technical governance layer. This post is about the human governance layer — the one that fails quietly and shows up in your postmortem eighteen months after you deployed the agent.&lt;br&gt;
The Pattern&lt;br&gt;
The agent failed confidently, without signaling uncertainty, and the humans around it had gradually stopped watching. We have decades of research on automation complacency in aviation and industrial control systems. The failure mode is well understood: when automation performs reliably for an extended period, human operators reduce active monitoring. When the automation eventually fails, and it always does, the operators have lost the situational awareness to catch the failure quickly (source: Yahoo Finance).&lt;br&gt;
This is not a hypothetical for AI SRE agents. It's the natural trajectory of any reliable automation deployed into an on-call rotation.&lt;br&gt;
Week one: engineers review every agent decision. They read the reasoning. They validate the RCA. They check whether the remediation matched what they would have done.&lt;br&gt;
Week four: engineers review decisions that look unusual. The routine ones get a glance and a thumbs up.&lt;br&gt;
Week twelve: engineers check whether the incident resolved. If it did, they move on. The agent's reasoning is no longer being read.&lt;br&gt;
The agent's HER looks healthy. DQR is above threshold. Incidents are resolving faster. Every metric in the governance stack is green.&lt;br&gt;
What isn't being measured: whether any human has verified that the agent's reasoning was correct, or whether the team has simply started accepting the outcome as a proxy for correctness.&lt;br&gt;
Introducing ACR — Automation Complacency Rate&lt;br&gt;
ACR is the fraction of agent decisions accepted by the team without active human verification of the reasoning, measured on a rolling window.&lt;br&gt;
Formally:&lt;br&gt;
ACR(A, t) = decisions_accepted_without_verification(A, t) &lt;br&gt;
            / total_agent_decisions(A, t)&lt;br&gt;
Where "accepted without verification" means the human noted the outcome (incident resolved, action taken) but did not read the decision trace, validate the RCA, or challenge the agent's reasoning.&lt;br&gt;
This distinction matters because outcome acceptance and reasoning verification are two completely different signals. An agent can produce the right outcome for the wrong reason, and if your team only checks outcomes, you will not detect the reasoning drift until a novel incident exposes it.&lt;br&gt;
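&lt;/p&gt;

&lt;p&gt;The formula above can be sketched in a few lines. This is an illustration only: the &lt;code&gt;Decision&lt;/code&gt; record and the &lt;code&gt;acr&lt;/code&gt; function are hypothetical names, not the agentsre API, and the caller is assumed to have already filtered decisions to the rolling window.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Decision:
    decision_id: str
    outcome_accepted: bool     # the team noted the outcome
    reasoning_verified: bool   # a human read and validated the reasoning


def acr(decisions: list[Decision]) -> float:
    """Fraction of decisions accepted without reasoning verification."""
    if not decisions:
        return 0.0
    unverified = sum(
        1 for d in decisions
        if d.outcome_accepted and not d.reasoning_verified
    )
    return unverified / len(decisions)


window = [
    Decision("d-1", outcome_accepted=True, reasoning_verified=False),
    Decision("d-2", outcome_accepted=True, reasoning_verified=True),
    Decision("d-3", outcome_accepted=True, reasoning_verified=False),
    Decision("d-4", outcome_accepted=True, reasoning_verified=False),
]
print(f"ACR: {acr(window):.0%}")
```

&lt;p&gt;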
Why ACR Rising Is Not Good News&lt;br&gt;
The instinctive interpretation of rising ACR is that the agent has earned trust. The team reviewed everything early on, found the agent reliable, and appropriately reduced their verification overhead.&lt;br&gt;
That interpretation is sometimes correct. It is also the interpretation that precedes most automation complacency incidents in aviation and industrial control systems.&lt;br&gt;
The correct interpretation of rising ACR depends on what's happening to your other SLIs in the same window.&lt;br&gt;
If ACR rises while DQR is stable and RTD is low — the agent may genuinely have earned reduced oversight for that task class. Narrow the blast radius, document the trust extension, and set a review date.&lt;br&gt;
If ACR rises while DQR shows any downward trend — the team has reduced oversight while agent quality is slipping. This is the danger zone. The two signals moving in opposite directions is your automation complacency signature.&lt;br&gt;
If ACR rises after HER drops — the agent is escalating less AND the team is reviewing less. This is the double-exposure pattern. Maximum undetected risk surface.&lt;br&gt;
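&lt;/p&gt;

&lt;p&gt;The three readings above can be written as a small decision function. This is a sketch under assumptions: the inputs are pre-computed trend flags, the label strings are invented here, and the order of the checks (HER drop first, then DQR decline) is one possible precedence rather than a rule stated in this series.&lt;/p&gt;

```python
def interpret_rising_acr(dqr_trend: str, rtd_is_low: bool, her_dropped: bool) -> str:
    """Classify a rising ACR using the other SLIs from the same window.

    dqr_trend is "stable", "down", or "up" (hypothetical encoding).
    """
    if her_dropped:
        # Agent escalates less AND the team reviews less.
        return "double_exposure"
    if dqr_trend == "down":
        # Oversight is shrinking while decision quality slips.
        return "complacency_signature"
    if dqr_trend == "stable" and rtd_is_low:
        # Reduced oversight may be earned: document it and set a review date.
        return "trust_possibly_earned"
    return "needs_review"


print(interpret_rising_acr("stable", rtd_is_low=True, her_dropped=False))
print(interpret_rising_acr("down", rtd_is_low=True, her_dropped=False))
print(interpret_rising_acr("stable", rtd_is_low=True, her_dropped=True))
```

&lt;p&gt;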
Three ACR Signals Worth Tracking&lt;br&gt;
ACR trend — weekly rolling average, 30-day window. A rising trend over any 4-week period warrants a review meeting, not an alert. This is a cultural signal, not an infrastructure signal.&lt;br&gt;
ACR by severity — P1 and P2 incidents should have near-zero ACR. If your team is accepting agent decisions on P1s without verification, that is your risk exposure quantified. Set a hard target: P1 ACR &amp;lt; 5%.&lt;br&gt;
ACR after HER drop — when HER drops meaningfully (agent escalating less often), check ACR in the same window. If both move in the same direction — agent acts more, team reviews less — that combination needs governance intervention before the next novel failure mode.&lt;br&gt;
What Verification Actually Means&lt;br&gt;
Verification is not approval. It doesn't mean the engineer had to intervene or override the agent. It means a human read the agent's reasoning trace, validated that the RCA was correct, and confirmed that the remediation matched what an experienced engineer would have chosen.&lt;br&gt;
This takes three to five minutes for a well-structured RTD trace. It's not a burden if the trace is readable. It is a burden if the trace requires excavating through unstructured logs — which is why Layer 3 observability (the RTD module from Post 11 of this series) exists. You cannot verify reasoning you cannot read.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agentsre.automation_complacency import ACRTracker, VerificationRecord

tracker = ACRTracker(agent_id="devops-agent-v1", task_class="incident-investigation")

# After each agent decision, log whether a human verified the reasoning
tracker.record(VerificationRecord(
    decision_id="inc-2026-0615-001",
    outcome_accepted=True,          # team noted incident resolved
    reasoning_verified=False,       # nobody read the RTD trace
    severity="P2",
    verifier=None
))

status = tracker.acr_status()
if status["p2_acr_pct"] &amp;gt; 20.0:
    # P2 incidents being accepted without reasoning review
    # This needs a team conversation, not an automated alert
    notify_sre_lead(status)  # not defined in this snippet; supply your own notifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;
The AWS DevOps Agent Specific Context&lt;br&gt;
AWS DevOps Agent works for hours or days without constant intervention. When production incidents occur, it analyzes data across multiple monitoring tools, reviews recent deployments, and coordinates response teams (source: incident.io).&lt;br&gt;
An agent that works for hours without intervention is an agent where ACR will naturally rise. The longer the autonomous window, the more decisions accumulate before any human checkpoint.&lt;br&gt;
This is not a criticism of the agent. It's a governance design requirement. Long-running autonomous agents need structured human verification checkpoints built into the workflow — not just at the end when the incident resolves, but at decision points during the investigation.&lt;br&gt;
The Pre-Action Gate from Post 13 is one checkpoint. The ACR tracker is the measurement layer that tells you whether those checkpoints are actually being used as verification moments or just passed through as administrative steps.&lt;br&gt;
The Postmortem Question&lt;br&gt;
Add this field to every postmortem that involves an autonomous agent action:&lt;br&gt;
"For each significant agent decision during this incident: was the reasoning verified by a human before the next action was taken, or was it accepted based on outcome?"&lt;br&gt;
If the answer is consistently "accepted based on outcome" — your team's ACR is above zero and you don't have a measurement layer for it. That's the gap this post addresses.&lt;br&gt;
The agentsre/automation_complacency.py module is now in the GitHub repo. MIT licensed, zero external dependencies for core logic.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;/p&gt;

&lt;p&gt;github.com/Ajay150313/agentsre | dev.to/ajaydevineni&lt;/p&gt;

</description>
      <category>aws</category>
      <category>agents</category>
      <category>ai</category>
      <category>sre</category>
    </item>
    <item>
      <title>Google Published Their AI SRE Blueprint. Here's the Line-by-Line Mapping to What the Community Has Been Building</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 09 Jun 2026 01:26:43 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/google-published-their-ai-sre-blueprint-heres-the-line-by-line-mapping-to-what-the-community-has-4ff</link>
      <guid>https://dev.to/ajaydevineni/google-published-their-ai-sre-blueprint-heres-the-line-by-line-mapping-to-what-the-community-has-4ff</guid>
      <description>&lt;p&gt;Google published a white paper on May 28 that every SRE should read.&lt;br&gt;
It details how they're architecting a new foundation for reliability with three core components: AI Operator (autonomous mitigation agents), Actus (strict execution guardrails), and IRM Analyzer (continuous evaluation pipelines grounded in human operational memory). The goal: safely govern high-velocity agentic software development at Google's scale (source: Rootly).&lt;br&gt;
I've been building toward the same architecture from the ground up for a couple of months, not inside Google, but as an independent practitioner trying to solve the same problem for teams who don't have Google's infrastructure or runway.&lt;br&gt;
Reading the whitepaper, I found that every component Google named maps directly to something already in the agentsre library or this series. This post maps them side by side.&lt;br&gt;
Google's Actus → Pre-Action SRE Gate&lt;br&gt;
Actus is Google's physical execution control plane for safe autonomous mitigation. It bounds what an agent can do in production with strict policy enforcement before any action executes (source: Rootly).&lt;br&gt;
That's exactly what the Pre-Action SRE Gate does. Three checks before any autonomous action: error budget remaining (does the system have headroom?), AQDD state (can humans course-correct if this goes wrong?), and HER trend (is this agent already outside its reliable envelope?). If any check fails — agent escalates, does not act.&lt;br&gt;
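&lt;/p&gt;

&lt;p&gt;As a rough sketch of the three checks, assuming each signal has already been reduced to a number or a flag upstream. The &lt;code&gt;GateInputs&lt;/code&gt; shape, the function name, and the 20% budget threshold are assumptions for illustration, not the published gate implementation.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    error_budget_remaining_pct: float  # share of the SLO error budget still unspent
    aqdd_ok: bool                      # AQDD check: humans can still course-correct
    her_trend_ok: bool                 # HER trend: agent is inside its reliable envelope


def pre_action_gate(inputs: GateInputs, min_budget_pct: float = 20.0) -> tuple[bool, list[str]]:
    """Return (may_act, failed_checks). Any failed check means escalate."""
    failed = []
    if min_budget_pct > inputs.error_budget_remaining_pct:
        failed.append("error_budget")
    if not inputs.aqdd_ok:
        failed.append("aqdd")
    if not inputs.her_trend_ok:
        failed.append("her_trend")
    return (not failed, failed)


may_act, failed_checks = pre_action_gate(GateInputs(12.0, aqdd_ok=True, her_trend_ok=True))
print("act" if may_act else f"escalate: {failed_checks}")
```

&lt;p&gt;Returning the list of failed checks, rather than a bare boolean, gives the escalation message something concrete to say.&lt;/p&gt;

&lt;p&gt;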
Google built Actus at the infrastructure level for internal systems. The Pre-Action Gate is the same pattern implemented with Lambda, CloudWatch, and DynamoDB, so any AWS team can deploy it this week.&lt;br&gt;
Google's IRM Analyzer → DQR + RTD&lt;br&gt;
IRM Analyzer is Google's continuous evaluation pipeline that captures human operational memory and runs nightly evaluations to prove agent readiness before deployment and during operation (source: Rootly).&lt;br&gt;
Two metrics from this series do the same work:&lt;br&gt;
DQR (Decision Quality Rate) — is the agent's output correct? Measured continuously, not just at deployment.&lt;br&gt;
RTD (Reasoning Trace Depth) — is the agent's reasoning stable? Re-planning cycles per task. Rises before DQR falls.&lt;br&gt;
Google runs nightly evals against a corpus of human-validated incidents. For teams without that corpus, DQR and RTD measured in 30-day shadow mode are the approximation that's achievable without Google's internal incident database.&lt;br&gt;
Google's AI Operator → The agent that needs ARO&lt;br&gt;
Google SRE has AI agents that continuously monitor and improve playbooks and production documentation based on their usage during incidents. AI agents can also generate new playbooks from incidents (source: Nova AI Ops).&lt;br&gt;
This is AI Operator in action. And it's exactly the class of agent that needs Agent Reliability Ownership (ARO) registration — a named owner, a defined blast radius, and an escalation path — before it starts writing to production documentation.&lt;br&gt;
An agent that can modify runbooks is an agent that can corrupt the guidance every human SRE relies on during an incident. Blast radius definition isn't optional for that class of agent. It's the most important governance artifact you have.&lt;br&gt;
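&lt;/p&gt;

&lt;p&gt;One way to picture an ARO registration is as an immutable record with the three fields named above. The class name, field names, and example values below are hypothetical and do not reflect the DynamoDB-backed registry's actual schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRegistration:
    agent_id: str
    owner: str                 # a named human or team
    escalation_path: str       # where the agent hands off when it stops
    writable_targets: frozenset[str] = field(default_factory=frozenset)  # blast radius


def is_within_blast_radius(reg: AgentRegistration, target: str) -> bool:
    """True only if the target was explicitly listed at registration time."""
    return target in reg.writable_targets


runbook_agent = AgentRegistration(
    agent_id="runbook-writer",
    owner="sre-docs-oncall",
    escalation_path="#sre-escalations",
    writable_targets=frozenset({"docs/runbooks/staging"}),
)
print(is_within_blast_radius(runbook_agent, "docs/runbooks/production"))
```

&lt;p&gt;The default is an empty blast radius, so an agent registered without explicit targets can write nowhere.&lt;/p&gt;

&lt;p&gt;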
The gap Google doesn't address — fleet governance&lt;br&gt;
Google's whitepaper covers individual agent governance well. What it doesn't cover — because at Google's scale it's a different problem — is fleet-level governance for teams where engineers are deploying their own agent workflows alongside platform-deployed agents.&lt;br&gt;
That's the Agent Sprawl problem from Post 6. The Sprawl Registry and Postmortem Readiness Rate (PRR) from Post 12 address the fleet-level governance gap that Google's architecture assumes away.&lt;br&gt;
What this means for your team&lt;br&gt;
AI SRE technology is arriving faster than the trust frameworks needed to deploy it safely (source: Sherlocks AI).&lt;br&gt;
Google just published the trust framework for their environment. The agentsre library is the open-source implementation of the same framework for everyone else.&lt;br&gt;
The three components that matter most to implement first, in order:&lt;br&gt;
Start with Pre-Action Gate (Actus equivalent) — because an ungated agent is a liability before it's an asset.&lt;br&gt;
Add DQR + RTD monitoring (IRM Analyzer equivalent) — because you can't evaluate what you don't measure.&lt;br&gt;
Register every agent in ARO + Sprawl Registry (AI Operator governance) — because you can't own what you haven't named.&lt;br&gt;
The whitepaper is at sre.google. The library is at github.com/Ajay150313/agentsre.&lt;br&gt;
What component is your team missing most right now?&lt;br&gt;
Ajay Devineni | AWS Community Builder | IEEE Senior Member | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Evaluate Any AI SRE Tool A Practitioner's Framework Built From 15 Posts of Production SLIs</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Thu, 04 Jun 2026 01:17:39 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/how-to-evaluate-any-ai-sre-tool-a-practitioners-framework-built-from-15-posts-of-production-slis-32ml</link>
      <guid>https://dev.to/ajaydevineni/how-to-evaluate-any-ai-sre-tool-a-practitioners-framework-built-from-15-posts-of-production-slis-32ml</guid>
      <description>&lt;p&gt;Title: How to Evaluate Any AI SRE Tool — A Practitioner's Framework Built From 15 Posts of Production SLIs&lt;br&gt;
Your manager just forwarded you a Gartner report. Analyst recognition of the AI SRE category, sustained on-call pressure, immature trust and governance frameworks, and the need for orchestration rather than disconnected agent experiments all arrived together in 2026. The question landing in every SRE team's backlog right now is: should we buy something, build something, or wait? Sherlocks&lt;br&gt;
I've spent four months building the measurement layer for AI agents from scratch — DQR, TIE, HER, AQDD, RTD, CUR, Pre-Action Gate, Semantic Gap detection. Fifteen posts, an open-source library, and a growing arXiv paper. This post is where that work becomes a vendor evaluation framework.&lt;br&gt;
Every claim in this framework maps to a metric I've already defined. You can verify these against any tool — commercial or open-source.&lt;br&gt;
The Problem With Vendor Benchmarks&lt;br&gt;
Datadog's Bits AI SRE decreases time to resolution by up to 95%. New Relic's users resolved incidents 25% faster than those without AI features. Both numbers are published. Both are real in the environments they measured (sources: Nova AI Ops, InfoQ).&lt;br&gt;
The question is whether those environments match yours. A 95% MTTR improvement measured on a system with clean telemetry, well-structured runbooks, and narrow incident categories is a different number than what you'll see in a system with fragmented observability, complex dependency graphs, and novel failure modes.&lt;br&gt;
Vendor benchmarks measure the tool in optimal conditions. Your evaluation needs to measure the tool in your conditions. These five questions give you the framework.&lt;br&gt;
Question 1: Does it instrument the reasoning layer?&lt;br&gt;
The semantic gap — the space between what an agent intended and what it executed — is invisible to infrastructure APM. I wrote about this last week using Sherlocks.ai's research: existing tools observe high-level intent or low-level actions, not the correlation between them.&lt;br&gt;
Ask any vendor: do you track re-planning cycles per task? Can I see how many times the agent changed its approach before completing or escalating? Can I query that history after an incident?&lt;br&gt;
If the answer is "we log prompts and tool calls," that's Layer 1 observability. Useful, necessary, insufficient. You need Layer 3 — one structured record per agent task showing the full decision sequence.&lt;br&gt;
What to look for in a demo: ask them to show you a failed task trace. Does it show you the sequence of re-planning decisions, or just the final outcome and the spans?&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  Question 2: What is the Human Escalation Rate in their benchmark?
&lt;/h2&gt;

&lt;p&gt;
HER — the fraction of agent decisions that escalated to human judgment — is the most honest single metric for how autonomous a tool actually is. A low MTTR number paired with a high HER means humans were doing most of the resolution work, faster because the agent assembled context for them. That's valuable. It's not the same as autonomous remediation.&lt;br&gt;
Ask: in your benchmark environment, what percentage of incidents did the agent resolve without human action? What percentage required human approval before execution? What triggered escalation most often?&lt;br&gt;
These questions reveal whether the tool is an autonomous remediator or a very good assistant. Both are legitimate. Only one of them matches the vendor's headline claim.&lt;br&gt;
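&lt;/p&gt;

&lt;p&gt;If a vendor will share per-incident benchmark data, the split those questions ask about can be computed directly. The following is a minimal sketch, not part of agentsre or any vendor export: the Incident record and its field names are hypothetical, and HER is computed per incident rather than per decision to keep the example short.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """One benchmark incident (hypothetical fields, for illustration only)."""
    incident_id: str
    escalated_to_human: bool   # a human made the resolving decision
    required_approval: bool    # a human approved before the agent executed


def autonomy_split(incidents: List[Incident]) -> dict:
    """Return HER plus the approval-required and autonomous shares as fractions."""
    total = len(incidents)
    if total == 0:
        raise ValueError("no incidents to evaluate")
    escalated = sum(1 for i in incidents if i.escalated_to_human)
    # Count approval-gated incidents only when they were not also escalated.
    approved = sum(
        1 for i in incidents if i.required_approval and not i.escalated_to_human
    )
    autonomous = total - escalated - approved
    return {
        "human_escalation_rate": escalated / total,
        "approval_required_rate": approved / total,
        "autonomous_rate": autonomous / total,
    }


sample = [
    Incident("inc-1", escalated_to_human=True, required_approval=False),
    Incident("inc-2", escalated_to_human=False, required_approval=True),
    Incident("inc-3", escalated_to_human=False, required_approval=False),
    Incident("inc-4", escalated_to_human=True, required_approval=False),
]
split = autonomy_split(sample)
print(split)
```

&lt;p&gt;In this made-up sample only one incident in four was resolved without a human, which is the number to set beside any headline MTTR claim.&lt;/p&gt;

&lt;p&gt;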
&lt;/p&gt;

&lt;h2&gt;
  Question 3: Does it check SLO state before acting?
&lt;/h2&gt;

&lt;p&gt;
An agent that remediates without checking your current error budget can compound a degraded situation. I formalized this in the Pre-Action SRE Gate (Post 13): three checks before any autonomous action — error budget remaining, AQDD state, and the agent's own HER trend.&lt;br&gt;
Ask any vendor: does your agent check SLO error budget before executing a remediation? What happens if the error budget is critically low — does it act anyway or escalate? Can I configure the pre-action gate thresholds?&lt;br&gt;
A tool that doesn't have an answer to this question is not safe for production systems where the error budget is already burning.&lt;br&gt;
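&lt;/p&gt;

&lt;p&gt;To make the three checks concrete, here is a rough sketch of a pre-action gate. It is an illustration under stated assumptions, not the agentsre implementation: the class names, the boolean AQDD signal, and both default thresholds are hypothetical placeholders you would replace with your own values.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class GateDecision(Enum):
    ACT = "act"
    ESCALATE = "escalate"


@dataclass
class GateInputs:
    error_budget_remaining: float  # fraction of the SLO error budget left, 0.0 to 1.0
    aqdd_within_bounds: bool       # assumed boolean health signal for AQDD
    her_trend: float               # change in HER over the lookback window (positive = rising)


@dataclass
class GateThresholds:
    min_error_budget: float = 0.25  # illustrative default, not a recommendation
    max_her_trend: float = 0.05     # illustrative default, not a recommendation


def pre_action_gate(inputs: GateInputs, thresholds: GateThresholds):
    """Run the three checks and return (decision, list of escalation reasons)."""
    reasons = []
    if thresholds.min_error_budget > inputs.error_budget_remaining:
        reasons.append("error budget below threshold")
    if not inputs.aqdd_within_bounds:
        reasons.append("AQDD outside bounds")
    if inputs.her_trend > thresholds.max_her_trend:
        reasons.append("HER trending upward")
    decision = GateDecision.ESCALATE if reasons else GateDecision.ACT
    return decision, reasons


decision, reasons = pre_action_gate(
    GateInputs(error_budget_remaining=0.10, aqdd_within_bounds=True, her_trend=0.01),
    GateThresholds(),
)
print(decision.value, reasons)
```

&lt;p&gt;Returning the reasons alongside the decision means each gate result can be written to an audit log as-is.&lt;/p&gt;

&lt;p&gt;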
&lt;/p&gt;

&lt;h2&gt;
  Question 4: What is the defined blast radius per agent?
&lt;/h2&gt;

&lt;p&gt;Komodor's Klaudia is trained specifically on pod crashes, failed rollouts, autoscaler friction, misconfigurations, and cascading failures in Kubernetes environments. That specificity is its blast radius. 95% accuracy in that domain does not mean 95% accuracy outside it. (Source: Yisusvii)&lt;br&gt;
Every AI SRE tool has an implicit blast radius — the set of systems and failure modes it was trained and tested on. Good tools make this explicit. Ask: what systems can this agent modify autonomously? What systems are write-locked? What failure categories is the accuracy claim based on?&lt;br&gt;
If the vendor can't give you a concrete blast radius definition, the accuracy number is a marketing claim. If they can, you can evaluate whether that blast radius covers your actual failure distribution.&lt;br&gt;
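&lt;/p&gt;

&lt;p&gt;One way to run that evaluation is to measure how much of your own incident history falls inside the vendor's stated blast radius. The sketch below assumes you can label past incidents by category; the category names are hypothetical and would come from your own incident tracker.&lt;/p&gt;

```python
from collections import Counter
from typing import Iterable, Set


def blast_radius_coverage(incident_categories: Iterable[str], covered: Set[str]) -> float:
    """Fraction of past incidents whose category is inside the stated blast radius."""
    counts = Counter(incident_categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no incident history supplied")
    in_scope = sum(n for category, n in counts.items() if category in covered)
    return in_scope / total


# Hypothetical labels; use whatever taxonomy your incident tracker already has.
vendor_blast_radius = {"pod_crash", "failed_rollout", "autoscaler", "misconfiguration"}
last_quarter = [
    "pod_crash", "failed_rollout", "dns", "pod_crash",
    "database_failover", "misconfiguration", "dns", "third_party_api",
]
coverage = blast_radius_coverage(last_quarter, vendor_blast_radius)
print(f"{coverage:.0%} of last quarter's incidents fall inside the stated blast radius")
```

&lt;p&gt;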
&lt;/p&gt;

&lt;h2&gt;
  Question 5: What is the ownership model when it's wrong?
&lt;/h2&gt;

&lt;p&gt;
This is the question vendors like least. When the agent makes a bad remediation decision and compounds the incident, who is accountable? The vendor's SLA covers service availability, not the operational consequences of an agent action.&lt;br&gt;
In your environment, the answer should map to your ARO (Agent Reliability Ownership) registration — a named human owner, a defined escalation path, and an audit log of every gate check the agent ran before acting.&lt;br&gt;
Ask any vendor: does your tool generate an audit log of agent decision reasoning before each action? Is that log queryable during incident review? Who owns the agent's behavior in my environment?&lt;br&gt;
If the audit log doesn't exist, you cannot write a complete postmortem after an agent-involved incident. That's the accountability gap that makes autonomous agents unsafe in regulated production environments.&lt;br&gt;
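&lt;/p&gt;

&lt;p&gt;As a rough picture of what a queryable decision audit log might contain, here is a small JSON-lines sketch. The field names, agent IDs, and owner names are invented for illustration, and an in-memory buffer stands in for a real log store.&lt;/p&gt;

```python
import io
import json
from datetime import datetime, timezone
from typing import Dict, List, TextIO


def write_audit_entry(log: TextIO, agent_id: str, owner: str, action: str,
                      gate_checks: Dict[str, bool], reasoning: str) -> None:
    """Append one JSON line describing the decision before the action runs."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "owner": owner,              # the named human accountable for this agent
        "action": action,
        "gate_checks": gate_checks,  # result of every check run before acting
        "reasoning": reasoning,
    }
    log.write(json.dumps(entry) + "\n")


def entries_for_agent(log: TextIO, agent_id: str) -> List[dict]:
    """Read the log back and return every entry for one agent, oldest first."""
    log.seek(0)
    return [e for e in map(json.loads, log) if e["agent_id"] == agent_id]


log = io.StringIO()  # stands in for a file or log stream
write_audit_entry(log, "remediator-1", "jane.doe", "restart deployment/checkout",
                  {"error_budget_ok": True, "her_trend_ok": True},
                  "crash loop after config change; restart matches runbook step 2")
write_audit_entry(log, "triage-2", "sam.lee", "page on-call",
                  {"error_budget_ok": False}, "error budget exhausted; escalating")
history = entries_for_agent(log, "remediator-1")
print(len(history), history[0]["action"])
```

&lt;p&gt;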
&lt;/p&gt;

&lt;h2&gt;
  The Build vs Buy Decision Matrix
&lt;/h2&gt;

&lt;p&gt;
Given these five questions, here's how I'd frame the build-vs-buy decision:&lt;br&gt;
Buy if: Your failure distribution maps closely to the tool's blast radius, you don't need custom SLIs beyond what the vendor provides, and the vendor can answer all five questions with specifics.&lt;br&gt;
Build if: Your failure distribution is broad or novel, you need custom SLIs (DQR, RTD, HER, AQDD are all absent from commercial tools today), or you need to satisfy regulatory requirements that mandate audit trails the vendor doesn't generate.&lt;br&gt;
Hybrid (most realistic): Buy the investigation layer — vendor tools are genuinely good at assembling incident context faster than humans. Build the governance layer — Pre-Action Gates, ARO registration, Semantic Gap detection, Sprawl Registry. The agentsre library is designed for exactly this hybrid.&lt;br&gt;
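&lt;/p&gt;

&lt;p&gt;The matrix can be written down as a small decision function. This is one possible reading of the criteria above, not a definitive rule: the input names are hypothetical, the 0.8 coverage bar is an arbitrary placeholder, and anything that is neither a clear buy nor a clear build falls through to hybrid.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class EvaluationInputs:
    blast_radius_coverage: float      # share of your incidents inside the tool's blast radius
    needs_custom_slis: bool           # for example DQR, RTD, HER, AQDD
    needs_mandated_audit_trail: bool  # a regulatory requirement the vendor cannot meet
    vendor_answered_all_five: bool    # specific answers to all five questions


def build_or_buy(inputs: EvaluationInputs, coverage_bar: float = 0.8) -> str:
    """Map the criteria to 'buy', 'build', or 'hybrid' (illustrative logic only)."""
    governance_gap = inputs.needs_custom_slis or inputs.needs_mandated_audit_trail
    well_covered = inputs.blast_radius_coverage >= coverage_bar
    if well_covered and inputs.vendor_answered_all_five and not governance_gap:
        return "buy"
    if not well_covered and governance_gap:
        return "build"
    return "hybrid"


print(build_or_buy(EvaluationInputs(0.9, False, False, True)))
print(build_or_buy(EvaluationInputs(0.4, True, True, False)))
print(build_or_buy(EvaluationInputs(0.9, True, False, True)))
```

&lt;p&gt;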
&lt;/p&gt;

&lt;h2&gt;
  The Evaluation Scorecard
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agentsre/tool_evaluation.py

import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class ToolEvaluationScore:
    """
    Five-question evaluation scorecard for AI SRE tooling.

    Use this to evaluate commercial tools or internal builds
    against the SLI framework from the agentsre series.

    Score each criterion 0 (no), 1 (partial), 2 (yes).
    Each question has two criteria, so the maximum total is 20.
    Total score &amp;gt;= 8: consider for production.
    Total score 5-7: pilot with governance layer built separately.
    Total score &amp;lt; 5: not production-ready for autonomous operation.
    """
    tool_name: str
    evaluator: str
    environment_context: str  # Brief description of your stack

    # Q1: Reasoning layer instrumentation
    tracks_replanning_cycles: int = 0    # 0/1/2
    can_query_decision_sequence: int = 0
    q1_notes: str = ""

    # Q2: HER transparency
    her_in_benchmark_disclosed: int = 0
    autonomous_vs_assisted_split_disclosed: int = 0
    q2_notes: str = ""

    # Q3: Pre-action SLO gate
    checks_error_budget_before_acting: int = 0
    gate_thresholds_configurable: int = 0
    q3_notes: str = ""

    # Q4: Blast radius definition
    blast_radius_explicit: int = 0
    accuracy_claim_scoped_to_blast_radius: int = 0
    q4_notes: str = ""

    # Q5: Ownership and audit
    generates_decision_audit_log: int = 0
    audit_log_queryable_postmortem: int = 0
    q5_notes: str = ""

    # Timestamp is captured in UTC when the scorecard is created
    evaluated_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def total_score(self) -&amp;gt; int:
        # Sum of all ten criteria (two per question)
        return (
            self.tracks_replanning_cycles +
            self.can_query_decision_sequence +
            self.her_in_benchmark_disclosed +
            self.autonomous_vs_assisted_split_disclosed +
            self.checks_error_budget_before_acting +
            self.gate_thresholds_configurable +
            self.blast_radius_explicit +
            self.accuracy_claim_scoped_to_blast_radius +
            self.generates_decision_audit_log +
            self.audit_log_queryable_postmortem
        )

    @property
    def recommendation(self) -&amp;gt; str:
        if self.total_score &amp;gt;= 8:
            return "CONSIDER: meets production governance bar"
        elif self.total_score &amp;gt;= 5:
            return "PILOT: build governance layer separately before production"
        else:
            return "NOT READY: missing critical governance capabilities"

    def to_report(self) -&amp;gt; Dict:
        return {
            "tool": self.tool_name,
            "evaluator": self.evaluator,
            "environment": self.environment_context,
            "scores": {
                "q1_reasoning_layer": {
                    "tracks_replanning": self.tracks_replanning_cycles,
                    "queryable_decision_sequence": self.can_query_decision_sequence,
                    "notes": self.q1_notes
                },
                "q2_her_transparency": {
                    "her_disclosed": self.her_in_benchmark_disclosed,
                    "autonomous_split_disclosed": self.autonomous_vs_assisted_split_disclosed,
                    "notes": self.q2_notes
                },
                "q3_pre_action_gate": {
                    "checks_error_budget": self.checks_error_budget_before_acting,
                    "configurable_thresholds": self.gate_thresholds_configurable,
                    "notes": self.q3_notes
                },
                "q4_blast_radius": {
                    "explicit_definition": self.blast_radius_explicit,
                    "accuracy_scoped": self.accuracy_claim_scoped_to_blast_radius,
                    "notes": self.q4_notes
                },
                "q5_audit_ownership": {
                    "audit_log_generated": self.generates_decision_audit_log,
                    "queryable_in_postmortem": self.audit_log_queryable_postmortem,
                    "notes": self.q5_notes
                }
            },
            "total_score": f"{self.total_score}/20",
            "recommendation": self.recommendation,
            "evaluated_at": self.evaluated_at
        }

    def to_json(self) -&amp;gt; str:
        return json.dumps(self.to_report(), indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  Where This Fits in the Arc
&lt;/h2&gt;

&lt;p&gt;
Posts 1–14 built the measurement framework: SLIs for agent output quality, control plane reliability, reasoning observability, context management, ownership governance, semantic gap detection.&lt;br&gt;
Post 15 is the practical payoff — you now have a five-question framework, grounded in production SLIs, to evaluate any AI SRE tool your manager asks you to assess. Whether the answer is buy, build, or hybrid, the framework gives you a defensible, technically grounded recommendation.&lt;br&gt;
The ToolEvaluationScore dataclass is in agentsre/tool_evaluation.py. Use it to document your evaluation and generate a report you can share with your team.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre | dev.to/ajaydevineni&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650n1vy2sob2h1wdqhlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F650n1vy2sob2h1wdqhlb.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>The AI Pilot-to-Production Gap Is an SRE Problem, and We Already Know How to Close It</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Wed, 03 Jun 2026 02:01:55 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-ai-pilot-to-production-gap-is-an-sre-problem-and-we-already-know-how-to-close-it-50el</link>
      <guid>https://dev.to/ajaydevineni/the-ai-pilot-to-production-gap-is-an-sre-problem-and-we-already-know-how-to-close-it-50el</guid>
      <description>&lt;p&gt;A startup raised $50M this week to help companies move AI out of stalled pilots and into production. Investors called it "the defining gap of 2026." Salesforce published that "getting agents to run reliably in production" is the common thread behind every significant AI engineering breakthrough this year.&lt;/p&gt;

&lt;p&gt;Both are right about the problem. Neither named the solution.&lt;/p&gt;

&lt;p&gt;The AI pilot-to-production gap is not a new kind of problem. It is a very old kind of problem wearing a new coat. The SRE discipline has been closing this exact gap — for distributed systems, for microservices, for Kubernetes — for two decades. The tools exist. The frameworks are documented. What's missing is the organizational willingness to apply them to AI before the first production incident instead of after.&lt;/p&gt;

&lt;p&gt;This article is about what that actually looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Pilots Feel Production-Ready (And Aren't)
&lt;/h2&gt;

&lt;p&gt;An AI agent demo in a sandbox is a controlled environment. The data is clean. The tools respond predictably. The task volume is low. The team running the demo knows the system well enough to guide it toward success.&lt;/p&gt;

&lt;p&gt;Production is different in every way that matters:&lt;/p&gt;

&lt;p&gt;Real data has edge cases the sandbox never saw. Tools fail, return ambiguous responses, or change their APIs. Task volume spikes at the worst possible time. The team running the system during an incident at 2am is not the team that built the demo.&lt;/p&gt;

&lt;p&gt;The gap between those two environments is not an AI problem. It is a reliability engineering problem. And it has a well-known set of solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Things Most Teams Skip
&lt;/h2&gt;

&lt;p&gt;After studying numerous production AI agent deployments across regulated industries, I've identified three reliability discipline components that are almost universally absent when a pilot fails to survive contact with production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. An SLO defined before go-live
&lt;/h3&gt;

&lt;p&gt;The single most common failure mode in AI pilot-to-production transitions is deploying without defined success criteria.&lt;/p&gt;

&lt;p&gt;What does reliable operation look like for this agent? What is the acceptable escalation rate? The acceptable decision quality drift? The acceptable tool invocation efficiency?&lt;/p&gt;

&lt;p&gt;These are the agent's SLIs. Without defining them before deployment, there is no way to know whether the agent is performing within acceptable bounds — until a user reports a problem.&lt;/p&gt;

&lt;p&gt;In traditional SRE practice, you don't ship a service without an SLO. The agent is a service. The same rule applies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentsre&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskRecord&lt;/span&gt;

&lt;span class="c1"&gt;# Define these BEFORE go-live, not after the first incident
&lt;/span&gt;&lt;span class="n"&gt;SLO_TARGETS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decision_quality_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# DQR: % decisions within behavioral bounds
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_invocation_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# TIE: max drift from baseline (multiplier)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_escalation_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# HER: % tasks requiring human intervention
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentSLICollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# After each task:
&lt;/span&gt;&lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TaskRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actual_tool_calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;decision_confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_confidence_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;required_escalation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;task_needed_human&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Check breach against pre-defined SLO
&lt;/span&gt;&lt;span class="n"&gt;breaches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;breached&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-routing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;breaches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;breaches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;alert_oncall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alert_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. A named owner assigned before go-live
&lt;/h3&gt;

&lt;p&gt;"The AI team owns it" is not an ownership model. It is a responsibility diffusion pattern. When an AI agent degrades at 2am, "the AI team" does not have a pager.&lt;/p&gt;

&lt;p&gt;Before any AI agent goes to production, one named person must be assigned as the agent's Service Reliability Owner. That person:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receives the page when the agent's SLO breaches&lt;/li&gt;
&lt;li&gt;Owns the runbook for known failure modes&lt;/li&gt;
&lt;li&gt;Reviews the agent's SLI report weekly&lt;/li&gt;
&lt;li&gt;Approves any change to the agent's autonomous permission scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the same accountability model that applies to every production microservice. The agent is not exempt because it's AI. The agent is not exempt because it's new. The exception is never justified in SRE practice, and it shouldn't be here.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. A runbook written before go-live
&lt;/h3&gt;

&lt;p&gt;A runbook for an AI agent does not need to be long. It needs to answer four questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection:&lt;/strong&gt; Which metric tells you the agent is degrading? (Answer: whichever of DQR, TIE, HER, or AQDD breaches first — not latency or error rate, which won't surface semantic failures)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attribution:&lt;/strong&gt; How do you determine whether the degradation is the agent's behavior, the tools it's calling, or a code change in the agent's environment? (Answer: compare against pre-deployment behavioral baselines)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment:&lt;/strong&gt; What is the fastest path to reducing blast radius while you investigate? (Answer: the progressive autonomy constraint ladder — reduce permissions level by level, don't binary-kill the agent)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery:&lt;/strong&gt; What does returning to normal operation look like, and how do you know you're there? (Answer: SLI metrics returning to within 10% of pre-incident baselines for 30 consecutive minutes)&lt;/p&gt;

&lt;p&gt;Two hours to write. Six hours saved on the first incident.&lt;/p&gt;
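
&lt;p&gt;The recovery criterion in the runbook can be turned into a check your tooling runs for you. Here is a minimal sketch, assuming one SLI sample per minute with the oldest sample first; the function name and the sample values are illustrative, not part of agentsre.&lt;/p&gt;

```python
from typing import Sequence


def has_recovered(samples: Sequence[float], baseline: float,
                  tolerance: float = 0.10, required_minutes: int = 30) -> bool:
    """True if the most recent `required_minutes` samples all sit within tolerance of baseline.

    Assumes one sample per minute, oldest first.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    if required_minutes > len(samples):
        return False  # not enough history yet to declare recovery
    window = samples[-required_minutes:]
    return all(tolerance >= abs(s - baseline) / abs(baseline) for s in window)


baseline_dqr = 90.0
still_degraded = [70.0] * 10 + [88.0] * 29  # only 29 healthy minutes so far
recovered = [70.0] * 10 + [88.0] * 30       # 30 consecutive healthy minutes
print(has_recovered(still_degraded, baseline_dqr), has_recovered(recovered, baseline_dqr))
```

&lt;p&gt;Requiring the whole window to be healthy, rather than the average, keeps a single good reading from ending the incident early.&lt;/p&gt;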




&lt;h2&gt;
  
  
  What the $50M Is Actually Buying
&lt;/h2&gt;

&lt;p&gt;The startup that raised $50M to close the pilot-to-production gap is selling tooling that helps teams implement governance, monitoring, and reliability structures for AI deployments.&lt;/p&gt;

&lt;p&gt;The governance, monitoring, and reliability structures themselves are not new. They are SRE. They are documented. They are open-source.&lt;/p&gt;

&lt;p&gt;What the money buys is the product layer that makes it easier for teams without SRE expertise to apply them. That's a legitimate service. But for teams with SRE expertise, the foundations are already there.&lt;/p&gt;

&lt;p&gt;Instrument your agent's behavioral SLIs. Define targets before deployment. Assign a named owner. Write the runbook. Run a tabletop exercise for your top two failure scenarios before go-live.&lt;/p&gt;

&lt;p&gt;That is the pilot-to-production gap, closed. Not with $50M. With process.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern That Keeps Repeating
&lt;/h2&gt;

&lt;p&gt;The SRE community has seen this pattern before.&lt;/p&gt;

&lt;p&gt;Microservices: teams deployed distributed services without SLOs or ownership models. Incidents happened. The SRE discipline developed the governance layer and production stabilized.&lt;/p&gt;

&lt;p&gt;Kubernetes: teams deployed container orchestration without runbooks or blast radius models. Incidents happened. The SRE discipline developed the governance layer and production stabilized.&lt;/p&gt;

&lt;p&gt;AI agents: teams are deploying autonomous systems without SLOs, owners, or runbooks. Incidents are happening. The SRE discipline has the governance layer ready.&lt;/p&gt;

&lt;p&gt;The question is whether teams apply it before or after the incidents.&lt;/p&gt;

&lt;p&gt;Salesforce is right that the biggest 2026 AI engineering breakthroughs revolve around production reliability. Every one of those breakthroughs will, on inspection, be a form of SRE discipline applied to a new layer of the stack.&lt;/p&gt;

&lt;p&gt;It was always this. It is this now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Before your next AI agent goes to production, answer these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are this agent's SLIs and SLO targets?&lt;/li&gt;
&lt;li&gt;Who is the named owner whose pager fires when the SLO breaches?&lt;/li&gt;
&lt;li&gt;What does the runbook say for the top two failure modes?&lt;/li&gt;
&lt;li&gt;What is the blast radius if the agent makes a wrong autonomous decision?&lt;/li&gt;
&lt;li&gt;Have you run a tabletop exercise for the 2am incident scenario?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any answer is "we haven't figured that out yet" — the agent is not production-ready. It is demo-ready.&lt;/p&gt;

&lt;p&gt;Open-source SLI framework: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LinkedIn discussion: &lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7467310392701198336-ckGw/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-reliability-ugcPost-7467310392701198336-ckGw/?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the one reliability discipline component most teams skip when moving AI agents to production — in your experience?&lt;/p&gt;

</description>
      <category>sre</category>
      <category>aws</category>
      <category>devops</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>The Semantic Gap: Why Your APM Sees the Agent But Misses the Decision, and What RTD Does About It</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 02 Jun 2026 13:22:52 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-semantic-gap-why-your-apm-sees-the-agent-but-misses-the-decision-and-what-rtd-does-about-it-3lc7</link>
      <guid>https://dev.to/ajaydevineni/the-semantic-gap-why-your-apm-sees-the-agent-but-misses-the-decision-and-what-rtd-does-about-it-3lc7</guid>
      <description>&lt;p&gt;Sherlocks.ai published something yesterday that names a problem precisely.&lt;br&gt;
The core problem: traditional APM was built for synchronous request-response. Agents break that model entirely, and most observability platforms are stitching together legacy APM rather than observing agents as a distinct thing. If your observability stack cannot correlate an agent's intended action with what actually happened at the system level, you are flying blind through the exact moments when cost and risk concentrate. (Source: Sherlocks.ai)&lt;br&gt;
They call it the semantic gap. I've been building toward this from a different direction across this series, starting with RTD (Reasoning Trace Depth) in Post 11 and the Pre-Action SRE Gate in Post 13. This post is where those frameworks connect to the industry's emerging framing.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  What the Semantic Gap Actually Is
&lt;/h2&gt;

&lt;p&gt;Existing tools observe an agent's high-level intent (prompts, tool selections) or its low-level actions (system calls, API hits, latency). They do not correlate both views. You can see the LLM prompt and you can see the system call, but you cannot see whether the agent intended that exact action or reasoned its way to something unexpected. When failure happens, this gap becomes your investigation crater. (Source: Sherlocks.ai)&lt;br&gt;
The gap lives in the decision sequence — what happened between the prompt and the system call. Every re-plan, every tool evaluation, every "this result doesn't match what I expected so I'll try differently" — all of that is invisible to APM because APM instruments execution, not reasoning.&lt;br&gt;
According to Sherlocks.ai, five percent of AI model requests fail in production today, and roughly sixty percent of those failures are capacity-related, not model errors. That means the majority of production failures aren't the model doing something wrong. They're the infrastructure around the model (tool availability, API response times, token budget, context state) creating conditions the agent can't navigate cleanly. And your observability stack is optimized to catch model errors.&lt;br&gt;
You're instrumented for the minority failure mode.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  How RTD Closes the Semantic Gap
&lt;/h2&gt;

&lt;p&gt;
Reasoning Trace Depth is a single structured log entry per agent task — not per tool call. It captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What the agent planned to do initially&lt;/li&gt;
&lt;li&gt;Every re-plan event: why it happened, which tool triggered it, and what the new plan was&lt;/li&gt;
&lt;li&gt;How many cycles ran before completion or escalation&lt;/li&gt;
&lt;li&gt;Whether HER fired at the end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That record is the intent-to-action correlation layer. It sits above your OTel spans (low-level execution) and below your business metrics (outcome). It's the semantic layer that connects "agent received this task" to "here's exactly how the decision sequence played out."&lt;br&gt;
Without RTD, your investigation after a production failure looks like this: agent ran, spans look clean, outcome was bad, no idea what the agent decided between the tool calls.&lt;br&gt;
With RTD, it looks like this: agent re-planned 4 times, tool 3 returned stale data on every attempt, HER fired at re-plan 5, here is the full decision sequence with timestamps.&lt;br&gt;
That second version is a postmortem. The first is a guess.&lt;br&gt;
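&lt;/p&gt;

&lt;p&gt;To show what one record per task could look like, here is a minimal sketch of a reasoning trace as a dataclass. The class and field names are my shorthand for this example and may not match the agentsre library; the tool name and plan strings are invented.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReplanEvent:
    cycle: int
    reason: str
    triggering_tool: str
    new_plan: str


@dataclass
class ReasoningTrace:
    """One record per agent task, not per tool call."""
    task_id: str
    initial_plan: str
    replans: List[ReplanEvent] = field(default_factory=list)
    escalated: bool = False  # whether HER fired at the end

    @property
    def depth(self) -> int:
        """Number of re-plan cycles before completion or escalation."""
        return len(self.replans)

    def to_log_line(self) -> str:
        """Serialize the whole decision sequence as a single JSON log entry."""
        record = asdict(self)
        record["depth"] = self.depth
        return json.dumps(record)


trace = ReasoningTrace(task_id="task-42", initial_plan="restart the failing pod")
for cycle in range(1, 5):
    trace.replans.append(ReplanEvent(
        cycle=cycle,
        reason="tool returned stale data",
        triggering_tool="metrics_query",
        new_plan=f"retry with a fresh query (attempt {cycle + 1})",
    ))
trace.escalated = True
print(trace.depth)
print(trace.to_log_line())
```

&lt;p&gt;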
&lt;/p&gt;

&lt;h2&gt;
  What the Market Is Getting Right and Missing
&lt;/h2&gt;

&lt;p&gt;Fifteen tools actively compete on agent observability in 2026, most built on OpenTelemetry standards. The critical test for any of them: does it handle reasoning loops as a first-class concern? Can you see the decision tree (prompt, tool choice, outcome, next decision) as a continuous trace? Does it distinguish between a tool failure and an agent misunderstanding? Does it alert on semantic drift, where agent behavior changes but metrics look normal? (Source: Sherlocks.ai)&lt;br&gt;
Those are the right questions. Most tools fail at least two of them because they were designed as APM add-ons, not as reasoning-native observability.&lt;br&gt;
The practical implication: even if you adopt a good commercial agent observability tool, you still need the reasoning trace layer. Commercial tools give you the infrastructure view. RTD gives you the decision view. You need both.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  The Three-Layer Stack, Restated
&lt;/h2&gt;

&lt;p&gt;
I've been building this framing across the series. The Sherlocks piece clarifies why it matters:&lt;br&gt;
Layer 1 — Infrastructure (APM, OTel, CloudWatch)&lt;br&gt;
What executed. Tool call latency, error rates, span data. Answers: did the tools work? Misses: did the agent reason correctly?&lt;br&gt;
Layer 2 — Control Plane (RAR, RSI, DCS from Post 7)&lt;br&gt;
How the orchestration behaved. Routing accuracy, retry patterns, task decomposition. Answers: did the workflow hold up? Misses: what was the agent deciding inside each task?&lt;br&gt;
Layer 3 — Reasoning (RTD from Post 11)&lt;br&gt;
What the agent decided. Re-plan count, tool sequence, decision rationale, HER correlation. Answers: did the reasoning hold up? This is the semantic gap layer.&lt;br&gt;
If you are buying observability tooling, demand explicit agent loop tracking. Ask for examples. Do not accept "we can log prompts" as an answer. (Source: Sherlocks.ai)&lt;br&gt;
Logging prompts is Layer 1. You need Layer 3.&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  The Postmortem Template Addition
&lt;/h2&gt;

&lt;p&gt;
Every postmortem for an agent-involved incident should now have a section that didn't exist before: Semantic Gap Analysis.&lt;br&gt;
Three fields:&lt;br&gt;
Intent vs. outcome delta — what did the agent plan to do vs. what did it actually do? If these match, the reasoning held. If they diverge, you have a semantic gap event.&lt;br&gt;
Re-plan sequence — RTD value, re-plan reasons, which tools triggered each re-plan. This is where you find the actual root cause in most agent failures.&lt;br&gt;
HER correlation — did HER spike during this task? At which re-plan decision? That's the moment the agent recognized it was outside its reliable envelope.&lt;br&gt;
Without these three fields, your postmortem explains what broke. It can't explain why the agent did what it did before the break.&lt;br&gt;
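&lt;/p&gt;

&lt;p&gt;The first field, the intent vs. outcome delta, can be filled in mechanically if you have the planned and executed action sequences. The sketch below is one simple way to do it; it assumes both sequences are available as ordered lists of action names, and the action names shown are hypothetical.&lt;/p&gt;

```python
from typing import List


def intent_outcome_delta(planned: List[str], executed: List[str]) -> dict:
    """Compare planned and executed action sequences step by step."""
    diverged_at = None
    for index, (want, got) in enumerate(zip(planned, executed)):
        if want != got:
            diverged_at = index
            break
    # Same prefix but different lengths also counts as a divergence.
    if diverged_at is None and len(planned) != len(executed):
        diverged_at = min(len(planned), len(executed))
    return {
        "semantic_gap_event": diverged_at is not None,
        "diverged_at_step": diverged_at,
        "unplanned_actions": [a for a in executed if a not in planned],
    }


planned = ["query_metrics", "restart_pod", "verify_health"]
executed = ["query_metrics", "scale_deployment", "restart_pod", "verify_health"]
delta = intent_outcome_delta(planned, executed)
print(delta)
```

&lt;p&gt;A divergence step gives the reviewer a place to start reading the re-plan sequence instead of scanning the whole trace.&lt;/p&gt;

&lt;p&gt;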
&lt;/p&gt;

&lt;h2&gt;
  Where This Fits in the Arc
&lt;/h2&gt;

&lt;p&gt;
Post 4: SLOs for agents (DQR, TIE, HER, AQDD) — what to measure.&lt;br&gt;
Post 7: Control plane SLIs (RAR, RSI, DCS) — where Layer 2 lives.&lt;br&gt;
Post 11: RTD — the Layer 3 reasoning primitive.&lt;br&gt;
Post 13: Pre-Action Gate — using SLIs as authorization signals.&lt;br&gt;
Post 14: The semantic gap — why all three layers are necessary and what happens without Layer 3.&lt;br&gt;
The industry is arriving at this independently. The frameworks were already here.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;/p&gt;

</description>
      <category>sre</category>
      <category>agentaichallenge</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 26 May 2026 17:34:07 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-agent-acts-without-checking-your-error-budget-thats-the-failure-mode-nobody-is-tracking-29n0</link>
      <guid>https://dev.to/ajaydevineni/your-agent-acts-without-checking-your-error-budget-thats-the-failure-mode-nobody-is-tracking-29n0</guid>
      <description>&lt;p&gt;Yesterday a piece came out that framed something I've been watching build across production environments for months.&lt;br&gt;
There is a category of production incident that engineering teams are not tracking yet, because it doesn't fit any existing postmortem template. The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. By the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure. (Source: Kore.ai)&lt;br&gt;
That argument happens because the two disciplines — SRE and autonomous agents — have never been formally connected at the decision layer.&lt;br&gt;
Here's the connection I want to make explicit.&lt;br&gt;
What Chaos Engineering Gets Right&lt;br&gt;
Mature chaos engineering programs have a property that's easy to overlook because it's invisible when it's working. Before a human engineer initiates any experiment — a fault injection, a latency spike, a dependency kill — they make a judgment call: does this system have capacity to absorb a perturbation right now?&lt;br&gt;
They check error budget burn rate. They look at whether upstream dependencies are stable. They assess whether the on-call team has bandwidth to respond if something goes wrong. They check whether there's a deploy in flight that makes this a bad time.&lt;br&gt;
That judgment call is informal, often intuitive, and sometimes wrong. But it exists. It's the human-in-the-loop that decides whether the system is in a state to safely absorb autonomous action.&lt;br&gt;
Agents don't make that call. They evaluate their task context, form a plan, and execute. The question "is right now a safe time for this action given the current reliability state of the system?" is not in their decision loop.&lt;br&gt;
The agents delivering production value in 2026 share one defining property: bounded scope. The agent handles one domain, with a defined tool set, and explicitly refuses tasks outside that boundary. The boundary is what makes autonomous deployment safe. (Source: GlobeNewswire)&lt;br&gt;
Boundary on task scope is necessary. It's not sufficient. You also need a boundary on timing — a gate that checks whether the system's current reliability state can absorb what the agent is about to do.&lt;br&gt;
The Pre-Action SRE Gate&lt;br&gt;
I want to introduce a concrete pattern here: the Pre-Action SRE Gate — a check an agent runs against your existing SRE signals before executing any state-changing action.&lt;br&gt;
The gate has three checks, all using metrics I've built out across this series:&lt;br&gt;
Check 1 — Error Budget Headroom&lt;br&gt;
Before acting, the agent queries current SLO error budget remaining for the services in its blast radius. If error budget is below threshold — the system is already burning faster than acceptable — the agent does not act autonomously. It escalates.&lt;br&gt;
This is the chaos engineering judgment call, formalized as a programmatic check.&lt;br&gt;
Check 2 — AQDD State&lt;br&gt;
Approval Queue Depth Drift tells you whether the human oversight layer is already backed up. If AQDD is elevated — meaning humans can't process approvals fast enough — autonomous action during that window means any mistake won't be caught in time. Agent holds.&lt;br&gt;
Check 3 — HER Trend&lt;br&gt;
If the agent's own Human Escalation Rate has been elevated in the recent window, it's operating outside its reliable envelope. Letting it take autonomous action in that state compounds the risk. Agent escalates.&lt;br&gt;
None of these metrics are new. They're from Post 4 and Post 10 of this series. What's new is using them as gates before action, not just as observability signals after the fact.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agentsre/pre_action_gate.py

from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timezone


@dataclass
class SREGateResult:
    """
    Result of a Pre-Action SRE Gate check.

    If approved is False, the agent must not proceed with
    autonomous action. Escalate to the human owner per the ARO record.

    Attributes:
        approved: Whether autonomous action is cleared
        blocking_check: Which check blocked (if any)
        error_budget_pct: Current error budget remaining (0-100)
        aqdd_depth: Current approval queue depth
        her_trend: Recent HER rate (0-100)
        recommendation: What the agent should do
        checked_at: Timestamp of gate check
    """
    approved: bool
    blocking_check: Optional[str]
    error_budget_pct: float
    aqdd_depth: int
    her_trend: float
    recommendation: str
    checked_at: str


class PreActionSREGate:
    """
    Pre-Action SRE Gate: checks your SRE signal state before
    an agent executes any autonomous write, remediation, scale
    event, or config change.

    This is the chaos engineering judgment call, formalized.
    A human engineer checks these things before running an experiment.
    Your agent should check them before acting autonomously.

    Thresholds should be calibrated per agent and task class
    in shadow mode, using the same protocol as HER and RTD baselines.
    """

    def __init__(self,
                 error_budget_min_pct: float = 20.0,
                 aqdd_max_depth: int = 3,
                 her_max_trend_pct: float = 15.0):
        """
        Args:
            error_budget_min_pct: Minimum error budget % required
                for autonomous action. Below this = escalate.
                Default 20%: the agent should not consume budget
                that's already critically low.
            aqdd_max_depth: Max approval queue depth before
                autonomous action is blocked. Above this,
                humans can't course-correct fast enough.
            her_max_trend_pct: Max recent HER rate before
                autonomous action is blocked. Elevated HER
                means agent is already outside reliable envelope.
        """
        self.error_budget_min_pct = error_budget_min_pct
        self.aqdd_max_depth = aqdd_max_depth
        self.her_max_trend_pct = her_max_trend_pct

    def check(self,
              agent_id: str,
              intended_action: str,
              error_budget_pct: float,
              aqdd_depth: int,
              her_trend_pct: float) -&amp;gt; SREGateResult:
        """
        Run pre-action SRE gate check.

        Call this before any autonomous state-changing action.
        If result.approved is False, escalate and do not act.

        Args:
            agent_id: Agent requesting action clearance
            intended_action: Description of what agent plans to do
            error_budget_pct: Current error budget remaining (0-100)
            aqdd_depth: Current approval queue depth
            her_trend_pct: Agent's recent HER rate (0-100)

        Returns:
            SREGateResult with approval decision and reasoning
        """
        # Check 1: Error budget headroom
        if error_budget_pct &amp;lt; self.error_budget_min_pct:
            return SREGateResult(
                approved=False,
                blocking_check="error_budget",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Error budget at {error_budget_pct:.1f}%, "
                    f"below {self.error_budget_min_pct}% minimum. "
                    "Escalate to human owner. Do not act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 2: Approval queue state
        if aqdd_depth &amp;gt; self.aqdd_max_depth:
            return SREGateResult(
                approved=False,
                blocking_check="aqdd",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Approval queue depth {aqdd_depth} exceeds "
                    f"maximum {self.aqdd_max_depth}. "
                    "Human oversight is backed up, so autonomous action "
                    "cannot be safely course-corrected. Hold."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 3: Agent's own HER trend
        if her_trend_pct &amp;gt; self.her_max_trend_pct:
            return SREGateResult(
                approved=False,
                blocking_check="her_trend",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"HER at {her_trend_pct:.1f}%, "
                    f"above {self.her_max_trend_pct}% threshold. "
                    "Agent is operating outside reliable envelope. "
                    "Escalate rather than act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # All checks passed
        return SREGateResult(
            approved=True,
            blocking_check=None,
            error_budget_pct=error_budget_pct,
            aqdd_depth=aqdd_depth,
            her_trend=her_trend_pct,
            recommendation="Autonomous action cleared. Proceed within blast radius.",
            checked_at=datetime.now(timezone.utc).isoformat()
        )

    def to_audit_log(self, agent_id: str,
                     intended_action: str,
                     result: SREGateResult) -&amp;gt; dict:
        """
        Structured audit log entry for every gate check.
        Every autonomous action attempt (approved or blocked)
        should be logged. This is your agent action audit trail.
        """
        return {
            "trace_type": "pre_action_gate",
            "agent_id": agent_id,
            "intended_action": intended_action,
            "gate_approved": result.approved,
            "blocking_check": result.blocking_check,
            "sre_signals": {
                "error_budget_pct": result.error_budget_pct,
                "aqdd_depth": result.aqdd_depth,
                "her_trend_pct": result.her_trend,
            },
            "recommendation": result.recommendation,
            "checked_at": result.checked_at,
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
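
&lt;p&gt;Check 1 takes the remaining error budget as an input percentage, so something upstream has to compute it. A minimal sketch of that arithmetic follows, assuming an event-based SLO where the budget is the number of bad events the target allows over the window. The function name error_budget_remaining_pct and its parameters are illustrative, not part of agentsre.&lt;/p&gt;

```python
def error_budget_remaining_pct(slo_target: float, good_events: int, total_events: int) -> float:
    """Return the share of the error budget still unspent, on a 0-100 scale.

    slo_target is a fraction such as 0.999. The budget is the number of
    bad events the SLO allows over the measurement window.
    """
    if total_events == 0:
        return 100.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        return 0.0  # a 100% SLO leaves no budget to spend
    actual_bad = total_events - good_events
    remaining = 1.0 - (actual_bad / allowed_bad)
    # Clamp so an overspent budget reports 0 rather than a negative number.
    return max(0.0, min(100.0, remaining * 100.0))


budget = error_budget_remaining_pct(slo_target=0.999, good_events=999_400, total_events=1_000_000)
print(f"{budget:.1f}% of error budget remaining")
```

&lt;p&gt;With a 99.9% target over one million requests, 600 failures spend 60% of the allowed 1,000, leaving about 40%. That would clear the gate's default 20% minimum.&lt;/p&gt;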

&lt;p&gt;How This Connects to the Full Arc&lt;br&gt;
Post 4 introduced DQR, TIE, HER, AQDD as observability SLIs — things you watch.&lt;br&gt;
Post 10 introduced ARO — who owns the agent when those SLIs breach.&lt;br&gt;
Post 11 introduced RTD — the reasoning observability layer.&lt;br&gt;
Post 12 introduced CUR — context budget as a reliability ceiling.&lt;br&gt;
This post introduces the Pre-Action SRE Gate — where all of those signals become decision inputs rather than observability outputs. The agent reads your SRE state before acting, not just after.&lt;br&gt;
Resilience requires explicit investment in circuit breakers, graceful degradation, and clear failure modes that preserve system integrity. Teams building agents must invest in resilience infrastructure before pushing to higher-criticality workloads. (Source: SourceForge)&lt;br&gt;
The Pre-Action Gate is that infrastructure. It's your agent's circuit breaker — not on retry loops or cost, but on system-level reliability state.&lt;br&gt;
The Postmortem Template Gap&lt;br&gt;
79% of organizations now have AI agents in production. Gartner warns 40% of those projects will be canceled due to poor risk controls. The incidents happening in that gap don't fit existing postmortem templates because current templates ask: what changed? who deployed? what failed? (Source: Kore.ai)&lt;br&gt;
They don't ask: what was the error budget state when the agent acted? Was AQDD elevated, meaning the approval layer was already overwhelmed? Had the agent's HER been trending up, meaning it was already in unreliable territory?&lt;br&gt;
Those questions need to be in your postmortem template. Add a section: Agent Pre-Action State — error budget at time of action, AQDD depth, HER trend. If your postmortem can't answer those three questions, you don't have the data to prevent the same incident from happening again.&lt;br&gt;
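&lt;/p&gt;

&lt;p&gt;One way to make that section hard to skip is to store it as a structured record and flag any question left unanswered. This is a hypothetical sketch: the AgentPreActionState class and its unanswered method are illustrative names, not part of agentsre.&lt;/p&gt;

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AgentPreActionState:
    """Hypothetical postmortem section: SRE signals at the moment the agent acted."""
    error_budget_pct: Optional[float] = None  # error budget remaining at time of action
    aqdd_depth: Optional[int] = None          # approval queue depth at time of action
    her_trend_pct: Optional[float] = None     # agent's recent HER at time of action

    def unanswered(self) -> list:
        """Names of the questions this postmortem cannot answer yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


state = AgentPreActionState(error_budget_pct=12.5, aqdd_depth=7)
print(state.unanswered())
```

&lt;p&gt;A non-empty result means the data needed to explain the agent's decision was never captured, which is itself a postmortem finding.&lt;/p&gt;

&lt;p&gt;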
The code is in agentsre/pre_action_gate.py on GitHub. MIT licensed, zero external dependencies.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1uu32oi9dqse8tf02a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fah1uu32oi9dqse8tf02a.jpeg" alt=" " width="427" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>cursor</category>
    </item>
    <item>
      <title>Why Your AI Agent Monitoring is Wrong (And How to Fix It)</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Mon, 25 May 2026 11:35:24 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/why-your-ai-agent-monitoring-is-wrong-and-how-to-fix-it-1b25</link>
      <guid>https://dev.to/ajaydevineni/why-your-ai-agent-monitoring-is-wrong-and-how-to-fix-it-1b25</guid>
      <description>&lt;p&gt;As I discussed in my SLO Design article, traditional reliability metrics fail for agentic AI systems. Now let's look at how to actually implement semantic monitoring in production.&lt;br&gt;
Your AI agent is running in production.&lt;br&gt;
HTTP 200. Uptime 99.9%. All dashboards are green.&lt;br&gt;
But it's making the wrong decision 30% of the time.&lt;br&gt;
Your monitoring won't tell you.&lt;br&gt;
The Gap&lt;br&gt;
I spent six months figuring this out the hard way. Traditional SRE monitoring measures infrastructure. Network latency. Error rates. Uptime. It's designed for services that crash when they break. But agents don't crash. They degrade. Slowly. Silently.&lt;br&gt;
An agent can be:&lt;/p&gt;

&lt;p&gt;94% accurate (still 94%) &lt;br&gt;
But losing confidence (0.92 to 0.41)&lt;br&gt;
Compensating by calling tools 3x more (1.1x to 3.1x)&lt;br&gt;
While humans reject more of its output (1% to 19%)&lt;br&gt;
As work piles up waiting for approval (8 to 340 items)&lt;/p&gt;

&lt;p&gt;Your monitoring sees "everything is fine."&lt;br&gt;
You see $2M impact by the time you notice.&lt;br&gt;
What We Actually Need to Measure&lt;br&gt;
Not infrastructure metrics. Semantic metrics.&lt;br&gt;
Four things:&lt;br&gt;
Decision Quality Rate &lt;strong&gt;(DQR)&lt;/strong&gt;&lt;br&gt;
Is the agent picking the right tool?&lt;br&gt;
Healthy: 92%+&lt;br&gt;
Threshold for action: &amp;lt;85%&lt;br&gt;
Tool Invocation Efficiency &lt;strong&gt;(TIE)&lt;/strong&gt;&lt;br&gt;
Is it over-compensating by calling tools more than normal?&lt;br&gt;
Healthy: 1.0-1.2x baseline&lt;br&gt;
Threshold for action: &amp;gt;1.5x&lt;br&gt;
Human Escalation Rate &lt;strong&gt;(HER)&lt;/strong&gt;&lt;br&gt;
Are humans rejecting its decisions?&lt;br&gt;
Healthy: &amp;lt;2%&lt;br&gt;
Threshold for action: &amp;gt;5%&lt;br&gt;
Approval Queue Depth Drift &lt;strong&gt;(AQDD)&lt;/strong&gt;&lt;br&gt;
Is work piling up waiting for approval?&lt;br&gt;
Healthy: &amp;lt;20 pending&lt;br&gt;
Threshold for action: &amp;gt;50 pending&lt;br&gt;
When any of these drift past its threshold, semantic failure can be as little as 48 hours away.&lt;br&gt;
Real Scenario&lt;br&gt;
Tuesday 2pm: Agent starts degrading. DQR drops from 94% to 88%. TIE increases from 1.1x to 1.4x. Nothing alarming yet by traditional metrics.&lt;br&gt;
Your infrastructure monitoring stays green.&lt;br&gt;
Thursday 10am: DQR at 62%. TIE at 3.1x. Queue at 340 items.&lt;br&gt;
Your first alert finally fires - from your infrastructure monitoring noticing error rates creeping up.&lt;br&gt;
You've just lost 40+ hours of bad decisions.&lt;br&gt;
With semantic SLIs, you would have known Tuesday at 2:15pm.&lt;br&gt;
How We Built It&lt;br&gt;
We built a semantic SLI monitoring system that:&lt;/p&gt;

&lt;p&gt;Tracks what matters - DQR, TIE, HER, AQDD (not uptime)&lt;br&gt;
Detects degradation early - 48 hours before traditional SLIs&lt;br&gt;
Suggests remediation - Not just "something's wrong"&lt;br&gt;
Automates response - Progressive autonomy constraints&lt;/p&gt;

&lt;p&gt;When degradation is detected:&lt;/p&gt;

&lt;p&gt;Agent autonomy automatically constrained (FULL → GUIDED → SUPERVISED → BLOCKED)&lt;br&gt;
Slack notification sent with context&lt;br&gt;
Remediation steps suggested (prioritized by success rate)&lt;br&gt;
Everything tracked for audit and learning&lt;/p&gt;
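
&lt;p&gt;The autonomy ladder can be sketched as a small function over the four SLIs. The action thresholds below are the ones listed earlier in this post. The rule of stepping down one level per breached SLI is an assumption made for illustration, and the names breached_slis and autonomy_level are hypothetical rather than the agentsre API.&lt;/p&gt;

```python
# Ladder levels as named in the post; the breach-count mapping is an assumption.
AUTONOMY_LADDER = ["FULL", "GUIDED", "SUPERVISED", "BLOCKED"]


def breached_slis(dqr: float, tie: float, her: float, aqdd: int) -> list:
    """Return the names of semantic SLIs past their action threshold."""
    breaches = []
    if 85.0 > dqr:   # DQR below 85%
        breaches.append("DQR")
    if tie > 1.5:    # TIE above 1.5x baseline
        breaches.append("TIE")
    if her > 5.0:    # HER above 5%
        breaches.append("HER")
    if aqdd > 50:    # more than 50 items pending approval
        breaches.append("AQDD")
    return breaches


def autonomy_level(breaches: list) -> str:
    """Step one level down the ladder per breached SLI, bottoming out at BLOCKED."""
    step = min(len(breaches), len(AUTONOMY_LADDER) - 1)
    return AUTONOMY_LADDER[step]


breaches = breached_slis(dqr=88.0, tie=1.6, her=1.0, aqdd=8)
print(breaches, autonomy_level(breaches))
```

&lt;p&gt;A production policy would likely weight the SLIs differently per agent role. The point of the sketch is only that the constraint is derived from the signals, not set by hand.&lt;/p&gt;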

&lt;p&gt;Code Example&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agentsre.orchestration import FintechSREOrchestrator, AgentRole, AlertManager

# Initialize orchestrator
orch = FintechSREOrchestrator()
orch.register_agent("payment-1", AgentRole.PAYMENT_PROCESSOR)

# Initialize alerts
alerts = AlertManager()

def on_critical_alert(alert_dict):
    # send_to_slack is not defined here; supply your own Slack integration
    send_to_slack(alert_dict)

alerts.slack_handler = on_critical_alert

# Update metrics as agent runs
orch.update_metrics(
    agent_id="payment-1",
    dqr=62.0,      # Decision quality degraded
    tie=2.8,       # Tool calls increased
    her=15.0,      # Escalations up
    aqd=180,       # Queue growing
    confidence=0.42,
    cost=0.0003
)

# Create alert with remediation suggestions
alert = alerts.create_alert(
    agent_id="payment-1",
    reason="Semantic degradation detected",
    triggered_metrics=["DQR", "TIE", "HER"],
    current_values={
        "dqr": 62.0,
        "tie": 2.8,
        "her": 15.0,
        "aqd": 180
    }
)

# Get remediation steps
for step in alert.suggested_remediations[:3]:
    print(f"→ {step.action} ({step.estimated_time_minutes}min)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ Review latest 10 agent decisions - identify pattern (15min)
→ Check upstream service - likely returning bad data (10min)
→ Agent over-compensating - check confidence scores (10min)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What This Means for SRE&lt;br&gt;
You're not just detecting problems. You're understanding them.&lt;br&gt;
Instead of:&lt;/p&gt;

&lt;p&gt;"Error rate is high"&lt;br&gt;
"Latency is up"&lt;br&gt;
"Something's wrong"&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;p&gt;"Agent decision quality dropped 15%, tool calls increased 2.8x, humans rejecting 15% of output, 180 items pending approval"&lt;br&gt;
Suggested fix: Check upstream service (likely corrupting data)&lt;br&gt;
Severity: CRITICAL&lt;/p&gt;

&lt;p&gt;That's the difference between reactive and proactive reliability.&lt;br&gt;
Open Source&lt;br&gt;
Built all this open source. MIT licensed.&lt;br&gt;
Tested in production at scale. Works with LangChain, CrewAI, Bedrock.&lt;br&gt;
GitHub: &lt;a href="https://github.com/Ajay150313/agentsre" rel="noopener noreferrer"&gt;https://github.com/Ajay150313/agentsre&lt;/a&gt;&lt;br&gt;
For Your Team&lt;br&gt;
If you're running agents in production, you probably have this problem too. You just don't know it yet.&lt;br&gt;
Try semantic SLIs. If you catch something you didn't know was degrading (most teams do), you'll know it was worth it.&lt;br&gt;
The cost of not knowing? Sometimes it's $2M.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>fintech</category>
    </item>
    <item>
      <title>The Context Window Is RAM — Why Your Agent's SLIs Are Telling You It's Full</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Fri, 22 May 2026 02:18:21 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/the-context-window-is-ram-why-your-agents-slis-are-telling-you-its-full-4ejb</link>
      <guid>https://dev.to/ajaydevineni/the-context-window-is-ram-why-your-agents-slis-are-telling-you-its-full-4ejb</guid>
      <description>&lt;p&gt;The Microsoft team that built the Azure SRE Agent published something in January that I keep coming back to.&lt;br&gt;
Six months into building it, they realized they weren't building an SRE agent. They were building a context engineering system that happens to do site reliability engineering. Better models were table stakes, but what moved the needle was what they controlled: disciplined context management. (Source: Kore.ai)&lt;br&gt;
That framing is exactly right. And it has a reliability implication that I haven't seen anyone write about directly.&lt;br&gt;
The Problem&lt;br&gt;
Your agent's context window is volatile working memory. Fast, expensive, and non-persistent. It's RAM, not storage. When the session ends, it's gone. When it fills up, quality degrades — not linearly, but in ways that are hard to predict and easy to miss.&lt;br&gt;
As you fill the context window, model quality drops non-linearly. "Lost in the middle," "not adhering to my instructions," and long-context degradation show up well before the advertised token limits. More tokens don't just cost latency; they quietly erode accuracy. (Source: Kore.ai)&lt;br&gt;
That quiet erosion is the reliability failure mode. It doesn't throw an exception. It doesn't spike your error rate. Your agent keeps running. It just makes progressively worse decisions as the context fills.&lt;br&gt;
And here's the part I want to be specific about: you already have the SLIs to catch this. You just haven't connected them to context state yet.&lt;br&gt;
What Context Overflow Looks Like in Your SLIs&lt;br&gt;
When an agent's context fills beyond its effective working range, three things happen in order:&lt;br&gt;
DQR (Decision Quality Rate) drops first. The agent's decisions get worse because early instructions are now competing with thousands of tokens of recent tool output. An instruction from turn 3 gets buried under content that arrived after it. The agent isn't ignoring it; it's attending more reliably to recent content as the session grows. This is a passive decay process, not a model bug. (Source: incident.io)&lt;br&gt;
RTD (Reasoning Trace Depth) climbs next. The agent re-plans more because its earlier context — what it already established about the problem — is partially decayed. It's not re-planning because something changed. It's re-planning because it partially forgot what it already figured out.&lt;br&gt;
TIE (Tool Invocation Efficiency) degrades last. The agent starts calling tools to reconstruct context it already had. It queries the same data sources again. It re-fetches runbooks it already read. Tool call count per task climbs above baseline while task quality continues to fall.&lt;br&gt;
By the time TIE is visibly elevated, you're already well into the degradation window. DQR was the earlier signal. And DQR dropping in a long-running session, without any external trigger, is your context overflow signature.&lt;br&gt;
The Architecture Fix&lt;br&gt;
Mem0's 2026 benchmarks quantify the difference clearly: full-context baseline (everything packed into the window) scored 72.9% accuracy using 26,000 tokens per query at 17 second p95 latency. A two-layer memory architecture scored 91.6% accuracy using under 7,000 tokens at 1.4 second p95 latency. That's an 18.7 point accuracy improvement while using roughly 4x fewer tokens and cutting latency by about 91%. (Source: Yahoo Finance)&lt;br&gt;
The two-layer architecture is straightforward:&lt;br&gt;
Working memory (context window): Only what's needed for the current decision. Active task state, recent tool results, current instructions. Managed actively — compressed, summarized, or paged out as the session grows.&lt;br&gt;
Persistent memory (external store): Facts that persist across decisions and sessions. User preferences, established system state, prior investigation findings, runbook contents. Fetched into context when relevant, not kept resident the whole time.&lt;br&gt;
The discipline is knowing what belongs in each layer and managing the boundary actively.&lt;br&gt;
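&lt;/p&gt;

&lt;p&gt;The boundary between the two layers can be sketched in a few lines. This is a toy model under stated assumptions, not Mem0's design or the agentsre code: tokens are approximated by whitespace-separated words, the external store is a plain dict, and eviction is oldest-first. TwoLayerMemory, count_tokens, and recall are hypothetical names.&lt;/p&gt;

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


class TwoLayerMemory:
    """Working memory with a token budget, backed by a persistent key-value store."""

    def __init__(self, working_budget_tokens: int):
        self.working_budget_tokens = working_budget_tokens
        self.working = deque()   # (key, text) pairs currently resident in context
        self.persistent = {}     # stand-in for an external store

    def working_tokens(self) -> int:
        return sum(count_tokens(text) for _, text in self.working)

    def add(self, key: str, text: str) -> None:
        """Add an item to working memory, paging out the oldest items if over budget."""
        self.working.append((key, text))
        while self.working_tokens() > self.working_budget_tokens and len(self.working) > 1:
            old_key, old_text = self.working.popleft()
            self.persistent[old_key] = old_text

    def recall(self, key: str) -> None:
        """Fetch a paged-out item back into working memory when it becomes relevant."""
        resident = {k for k, _ in self.working}
        if key in self.persistent and key not in resident:
            self.add(key, self.persistent[key])


memory = TwoLayerMemory(working_budget_tokens=10)
memory.add("runbook", "restart the pod then verify health checks pass")
memory.add("latest_metrics", "cpu ninety percent on node four")
print([key for key, _ in memory.working], sorted(memory.persistent))
```

&lt;p&gt;Oldest-first eviction is the simplest policy, but it is exactly the one that buries early instructions. A real implementation would pin instructions and summarize tool output instead.&lt;/p&gt;

&lt;p&gt;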
Connecting This to Your Production Readiness Checklist&lt;br&gt;
Before a long-running agent goes to production, two questions need answers:&lt;br&gt;
What is the expected context budget for a typical session? Not the model's maximum. The budget at which you've measured DQR starting to degrade for this specific agent on this specific task class. That number is your operational ceiling, not the advertised token limit.&lt;br&gt;
What happens when the agent approaches that ceiling? Does it compress? Summarize and page out? Escalate to human? Or does it silently continue with degrading accuracy until something downstream notices?&lt;br&gt;
If the answer to the second question is "it keeps going," that's your reliability gap. The context ceiling needs the same circuit breaker thinking as your token budget ceiling from the cost post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agentsre/context_budget.py

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ContextBudgetTracker:
    """
    Track context utilization against operational DQR ceiling.

    The model's advertised token limit is not your operational limit.
    Your operational limit is the token count at which DQR starts
    to degrade for this agent on this task class. Establish that
    baseline in shadow mode. Set your ceiling below it.

    Attributes:
        agent_id: Agent being tracked
        task_class: Task type (DQR ceiling varies by task complexity)
        operational_ceiling_tokens: Tokens at which DQR degrades
            for this agent/task combination. NOT the model's max.
        warning_threshold_pct: Fraction of ceiling triggering warning
        current_tokens: Current context utilization
    """
    agent_id: str
    task_class: str
    operational_ceiling_tokens: int
    warning_threshold_pct: float = 0.75
    current_tokens: int = 0
    session_id: str = ""
    compression_events: int = 0

    @property
    def utilization_pct(self) -&amp;gt; float:
        """Current context utilization as fraction of operational ceiling."""
        return self.current_tokens / self.operational_ceiling_tokens

    @property
    def budget_status(self) -&amp;gt; str:
        """
        OK: within safe operating range
        WARNING: approaching DQR degradation ceiling
        CRITICAL: at or above operational ceiling, DQR degrading
        """
        u = self.utilization_pct
        if u &amp;lt; self.warning_threshold_pct:
            return "OK"
        elif u &amp;lt; 1.0:
            return "WARNING"
        return "CRITICAL"

    def update(self, current_tokens: int) -&amp;gt; dict:
        """
        Update current context utilization and return status record.
        Call this after each tool call or model response.

        Returns status record for logging to CloudWatch / Datadog.
        """
        self.current_tokens = current_tokens
        record = {
            "agent_id": self.agent_id,
            "session_id": self.session_id,
            "task_class": self.task_class,
            "current_tokens": self.current_tokens,
            "operational_ceiling": self.operational_ceiling_tokens,
            "utilization_pct": round(self.utilization_pct, 3),
            "budget_status": self.budget_status,
            "compression_events": self.compression_events,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        return record

    def record_compression(self) -&amp;gt; None:
        """Call when context compression or summarization fires."""
        self.compression_events += 1

    def should_compress(self) -&amp;gt; bool:
        """True when context is approaching DQR degradation ceiling."""
        return self.utilization_pct &amp;gt;= self.warning_threshold_pct

    def should_escalate(self) -&amp;gt; bool:
        """
        True when context is at or above operational ceiling.
        At this point DQR is actively degrading.
        Escalate to human or terminate session cleanly.
        """
        return self.utilization_pct &amp;gt;= 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Practical Baseline Protocol&lt;br&gt;
Before you can set an operational context ceiling, you need to know where DQR actually starts to degrade for your specific agent on your specific task class. The steps:&lt;br&gt;
Run the agent in shadow mode on a representative sample of tasks. Record DQR at 25%, 50%, 75%, and 100% of the model's advertised context limit. Find the inflection point, where DQR starts dropping. Set your operational ceiling at 80% of that inflection point so that warnings fire before quality actually degrades. At the ceiling, trigger compression or escalation, not continuation.&lt;br&gt;
This is the same baseline protocol as HER and RTD. Thirty days of shadow mode, measure the metric, set the threshold. The only difference is that context budget degradation is session-scoped rather than task-scoped.&lt;br&gt;
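&lt;/p&gt;

&lt;p&gt;The inflection-point step above can be sketched as a short function. The sample numbers are invented for illustration, and the 2-point DQR tolerance is an assumption to tune per agent and task class. find_operational_ceiling is a hypothetical helper, not part of agentsre.&lt;/p&gt;

```python
def find_operational_ceiling(measurements, max_dqr_drop: float = 2.0, safety_factor: float = 0.8):
    """Return an operational token ceiling from shadow-mode DQR measurements.

    measurements: list of (tokens, dqr_pct) pairs, for example sampled at
    25%, 50%, 75% and 100% of the advertised context limit.
    max_dqr_drop: DQR points below the first sample that count as degradation
    (hypothetical tolerance).
    safety_factor: fraction of the inflection point to use as the ceiling.
    """
    ordered = sorted(measurements)
    baseline_dqr = ordered[0][1]
    for tokens, dqr in ordered:
        if baseline_dqr - dqr > max_dqr_drop:
            # First sample where DQR has visibly dropped: treat it as the inflection point.
            return int(tokens * safety_factor)
    # No degradation observed anywhere in the sampled range.
    return None


# Made-up shadow-mode samples for a model with a 200,000-token advertised limit.
shadow_runs = [(50_000, 93.0), (100_000, 92.5), (150_000, 88.0), (200_000, 81.0)]
print(find_operational_ceiling(shadow_runs))
```

&lt;p&gt;Four samples give only a coarse estimate. Sampling more densely around the first drop would tighten the ceiling.&lt;/p&gt;

&lt;p&gt;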
Why This Post Belongs in This Series&lt;br&gt;
Post 4 established DQR as your output quality SLI. Post 9 established token budget as a cost circuit breaker. Post 11 introduced RTD as your reasoning observability layer.&lt;br&gt;
This post connects all three: context window mismanagement is the common cause that degrades DQR, elevates RTD, and burns your token budget simultaneously. Fix the memory architecture and you see improvement across all three SLIs. That's not a coincidence — they're measuring the same failure from different angles.&lt;br&gt;
The code is in agentsre/context_budget.py on GitHub. MIT licensed, zero external dependencies.&lt;br&gt;
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer&lt;br&gt;
github.com/Ajay150313/agentsre&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6pnufmeze84rbn9kur.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6pnufmeze84rbn9kur.jpeg" alt=" " width="427" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>azure</category>
    </item>
    <item>
      <title>Your OTel Traces Are Lying to You: Observability for the Reasoning Layer</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Tue, 19 May 2026 02:12:12 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/your-otel-traces-are-lying-to-you-observability-for-the-reasoning-layer-2f7p</link>
      <guid>https://dev.to/ajaydevineni/your-otel-traces-are-lying-to-you-observability-for-the-reasoning-layer-2f7p</guid>
      <description>&lt;p&gt;Three weeks ago someone on the AWS Builders Slack posted something that stopped me cold. Their production AI agent had been running for six hours. CPU normal. Memory stable. Latency within SLO. Zero error rate in CloudWatch.&lt;br&gt;
The agent was re-planning on every single task. One tool kept returning stale data. The agent recognized it, switched tools, got a different failure, re-planned again. It completed tasks — slowly, expensively, with degrading output quality. Nothing in the dashboard moved.&lt;br&gt;
This is not an edge case. This is the default failure mode of agentic AI in production, and your current observability stack cannot see it.&lt;br&gt;
Why OTel Misses the Problem&lt;br&gt;
OpenTelemetry is the best thing that's happened to observability in a decade. Traces, metrics, logs — stable across all three signal types as of the 2026 CNCF milestone. Auto-instrumentation is production-grade. The ecosystem is mature.&lt;br&gt;
And for agent reasoning behavior, it is the wrong level of abstraction.&lt;br&gt;
OTel traces infrastructure execution. A trace shows you: this request arrived, it called this service, that service called this database, the database returned in 42ms, the response went back. Perfect for distributed systems.&lt;br&gt;
An agent doesn't execute a fixed call graph. An agent reasons. It evaluates state, picks a tool, observes the result, decides whether to continue or re-plan, picks another tool. The reasoning path is dynamic. The same input can produce different call graphs on different runs depending on what the tools return.&lt;br&gt;
The key shift is that once agent reasoning is exported into your observability stack, traces stop showing only infrastructure execution and start showing reasoning behavior, but only if you're emitting the right data (Kore.ai).&lt;br&gt;
Most teams aren't. They're emitting infrastructure spans. The reasoning is invisible.&lt;/p&gt;

&lt;p&gt;The Pattern: Silent Degradation via Re-Planning Loops&lt;br&gt;
Here's what silent agent degradation looks like in a trace when you're not capturing reasoning:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;span: agent-task-processor  duration: 4.2s  status: OK
  span: tool-call-cloudwatch  duration: 0.8s  status: OK
  span: tool-call-s3          duration: 0.3s  status: OK
  span: tool-call-cloudwatch  duration: 0.8s  status: OK
  span: tool-call-dynamodb    duration: 0.4s  status: OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Looks fine. Four tool calls, all successful, task completed.&lt;br&gt;
Here's what's actually happening:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent receives task
→ plans: use CloudWatch metric X
→ calls CloudWatch: returns stale data (tool succeeds, data is wrong)
→ agent evaluates result: doesn't match expected state
→ RE-PLANS: try DynamoDB instead
→ calls DynamoDB: schema mismatch (tool succeeds, data wrong format)
→ RE-PLANS: back to CloudWatch, different metric
→ calls CloudWatch: stale again
→ RE-PLANS: escalate to human
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Four successful spans. Two re-planning cycles. One human escalation (an HER event). Zero errors in your monitoring.&lt;br&gt;
This is your RSI (Retry Storm Index) in action: not at the HTTP retry level, but at the reasoning level.&lt;/p&gt;
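
&lt;p&gt;A minimal sketch of how that sequence could be summarized, assuming your agent framework can hand you an ordered list of reasoning events. The event labels (plan, tool_call, replan, escalate) and the summarize_reasoning helper are hypothetical names for illustration, not part of OTel or any agent SDK:&lt;/p&gt;

```python
from collections import Counter
from typing import Dict, List


def summarize_reasoning(events: List[str]) -> Dict[str, int]:
    """Count what infrastructure spans hide: re-plans and escalations."""
    counts = Counter(events)
    return {
        "tool_calls": counts["tool_call"],
        "replans": counts["replan"],
        "escalations": counts["escalate"],
    }


# The walkthrough above, encoded as hypothetical event labels.
events = [
    "plan", "tool_call",      # CloudWatch: stale data
    "replan", "tool_call",    # DynamoDB: schema mismatch
    "replan", "tool_call",    # CloudWatch again: still stale
    "escalate",               # hand off to a human
]
summary = summarize_reasoning(events)
print(summary)
```

&lt;p&gt;None of these counts exist in the span view. They only appear if the agent loop records its own decisions.&lt;/p&gt;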

&lt;p&gt;Introducing Reasoning Trace Depth&lt;br&gt;
I want to introduce a new observable to pair with RSI: Reasoning Trace Depth (RTD).&lt;br&gt;
RTD = the number of re-planning cycles an agent goes through before either completing a task or escalating.&lt;br&gt;
Baseline for a healthy agent on routine tasks: 0–1 re-planning cycles.&lt;br&gt;
Warning threshold: 3+ re-planning cycles.&lt;br&gt;
Critical threshold: 5+ re-planning cycles (agent is effectively stuck).&lt;br&gt;
RTD is your earliest signal. It rises before HER (because the agent is still trying before escalating), before latency becomes visible to users, and before cost metrics show anomalous spend.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentDecisionTrace:
    """
    Structured reasoning trace for a single agent task execution.
    Emitted once per task, NOT once per tool call.
    This is your reasoning observability layer.
    """
    agent_id: str
    session_id: str
    task_id: str
    timestamp: str

    # Reasoning behavior
    initial_plan: str
    tools_called: List[str] = field(default_factory=list)
    replan_count: int = 0           # RTD (Reasoning Trace Depth)
    replan_reasons: List[str] = field(default_factory=list)

    # Outcome
    task_completed: bool = False
    human_escalated: bool = False   # HER signal

    # Cost signals
    total_tool_calls: int = 0
    latency_ms: int = 0

    # Quality proxy (if available)
    confidence_proxy: Optional[float] = None


def emit_decision_trace(trace: AgentDecisionTrace) -&amp;gt; dict:
    """
    Build the structured decision-trace record for your log aggregator.
    This sits ABOVE your OTel infrastructure spans.
    One entry per agent task: your reasoning observability layer.
    """
    record = {
        "trace_type": "agent_decision",
        "agent_id": trace.agent_id,
        "session_id": trace.session_id,
        "task_id": trace.task_id,
        "timestamp": trace.timestamp,
        "reasoning": {
            "initial_plan": trace.initial_plan,
            "replan_count": trace.replan_count,        # RTD
            "replan_reasons": trace.replan_reasons,
            "tools_sequence": trace.tools_called
        },
        "outcome": {
            "completed": trace.task_completed,
            "human_escalated": trace.human_escalated,  # HER
        },
        "cost": {
            "tool_calls_total": trace.total_tool_calls,
            "latency_ms": trace.latency_ms
        }
    }

    # Flag for immediate attention (critical takes precedence over warning)
    if trace.replan_count &amp;gt;= 5:
        record["alert"] = "RTD_CRITICAL"
    elif trace.replan_count &amp;gt;= 3:
        record["alert"] = "RTD_WARNING"

    return record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
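
&lt;p&gt;The replan count has to come from somewhere inside the agent loop. One way to collect it, sketched under the assumption that your loop has a hook you can call whenever the agent abandons a plan. ReplanTracker and record_replan are hypothetical names, and the thresholds are the 3+ and 5+ values from this post:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReplanTracker:
    """Hypothetical helper: accumulates re-plans for one agent task."""
    replan_reasons: List[str] = field(default_factory=list)

    def record_replan(self, reason: str) -> None:
        # Call this from the agent loop each time a plan is abandoned.
        self.replan_reasons.append(reason)

    @property
    def rtd(self) -> int:
        # Reasoning Trace Depth is the number of re-plans so far.
        return len(self.replan_reasons)

    def severity(self) -> Optional[str]:
        # Thresholds from this post: 3+ is a warning, 5+ is critical.
        if self.rtd >= 5:
            return "RTD_CRITICAL"
        if self.rtd >= 3:
            return "RTD_WARNING"
        return None


tracker = ReplanTracker()
for reason in ["stale CloudWatch data", "DynamoDB schema mismatch", "stale CloudWatch data"]:
    tracker.record_replan(reason)
print(tracker.rtd, tracker.severity())
```

&lt;p&gt;Keeping the reasons alongside the count matters: the count tells you the agent is struggling, the reasons tell you which tool or data source to look at.&lt;/p&gt;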

&lt;p&gt;The Three-Layer Observability Model for Agents&lt;br&gt;
Your current stack has two layers. You need three.&lt;br&gt;
Layer 1 — Infrastructure (you already have this)&lt;br&gt;
OTel traces, Prometheus metrics, structured logs. Tool call latency, error rates, resource utilization. This is what Datadog, Grafana, and CloudWatch show you. It's correct and necessary. It just doesn't see reasoning.&lt;br&gt;
Layer 2 — Control Plane (from Post 7 — RAR, RSI, DCS)&lt;br&gt;
Routing accuracy, retry patterns at the orchestration level, decomposition quality. This is your agent behavior at the workflow level — are tasks being routed correctly? Is the orchestrator stable?&lt;br&gt;
Layer 3 — Reasoning (what's missing)&lt;br&gt;
RTD (Reasoning Trace Depth), re-plan reasons, plan-to-execution delta, decision confidence proxies. One structured log entry per agent task. This is the layer your dashboards don't have.&lt;br&gt;
The diagnostic flow when something feels wrong but dashboards are green:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Check Layer 1: Is infrastructure healthy?&lt;br&gt;
→ Yes → move to Layer 2&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check Layer 2: Is RSI elevated? Is RAR degraded?&lt;br&gt;
→ RSI elevated → move to Layer 3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check Layer 3: Is RTD above baseline?&lt;br&gt;
→ RTD &amp;gt;= 3 (the warning threshold) → agent is re-planning, find the tool/data source causing it&lt;br&gt;
→ RTD normal, HER elevated → agent is escalating cleanly, check decision envelope&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
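
&lt;p&gt;The same flow as a rough triage function. This is a sketch, not a drop-in tool: the Signals fields and the diagnose helper are hypothetical, and it assumes you already compute each signal elsewhere. It also checks Layer 3 even when Layer 2 looks calm, which is an assumption of this sketch based on RTD being the earliest signal:&lt;/p&gt;

```python
from typing import NamedTuple


class Signals(NamedTuple):
    infra_healthy: bool   # Layer 1: infrastructure dashboards are green
    rsi_elevated: bool    # Layer 2: retry storm index above baseline
    rar_degraded: bool    # Layer 2: routing accuracy below baseline
    rtd: float            # Layer 3: recent average re-plans per task
    her_elevated: bool    # human escalation rate above baseline


def diagnose(s: Signals, rtd_warning: float = 3.0) -> str:
    """Walk the three layers in order and name the first one that looks wrong."""
    if not s.infra_healthy:
        return "layer1: infrastructure problem, fix that first"
    if s.rtd >= rtd_warning:
        return "layer3: re-planning loop, find the tool or data source causing it"
    if s.her_elevated:
        return "layer3: clean escalation, check the decision envelope"
    if s.rsi_elevated or s.rar_degraded:
        return "layer2: orchestration issue, check routing and retries"
    return "ok: no degradation found in any layer"


stuck = Signals(infra_healthy=True, rsi_elevated=True, rar_degraded=False,
                rtd=4.0, her_elevated=False)
print(diagnose(stuck))
```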

&lt;p&gt;What This Looks Like in CloudWatch&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')


def publish_rtd_metric(agent_id: str, rtd_value: int) -&amp;gt; None:
    """
    Publish Reasoning Trace Depth to CloudWatch.
    Alert when RTD exceeds 3: the agent is re-planning excessively.
    """
    cw.put_metric_data(
        Namespace='AgentSRE/Reasoning',
        MetricData=[{
            'MetricName': 'ReasoningTraceDepth',
            'Dimensions': [{'Name': 'AgentId', 'Value': agent_id}],
            'Value': float(rtd_value),
            'Unit': 'Count'
        }]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set your alarm at RTD &amp;gt; 3 sustained over a 5-minute window. That's your early warning before HER spikes, before users feel latency, before cost anomalies appear in your billing dashboard.&lt;/p&gt;
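
&lt;p&gt;One possible shape for that alarm, sketched as the keyword arguments for CloudWatch's put_metric_alarm. The namespace, metric name, and dimension match the metric above. The alarm name, the Average statistic, the one-minute period, and treating missing data as not breaching are assumptions to adjust for your own traffic:&lt;/p&gt;

```python
from typing import Any, Dict


def rtd_alarm_params(agent_id: str, threshold: float = 3.0) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"rtd-high-{agent_id}",   # hypothetical naming scheme
        "Namespace": "AgentSRE/Reasoning",
        "MetricName": "ReasoningTraceDepth",
        "Dimensions": [{"Name": "AgentId", "Value": agent_id}],
        "Statistic": "Average",                # assumption: mean RTD per period
        "Period": 60,                          # one-minute datapoints
        "EvaluationPeriods": 5,                # five in a row = 5-minute window
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",    # idle agents should not alarm
    }


params = rtd_alarm_params("billing-agent")
# With AWS credentials configured, this would be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```

&lt;p&gt;Building the parameters separately from the API call keeps the alarm definition easy to review and to unit test without touching AWS.&lt;/p&gt;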

&lt;p&gt;The Connection to Your Existing SLI Framework&lt;br&gt;
If you've been following this series:&lt;/p&gt;

&lt;p&gt;Post 4 introduced HER — your human escalation signal. HER is what happens after the agent gives up re-planning.&lt;br&gt;
Post 7 introduced RSI — your retry storm signal at the control plane level.&lt;br&gt;
This post introduces RTD — the earlier, reasoning-level signal that predicts both RSI and HER before they breach.&lt;/p&gt;

&lt;p&gt;RTD → feeds → RSI → feeds → HER&lt;br&gt;
The three form a causal chain. If you're only watching HER, you're watching the end of the chain. RTD gives you the front.&lt;/p&gt;
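
&lt;p&gt;To make the "front of the chain" idea concrete, here is a small sketch that measures how many windows RTD breaches ahead of HER. The numbers are made up purely for illustration, the 0.10 HER threshold is a placeholder, and first_breach is a hypothetical helper. Your own baselines will differ:&lt;/p&gt;

```python
from typing import List, Optional


def first_breach(series: List[float], threshold: float) -> Optional[int]:
    """Index of the first window whose value exceeds the threshold."""
    for i, value in enumerate(series):
        if value > threshold:
            return i
    return None


# Made-up per-window values, for illustration only.
rtd_avg = [0.5, 1.0, 3.5, 4.0, 4.5]         # average re-plans per task
her_rate = [0.02, 0.02, 0.03, 0.04, 0.12]   # share of tasks escalated

rtd_at = first_breach(rtd_avg, 3.0)
her_at = first_breach(her_rate, 0.10)       # placeholder HER threshold
lead = her_at - rtd_at                      # windows of early warning
print(rtd_at, her_at, lead)
```

&lt;p&gt;If the lead you measure in your own data is zero or negative, RTD isn't buying you early warning for that agent, and the thresholds or the instrumentation need another look.&lt;/p&gt;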

&lt;p&gt;The Practical Checklist&lt;br&gt;
Before your next agent ships, add to your production-readiness checklist:&lt;br&gt;
☐ Decision trace structured logging configured (one JSON entry per task, not per span)&lt;br&gt;
☐ RTD metric emitting to CloudWatch / Prometheus&lt;br&gt;
☐ RTD baseline established (30-day shadow mode — same as HER baseline protocol)&lt;br&gt;
☐ RTD alarm set at threshold &amp;gt; 3&lt;br&gt;
☐ RTD correlated to HER in your dashboards — rising RTD without rising HER means the agent is struggling but not yet escalating&lt;br&gt;
Your OTel traces are correct. They're just answering the wrong question.&lt;br&gt;
&lt;a href="https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_sre-agenticai-observability-activity-7462294037518159872-iF29?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer | github.com/Ajay150313/agentsre&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z9lhkgk1fhuawt4t8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2z9lhkgk1fhuawt4t8zg.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformeng</category>
    </item>
  </channel>
</rss>
