Your Agent Acts Without Checking Your Error Budget — That's the Failure Mode Nobody Is Tracking

#ai #sre #devops #cursor

Yesterday a piece came out that framed something I've been watching build across production environments for months.
There is a category of production incident that engineering teams are not tracking yet, because it doesn't fit any existing postmortem template. The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. By the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure. (Source: Kore.ai)
That argument happens because the two disciplines — SRE and autonomous agents — have never been formally connected at the decision layer.
Here's the connection I want to make explicit.
What Chaos Engineering Gets Right
Mature chaos engineering programs have a property that's easy to overlook because it's invisible when it's working. Before a human engineer initiates any experiment — a fault injection, a latency spike, a dependency kill — they make a judgment call: does this system have capacity to absorb a perturbation right now?
They check error budget burn rate. They look at whether upstream dependencies are stable. They assess whether the on-call team has bandwidth to respond if something goes wrong. They check whether there's a deploy in flight that makes this a bad time.
That judgment call is informal, often intuitive, and sometimes wrong. But it exists. It's the human-in-the-loop that decides whether the system is in a state to safely absorb autonomous action.
Agents don't make that call. They evaluate their task context, form a plan, and execute. The question "is right now a safe time for this action given the current reliability state of the system?" is not in their decision loop.
The agents delivering production value in 2026 share one defining property: bounded scope. The agent handles one domain, with a defined tool set, and explicitly refuses tasks outside that boundary. The boundary is what makes autonomous deployment safe. (Source: GlobeNewswire)
Boundary on task scope is necessary. It's not sufficient. You also need a boundary on timing — a gate that checks whether the system's current reliability state can absorb what the agent is about to do.
The Pre-Action SRE Gate
I want to introduce a concrete pattern here: the Pre-Action SRE Gate — a check an agent runs against your existing SRE signals before executing any state-changing action.
The gate has three checks, all using metrics I've built out across this series:
Check 1 — Error Budget Headroom
Before acting, the agent queries the remaining SLO error budget for the services in its blast radius. If the remaining budget is below threshold, the service has already spent most of its allowance for the window, and the agent does not act autonomously. It escalates.
This is the chaos engineering judgment call, formalized as a programmatic check.
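To make the input to this check concrete, here is a minimal sketch of how the remaining error budget percentage could be derived from request counts. The function name `error_budget_remaining_pct` and the event-count inputs are my assumptions for illustration; in practice you would pull this number from whatever SLO tooling you already run.

```python
def error_budget_remaining_pct(slo_target: float,
                               good_events: int,
                               total_events: int) -> float:
    """Percent of the error budget still unspent in the SLO window.

    slo_target is a fraction such as 0.999. The result is clamped
    to the 0-100 range so an overspent budget reads as 0.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 100.0  # no traffic observed, so nothing has been spent

    allowed_bad = (1.0 - slo_target) * total_events  # budget in events
    actual_bad = total_events - good_events          # budget consumed
    remaining = 1.0 - actual_bad / allowed_bad
    return max(0.0, min(100.0, remaining * 100.0))


# Example: a 99.9% SLO with 800 failures out of 1,000,000 requests.
# The budget is roughly 1,000 failures, so about a fifth of it is left.
remaining = error_budget_remaining_pct(0.999,
                                       good_events=999_200,
                                       total_events=1_000_000)
```

A value like this is what would be passed as `error_budget_pct` to the gate.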
Check 2 — AQDD State
Approval Queue Depth Drift tells you whether the human oversight layer is already backed up. If AQDD is elevated — meaning humans can't process approvals fast enough — autonomous action during that window means any mistake won't be caught in time. Agent holds.
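The precise AQDD formula lives in Post 4 and isn't restated here, so the sketch below is only one plausible reading: queue depth is the number of approval requests still open, and drift is the current depth minus a baseline median. The `ApprovalRequest` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import median
from typing import Optional, Sequence


@dataclass
class ApprovalRequest:
    """Hypothetical shape of one human-approval request."""
    created_at: datetime
    resolved_at: Optional[datetime] = None  # None means still pending


def queue_depth(requests: Sequence[ApprovalRequest], at: datetime) -> int:
    """Count requests that were created but not yet resolved at time `at`."""
    return sum(
        1 for r in requests
        if r.created_at <= at and (r.resolved_at is None or r.resolved_at > at)
    )


def depth_drift(requests: Sequence[ApprovalRequest],
                now: datetime,
                baseline_depths: Sequence[int]) -> float:
    """Current depth minus the median baseline depth (assumed definition)."""
    if not baseline_depths:
        raise ValueError("baseline_depths must not be empty")
    return queue_depth(requests, now) - median(baseline_depths)


# Example: three open requests and one resolved, against a quiet baseline.
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
requests = [ApprovalRequest(now - timedelta(minutes=m)) for m in (5, 10, 20)]
requests.append(ApprovalRequest(now - timedelta(minutes=30),
                                resolved_at=now - timedelta(minutes=15)))
current_depth = queue_depth(requests, now)
drift = depth_drift(requests, now, baseline_depths=[1, 1, 2, 1])
```

The gate in this post takes the raw depth as `aqdd_depth`; the drift value is useful when calibrating what "elevated" means for your own queue.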
Check 3 — HER Trend
If the agent's own Human Escalation Rate has been elevated in the recent window, it's operating outside its reliable envelope. Letting it take autonomous action in that state compounds the risk. Agent escalates.
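As a rough illustration of a "recent window" HER, the sketch below keeps the last N task outcomes and reports the share that were escalated. It assumes HER is simply escalated tasks divided by total tasks in the window; the `RollingHER` class and its default window size are my own placeholders, not the definition from Post 4.

```python
from collections import deque


class RollingHER:
    """Human Escalation Rate over the most recent `window` task outcomes."""

    def __init__(self, window: int = 50):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque with maxlen drops the oldest outcome automatically
        self._outcomes = deque(maxlen=window)

    def record(self, escalated: bool) -> None:
        """Record one finished task: True if it was escalated to a human."""
        self._outcomes.append(bool(escalated))

    def rate_pct(self) -> float:
        """Escalated share of the window as a percentage (0.0 if empty)."""
        if not self._outcomes:
            return 0.0
        return 100.0 * sum(self._outcomes) / len(self._outcomes)


# Example: a window of four tasks where one was escalated.
her = RollingHER(window=4)
for escalated in (True, False, False, False):
    her.record(escalated)
recent_her_pct = her.rate_pct()
```

The resulting percentage is what the gate would receive as `her_trend_pct`.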
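None of these metrics are new. HER and AQDD come from Post 4 of this series, the escalation path comes from the ARO ownership model in Post 10, and error budgets are standard SRE practice. What's new is using them as gates before action, not just as observability signals after the fact.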
# agentsre/pre_action_gate.py

from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timezone
import json

@dataclass
class SREGateResult:
    """
    Result of a Pre-Action SRE Gate check.

    If approved is False, the agent must not proceed with
    autonomous action; escalate to human owner per ARO record.

    Attributes:
        approved: Whether autonomous action is cleared
        blocking_check: Which check blocked (if any)
        error_budget_pct: Current error budget remaining (0-100)
        aqdd_depth: Current approval queue depth
        her_trend: Recent HER rate (0-100)
        recommendation: What the agent should do
        checked_at: Timestamp of gate check
    """
    approved: bool
    blocking_check: Optional[str]
    error_budget_pct: float
    aqdd_depth: int
    her_trend: float
    recommendation: str
    checked_at: str

class PreActionSREGate:
    """
    Pre-Action SRE Gate: checks your SRE signal state before
    an agent executes any autonomous write, remediation, scale
    event, or config change.

    This is the chaos engineering judgment call, formalized.
    A human engineer checks these things before running an experiment.
    Your agent should check them before acting autonomously.

    Thresholds should be calibrated per agent and task class
    in shadow mode, using the same protocol as HER and RTD baselines.
    """

    def __init__(self,
                 error_budget_min_pct: float = 20.0,
                 aqdd_max_depth: int = 3,
                 her_max_trend_pct: float = 15.0):
        """
        Args:
            error_budget_min_pct: Minimum error budget % required
                for autonomous action. Below this = escalate.
                Default 20%: the agent should not consume budget
                that's already critically low.
            aqdd_max_depth: Max approval queue depth before
                autonomous action is blocked. Above this,
                humans can't course-correct fast enough.
            her_max_trend_pct: Max recent HER rate before
                autonomous action is blocked. Elevated HER
                means agent is already outside reliable envelope.
        """
        self.error_budget_min_pct = error_budget_min_pct
        self.aqdd_max_depth = aqdd_max_depth
        self.her_max_trend_pct = her_max_trend_pct

    def check(self,
              agent_id: str,
              intended_action: str,
              error_budget_pct: float,
              aqdd_depth: int,
              her_trend_pct: float) -> SREGateResult:
        """
        Run pre-action SRE gate check.

        Call this before any autonomous state-changing action.
        If result.approved is False, escalate and do not act.

        Args:
            agent_id: Agent requesting action clearance
            intended_action: Description of what agent plans to do
            error_budget_pct: Current error budget remaining (0-100)
            aqdd_depth: Current approval queue depth
            her_trend_pct: Agent's recent HER rate (0-100)

        Returns:
            SREGateResult with approval decision and reasoning
        """
        # Note: agent_id and intended_action do not affect the decision;
        # pass the same values to to_audit_log() to record the attempt.

        # Check 1: Error budget headroom
        if error_budget_pct < self.error_budget_min_pct:
            return SREGateResult(
                approved=False,
                blocking_check="error_budget",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Error budget at {error_budget_pct:.1f}%, "
                    f"below {self.error_budget_min_pct}% minimum. "
                    "Escalate to human owner. Do not act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 2: Approval queue state
        if aqdd_depth > self.aqdd_max_depth:
            return SREGateResult(
                approved=False,
                blocking_check="aqdd",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"Approval queue depth {aqdd_depth} exceeds "
                    f"maximum {self.aqdd_max_depth}. "
                    "Human oversight is backed up, so autonomous action "
                    "cannot be safely course-corrected. Hold."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # Check 3: Agent's own HER trend
        if her_trend_pct > self.her_max_trend_pct:
            return SREGateResult(
                approved=False,
                blocking_check="her_trend",
                error_budget_pct=error_budget_pct,
                aqdd_depth=aqdd_depth,
                her_trend=her_trend_pct,
                recommendation=(
                    f"HER at {her_trend_pct:.1f}%, "
                    f"above {self.her_max_trend_pct}% threshold. "
                    "Agent is operating outside reliable envelope. "
                    "Escalate rather than act autonomously."
                ),
                checked_at=datetime.now(timezone.utc).isoformat()
            )

        # All checks passed
        return SREGateResult(
            approved=True,
            blocking_check=None,
            error_budget_pct=error_budget_pct,
            aqdd_depth=aqdd_depth,
            her_trend=her_trend_pct,
            recommendation="Autonomous action cleared. Proceed within blast radius.",
            checked_at=datetime.now(timezone.utc).isoformat()
        )

    def to_audit_log(self, agent_id: str,
                     intended_action: str,
                     result: SREGateResult) -> dict:
        """
        Structured audit log entry for every gate check.
        Every autonomous action attempt, approved or blocked,
        should be logged. This is your agent action audit trail.
        """
        return {
            "trace_type": "pre_action_gate",
            "agent_id": agent_id,
            "intended_action": intended_action,
            "gate_approved": result.approved,
            "blocking_check": result.blocking_check,
            "sre_signals": {
                "error_budget_pct": result.error_budget_pct,
                "aqdd_depth": result.aqdd_depth,
                "her_trend_pct": result.her_trend,
            },
            "recommendation": result.recommendation,
            "checked_at": result.checked_at,
        }
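One way a caller might wire the gate into an agent's execution path is sketched below. Everything here is a hypothetical wrapper, not part of the module above: `run_with_gate` and its four callables are names I made up, and the only thing it assumes about the gate result is that it exposes `approved` and `recommendation`, as `SREGateResult` does.

```python
from typing import Any, Callable, Dict, Protocol


class GateResult(Protocol):
    """Minimal shape this wrapper needs from a gate result."""
    approved: bool
    recommendation: str


def run_with_gate(check: Callable[[], GateResult],
                  action: Callable[[], Any],
                  escalate: Callable[[str], Any],
                  audit: Callable[[Dict[str, Any]], None]) -> Any:
    """Run `action` only if the gate approves; otherwise escalate.

    The audit callback fires for every attempt, approved or blocked,
    before anything is executed.
    """
    result = check()
    audit({
        "gate_approved": result.approved,
        "recommendation": result.recommendation,
    })
    if result.approved:
        return action()
    # Blocked: hand the gate's reasoning to whoever owns the agent.
    return escalate(result.recommendation)
```

With the class from this post, `check` could be a lambda that calls `gate.check(...)` with freshly fetched signals, and `audit` could write the dict from `gate.to_audit_log(...)` instead of the reduced one shown here. The design point is that the state-changing call sits behind the gate, so there is no code path where the agent acts without a logged decision.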

How This Connects to the Full Arc
Post 4 introduced DQR, TIE, HER, AQDD as observability SLIs — things you watch.
Post 10 introduced ARO — who owns the agent when those SLIs breach.
Post 11 introduced RTD — the reasoning observability layer.
Post 12 introduced CUR — context budget as a reliability ceiling.
This post introduces the Pre-Action SRE Gate — where all of those signals become decision inputs rather than observability outputs. The agent reads your SRE state before acting, not just after.
Resilience requires explicit investment in circuit breakers, graceful degradation, and clear failure modes that preserve system integrity. Teams building agents must invest in resilience infrastructure before pushing to higher-criticality workloads. (Source: SourceForge)
The Pre-Action Gate is that infrastructure. It's your agent's circuit breaker — not on retry loops or cost, but on system-level reliability state.
The Postmortem Template Gap
79% of organizations now have AI agents in production. Gartner warns 40% of those projects will be canceled due to poor risk controls. The incidents happening in that gap don't fit existing postmortem templates because current templates ask: What changed? Who deployed? What failed? (Source: Kore.ai)
They don't ask: what was the error budget state when the agent acted? Was AQDD elevated, meaning the approval layer was already overwhelmed? Had the agent's HER been trending up, meaning it was already in unreliable territory?
Those questions need to be in your postmortem template. Add a section: Agent Pre-Action State — error budget at time of action, AQDD depth, HER trend. If your postmortem can't answer those three questions, you don't have the data to prevent the same incident from happening again.
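If you already log gate checks in the shape produced by `to_audit_log`, the postmortem section can be generated rather than reconstructed from memory. The helper below is a sketch under that assumption; `render_pre_action_state` is a hypothetical name, and the layout of the rendered section is just one option.

```python
from typing import Any, Dict


def render_pre_action_state(entry: Dict[str, Any]) -> str:
    """Render an 'Agent Pre-Action State' postmortem section as plain text.

    `entry` is expected to look like the dict from to_audit_log();
    any missing field is reported as 'unknown' instead of raising.
    """
    signals = entry.get("sre_signals") or {}

    def fmt(key: str, suffix: str = "") -> str:
        value = signals.get(key)
        return "unknown" if value is None else f"{value}{suffix}"

    if entry.get("gate_approved"):
        decision = "approved"
    else:
        decision = f"blocked by {entry.get('blocking_check') or 'unknown'}"

    lines = [
        "Agent Pre-Action State",
        f"Agent: {entry.get('agent_id', 'unknown')}",
        f"Intended action: {entry.get('intended_action', 'unknown')}",
        f"Gate checked at: {entry.get('checked_at', 'unknown')}",
        f"Error budget at time of action: {fmt('error_budget_pct', '%')}",
        f"AQDD depth: {fmt('aqdd_depth')}",
        f"HER trend: {fmt('her_trend_pct', '%')}",
        f"Gate decision: {decision}",
    ]
    return "\n".join(lines)
```

An entry with no `sre_signals` renders three "unknown" lines, which is itself a useful postmortem finding: the agent acted and nobody recorded the reliability state it acted in.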
The code is in agentsre/pre_action_gate.py on GitHub. MIT licensed, zero external dependencies.
Ajay Devineni | AWS Community Builder | Senior SRE/Platform Engineer
github.com/Ajay150313/agentsre
