Kanish Tyagi

Agent SRE — SLOs, Error Budgets, and Circuit Breakers for AI Agents

When a traditional web service goes down, you get a 500 error. You check the logs, find the exception, fix it, deploy. The failure is deterministic and reproducible.

When an AI agent degrades, it's different. It doesn't crash — it just starts giving worse answers. It calls tools more times than it should. It hallucinates details it used to get right. It slows down under load in ways that are hard to measure. By the time you notice, it's been quietly failing for hours.

This is why AI agents need Site Reliability Engineering. Not the same SRE you apply to APIs and databases — a version adapted to the specific ways agents fail.


How Agents Fail Differently

Traditional services fail in discrete, measurable ways: error rate goes up, latency goes up, availability goes down. You set thresholds, you get paged, you fix it.

Agents fail on a spectrum:

  • Accuracy degradation — outputs become less correct over time as context accumulates
  • Tool call inflation — agent starts using 15 tool calls for tasks that used to take 3
  • Hallucination rate increase — factual errors appear more frequently
  • Task completion drift — agent completes the literal request but misses the intent
  • Delegation loops — agent spawns sub-agents that spawn more sub-agents recursively

None of these show up as a 500 error. They require different observability, different thresholds, and different response strategies.
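To make one of these concrete: tool call inflation can be caught with a rolling average compared against a historical baseline. Here is a minimal sketch in plain Python; the detector class, its thresholds, and the `baseline` parameter are illustrative, not part of any toolkit:

```python
from collections import deque

class ToolCallInflationDetector:
    """Flags when the rolling average of tool calls per task drifts
    above a multiple of a historical baseline."""

    def __init__(self, baseline: float, window: int = 50, factor: float = 2.0):
        self.baseline = baseline      # expected tool calls per task
        self.factor = factor          # alert when the average exceeds baseline * factor
        self.recent = deque(maxlen=window)

    def record(self, tool_calls: int) -> bool:
        """Record one completed task; return True if inflation is detected."""
        self.recent.append(tool_calls)
        avg = sum(self.recent) / len(self.recent)
        return avg > self.baseline * self.factor

detector = ToolCallInflationDetector(baseline=3)
for calls in [3, 4, 2, 3]:
    detector.record(calls)            # normal workload, no alert
detector.record(15)
inflated = detector.record(15)        # rolling average now past 2x baseline
print(inflated)  # True
```

A sliding window matters here: a single expensive task should not page anyone, but a sustained drift should.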


Defining SLOs for Agents

A Service Level Objective is a target for how well your service should perform. For a web API, it's usually availability (99.9%) and latency (p99 < 200ms).

For an agent, you need different dimensions:

from agent_os.sre import AgentSLO

slo = AgentSLO(
    agent_id="research-agent",

    # How often should the agent successfully complete its task?
    task_success_rate=0.95,        # 95% of tasks complete successfully

    # How many tool calls is reasonable per task?
    max_tool_calls_per_task=10,    # alert if average exceeds this

    # How long should a task take?
    max_latency_ms=30000,          # 30 seconds max

    # How often can the agent error out?
    max_error_rate=0.02,           # 2% error rate maximum

    # How available should the agent be?
    min_availability=0.99,         # 99% uptime
)

These numbers come from your domain. A research agent doing deep analysis might reasonably use 20 tool calls. A customer service agent answering simple questions should never need more than 3. Define SLOs based on what good behavior looks like for your specific use case.


Error Budgets: When to Throttle vs Keep Running

An error budget is the inverse of your SLO. If your task success SLO is 95%, your error budget is 5% — that's how much failure you can tolerate before taking action.

from agent_os.sre import ErrorBudget

budget = ErrorBudget(slo=slo, window_hours=24)

# After 1000 tasks with 60 failures (6% error rate)
budget.record_tasks(total=1000, failed=60)

print(budget.remaining_percent)  # -20% — budget exhausted
print(budget.status)             # "EXHAUSTED"
print(budget.recommendation)     # "Throttle agent — error budget depleted"
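If you want to sanity-check those numbers, the underlying arithmetic is simple. A standalone sketch (plain Python, not the toolkit API; `error_budget_remaining` is a name I made up for illustration):

```python
def error_budget_remaining(total: int, failed: int, slo_success_rate: float) -> float:
    """Fraction of the error budget left over a window.

    Budget = allowed failures = total * (1 - slo_success_rate).
    Remaining = 1 - failed / budget; a negative value means overspent.
    """
    allowed = total * (1 - slo_success_rate)
    return 1 - failed / allowed

# 1000 tasks at a 95% success SLO allow 50 failures; 60 failures overspend by 20%
print(round(error_budget_remaining(1000, 60, 0.95), 4))  # -0.2
```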

When the error budget is exhausted, you have three options:

  1. Throttle — reduce the agent's call rate, buy time to investigate
  2. Degrade gracefully — switch to a simpler, more reliable mode
  3. Circuit break — stop the agent entirely until the issue is resolved

The right choice depends on the stakes. A research agent exhausting its budget? Throttle it. A financial agent exhausting its budget? Circuit break immediately.
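One way to encode that stakes-based decision is a simple risk-tier lookup that runs when the budget hits zero. A sketch; the tier names and the mapping are assumptions for illustration, not toolkit behavior:

```python
from enum import Enum

class BudgetAction(Enum):
    THROTTLE = "throttle"            # reduce call rate, keep serving
    DEGRADE = "degrade"              # fall back to a simpler, more reliable mode
    CIRCUIT_BREAK = "circuit_break"  # stop the agent entirely

# Hypothetical risk tiers; map each agent to a tier in your own config
RESPONSE_BY_RISK = {
    "low": BudgetAction.THROTTLE,        # e.g. a research agent
    "medium": BudgetAction.DEGRADE,      # e.g. a customer service agent
    "high": BudgetAction.CIRCUIT_BREAK,  # e.g. a financial agent
}

def on_budget_exhausted(risk_tier: str) -> BudgetAction:
    return RESPONSE_BY_RISK[risk_tier]

print(on_budget_exhausted("high"))  # BudgetAction.CIRCUIT_BREAK
```

The point of making this a table rather than an if-chain is that the policy is reviewable: anyone can read off what happens to which agent class when its budget runs out.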


Circuit Breakers: Automatically Stopping Degraded Agents

I built a circuit breaker simulation in the Colab notebook I contributed to this repo. Here's the core concept translated to production code:

from agent_os.sre import CircuitBreaker

breaker = CircuitBreaker(
    agent_id="research-agent",
    failure_threshold=0.05,     # open circuit at 5% error rate
    latency_threshold_ms=30000, # open circuit if avg latency exceeds 30s
    cooldown_seconds=300,       # try again after 5 minutes
)

# Before each agent call
if breaker.is_open():
    # Circuit is open — agent is degraded
    raise AgentUnavailableError("Circuit breaker open — agent degraded")

try:
    result = agent.run(task)
    breaker.record_success()
except Exception:
    breaker.record_failure()
    raise

The circuit has three states:

  • Closed — normal operation, all calls go through
  • Open — agent is degraded, calls are rejected immediately
  • Half-open — cooldown expired, testing if agent has recovered

print(breaker.state)  # "CLOSED", "OPEN", or "HALF_OPEN"
print(breaker.failure_rate)  # current failure rate
print(breaker.time_until_retry)  # seconds until half-open
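Under the hood, a breaker like this is a small state machine. Here is a self-contained sketch of the three states in plain Python; the class, field names, and thresholds are illustrative, and the toolkit's internals may differ:

```python
import time

class SimpleCircuitBreaker:
    """Minimal closed/open/half-open circuit breaker keyed on failure rate."""

    def __init__(self, failure_threshold=0.05, min_calls=20, cooldown_seconds=300):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls      # don't trip on tiny samples
        self.cooldown = cooldown_seconds
        self.successes = 0
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.cooldown:
            return "HALF_OPEN"          # cooldown expired, allow a probe call
        return "OPEN"

    def allow_call(self) -> bool:
        return self.state != "OPEN"

    def record_success(self):
        if self.state == "HALF_OPEN":   # probe succeeded: close and reset counts
            self.opened_at = None
            self.successes = self.failures = 0
        self.successes += 1

    def record_failure(self):
        self.failures += 1
        total = self.successes + self.failures
        if total >= self.min_calls and self.failures / total > self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = SimpleCircuitBreaker(min_calls=10, cooldown_seconds=60)
for _ in range(9):
    breaker.record_success()
breaker.record_failure()                # 1 failure in 10 calls = 10% > 5% threshold
print(breaker.state, breaker.allow_call())  # OPEN False
```

The `min_calls` guard is worth keeping in any real version: without it, the very first failure after a restart trips the breaker on a sample of one.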

This prevents a degraded agent from silently producing bad outputs at scale. Instead of hoping someone notices the quality drop, the circuit breaker makes the degradation explicit and stops it automatically.


Chaos Testing: Breaking Your Agent on Purpose

The best time to discover how your agent fails is before it fails in production. Chaos testing deliberately introduces failures to find weaknesses.

from agent_os.sre import ChaosEngine

chaos = ChaosEngine(agent=governed_agent)

# Inject 50% tool failure rate
with chaos.inject_tool_failures(rate=0.5):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Success rate under chaos: {sum(r.success for r in results)/len(results):.1%}")

# Inject high latency
with chaos.inject_latency(ms=5000):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Timeout rate under latency: {sum(r.timed_out for r in results)/len(results):.1%}")

# Inject policy violations
with chaos.inject_policy_violations(rate=0.3):
    results = [governed_agent.run(task) for task in test_tasks]
    print(f"Violation detection rate: {sum(r.violation_caught for r in results)/len(results):.1%}")

Run chaos tests before deploying. The questions you're answering:

  • Does the circuit breaker actually trip when it should?
  • Does the agent degrade gracefully or fail catastrophically?
  • Are error budgets calculated correctly under real failure conditions?
  • Does the governance layer catch violations during chaos?
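If you are not using the toolkit's ChaosEngine, the tool-failure injection pattern is straightforward to hand-roll: wrap each tool in a context manager so that a fraction of calls raise, then restore the originals on exit. A sketch; the `tools` registry and `ToolError` are hypothetical stand-ins for your own dispatch layer:

```python
import contextlib
import random

class ToolError(RuntimeError):
    """Raised by the chaos layer in place of a real tool result."""

# Hypothetical tool registry the agent dispatches through
tools = {"search": lambda query: f"results for {query}"}

@contextlib.contextmanager
def inject_tool_failures(rate: float, seed: int = 0):
    """Wrap every registered tool so a fraction of calls raise ToolError."""
    rng = random.Random(seed)       # seeded so chaos runs are reproducible
    original = dict(tools)

    def wrap(fn):
        def flaky(*args, **kwargs):
            if rng.random() < rate:
                raise ToolError("chaos: injected tool failure")
            return fn(*args, **kwargs)
        return flaky

    tools.update({name: wrap(fn) for name, fn in original.items()})
    try:
        yield
    finally:
        tools.clear()
        tools.update(original)      # always restore the real tools

# Measure how often calls survive a 50% tool failure rate
ok = 0
with inject_tool_failures(rate=0.5, seed=42):
    for _ in range(100):
        try:
            tools["search"]("agent SRE")
            ok += 1
        except ToolError:
            pass
print(f"Success rate under chaos: {ok}%")
```

Seeding the random source matters: a chaos run you cannot reproduce is a chaos run you cannot debug.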

What to Monitor and Alert On

Traditional monitoring: CPU, memory, error rate, latency.

Agent monitoring adds:

from agent_os.sre import AgentMetrics

metrics = AgentMetrics(agent_id="research-agent")

# Core reliability metrics
print(metrics.task_success_rate)      # % of tasks completed successfully
print(metrics.avg_tool_calls)         # average tool calls per task
print(metrics.p99_latency_ms)         # 99th percentile task latency
print(metrics.error_budget_remaining) # % of error budget left

# Agent-specific metrics
print(metrics.hallucination_rate)     # estimated factual error rate
print(metrics.policy_violation_rate)  # % of calls blocked by governance
print(metrics.delegation_depth_avg)   # average sub-agent spawn depth

# Circuit breaker state
print(metrics.circuit_state)          # CLOSED / OPEN / HALF_OPEN
print(metrics.circuit_opens_24h)      # how many times circuit opened today

Alert thresholds I'd recommend starting with:

  Metric                    Warning   Critical
  Task success rate         < 95%     < 90%
  Avg tool calls            > 15      > 25
  Error budget remaining    < 25%     < 10%
  Circuit opens (24h)       > 3       > 10
  Policy violation rate     > 5%      > 15%
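Those thresholds can be encoded directly so the same table drives alerting. A sketch in plain Python; the dictionary layout and the `lower_is_bad` flag are assumptions for illustration, not a toolkit feature:

```python
# Encoding of the alert table above; lower_is_bad marks metrics
# where dropping below the threshold is the problem.
THRESHOLDS = {
    "task_success_rate":      {"warning": 0.95, "critical": 0.90, "lower_is_bad": True},
    "avg_tool_calls":         {"warning": 15,   "critical": 25,   "lower_is_bad": False},
    "error_budget_remaining": {"warning": 0.25, "critical": 0.10, "lower_is_bad": True},
    "circuit_opens_24h":      {"warning": 3,    "critical": 10,   "lower_is_bad": False},
    "policy_violation_rate":  {"warning": 0.05, "critical": 0.15, "lower_is_bad": False},
}

def alert_level(metric: str, value: float) -> str:
    """Classify a metric reading as OK, WARNING, or CRITICAL."""
    t = THRESHOLDS[metric]
    breached = (lambda lim: value < lim) if t["lower_is_bad"] else (lambda lim: value > lim)
    if breached(t["critical"]):
        return "CRITICAL"
    if breached(t["warning"]):
        return "WARNING"
    return "OK"

print(alert_level("task_success_rate", 0.93))  # WARNING
print(alert_level("avg_tool_calls", 30))       # CRITICAL
```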

AccuracyDeclaration: Formally Declaring Agent Accuracy

One underused feature of the toolkit is AccuracyDeclaration — a formal, versioned statement of what accuracy levels your agent guarantees:

from agent_os.sre import AccuracyDeclaration

declaration = AccuracyDeclaration(
    agent_id="research-agent",
    version="1.2.0",
    declared_accuracy={
        "factual_retrieval": 0.94,    # 94% accuracy on factual questions
        "task_completion": 0.97,      # 97% of tasks complete successfully
        "policy_compliance": 0.999,   # 99.9% policy compliance
    },
    measurement_method="automated_eval_suite_v3",
    valid_for_days=90,
    supersedes="1.1.0",
)

declaration.publish()

This serves two purposes. First, it creates accountability — you've formally stated what your agent guarantees, and you're tracking against it. Second, it enables automated degradation detection — if current metrics fall below declared accuracy, alert immediately.
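The degradation check itself is a direct comparison between declared and observed metrics. A sketch; the function name and `tolerance` parameter are illustrative, not toolkit API:

```python
def detect_degradation(declared: dict[str, float],
                       observed: dict[str, float],
                       tolerance: float = 0.0) -> list[str]:
    """Return the metrics where observed accuracy fell below what was declared."""
    return [
        metric for metric, target in declared.items()
        if observed.get(metric, 0.0) < target - tolerance
    ]

declared = {"factual_retrieval": 0.94, "task_completion": 0.97, "policy_compliance": 0.999}
observed = {"factual_retrieval": 0.91, "task_completion": 0.97, "policy_compliance": 0.999}
print(detect_degradation(declared, observed))  # ['factual_retrieval']
```

In practice you would run this on every metrics window and page when the list is non-empty; a small nonzero `tolerance` keeps normal measurement noise from paging anyone.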


The SRE Mindset for Agent Teams

Traditional SRE asks: "How do we keep this service running?"

Agent SRE asks: "How do we keep this agent correct?"

Running and correct are different things for AI systems. An agent can be running — responding to every request, returning outputs — while being completely wrong. That's the failure mode traditional SRE doesn't catch.

The principles are the same: define what good looks like (SLOs), measure continuously (metrics), respond automatically to degradation (circuit breakers), and test your failure modes before they happen (chaos testing). The implementation is different because the failure modes are different.

If you're deploying agents in production and you don't have SLOs defined for them, you don't actually know if they're working. You just know they're running.


Getting Started

pip install agent-governance-toolkit[full]

The interactive Colab notebook I built for this repo walks through SLOs, circuit breakers, and chaos testing with live code:

👉 github.com/microsoft/agent-governance-toolkit


I'm Kanish Tyagi — MS Data Science student at UT Arlington, open source contributor to Microsoft's agent-governance-toolkit. Find me on GitHub and LinkedIn.
