Mukunda Rao Katta

Posted on May 25

Monitor Your Agent's Health in Production

#hermeschallenge #ai #python #agents

What Does "Agent Health" Mean?

A web server is healthy if it returns 200s in under 200ms. That definition does not work for agents.

An agent can return a response and still be unhealthy. It might be stuck in a loop (calling the same tool 20 times). It might be consuming 10x the expected tokens because the context window grew unchecked. It might be sitting behind a circuit breaker that has been open for six hours because of a provider outage nobody noticed.

Agent health is not just "is it up." It is:

Is it cost-efficient right now?
Is it hitting the cache?
Is it in a loop?
Has the circuit breaker opened?
Is it approaching the budget limit?
What kinds of tool errors is it seeing?

This post shows how to collect all six of these signals using libraries from the agent stack and emit them to a monitoring system.

The Six Health Signals

Signal 1: Cost per run (agenttrace). Baseline what a normal run costs. Alert when a run costs 3x the baseline. That is the first sign of a prompt or context problem.
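A rolling baseline like the one described for Signal 1 could be sketched as follows. `CostBaseline` and its parameters are hypothetical helpers written for this illustration, not part of agenttrace; the 3x multiplier comes from the text, while the window and minimum sample count are assumptions.

```python
from collections import deque
from statistics import mean


class CostBaseline:
    """Rolling average of recent run costs (hypothetical helper, not an agenttrace API)."""

    def __init__(self, window: int = 100, multiplier: float = 3.0, min_samples: int = 10):
        self.costs = deque(maxlen=window)
        self.multiplier = multiplier
        self.min_samples = min_samples

    def is_anomalous(self, cost_usd: float) -> bool:
        """Return True if this run costs more than multiplier x the rolling average."""
        # Only judge a run once there is enough history to form a baseline.
        anomalous = (
            len(self.costs) >= self.min_samples
            and cost_usd > self.multiplier * mean(self.costs)
        )
        # Keep anomalous runs out of the baseline so one spike does not raise it.
        if not anomalous:
            self.costs.append(cost_usd)
        return anomalous


baseline = CostBaseline(window=50, multiplier=3.0, min_samples=5)
for cost in [0.010, 0.012, 0.011, 0.009, 0.010]:
    baseline.is_anomalous(cost)

print(baseline.is_anomalous(0.011))  # a normal run
print(baseline.is_anomalous(0.040))  # more than 3x the rolling average
```

Excluding anomalous runs from the window is a design choice: it keeps a burst of expensive runs from quietly becoming the new normal.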

Signal 2: Cache hit ratio (cachebench). If prompt caching is enabled and the hit ratio drops below 60%, something changed in the prompt that breaks the cached prefix. That is worth an alert.
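How cachebench computes its hit ratio is not shown in this post. One plausible definition, sketched here under the assumption that `input_tokens` counts only the uncached part of the prompt, is the share of prompt tokens that were read from the cache:

```python
def cache_hit_ratio(cache_read_tokens: int, cache_write_tokens: int, input_tokens: int) -> float:
    """Share of prompt tokens served from the cache.

    Assumption: input_tokens excludes cached tokens, so the full prompt is
    the sum of all three fields.
    """
    total = cache_read_tokens + cache_write_tokens + input_tokens
    return cache_read_tokens / total if total else 0.0


# Stable prefix: most of the prompt is read from the cache.
print(cache_hit_ratio(cache_read_tokens=9000, cache_write_tokens=0, input_tokens=1000))

# Prefix changed: the same tokens are written to the cache again instead of read.
print(cache_hit_ratio(cache_read_tokens=0, cache_write_tokens=9000, input_tokens=1000))
```

The second call is the pattern to watch for: a ratio that falls toward zero while cache writes jump usually means something near the top of the prompt changed.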

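Signal 3: Circuit breaker state (llm-circuit-breaker-py). An open circuit means recent calls kept failing, so new calls are rejected before they reach the provider. A half-open circuit means the breaker is letting a few trial calls through to test whether the provider has recovered. Expose the breaker state as a health metric.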
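To make the three states concrete, here is a minimal state machine sketch. It is a simplified stand-in written for this post, not the llm-circuit-breaker-py implementation: `SketchBreaker`, `State`, and the injectable `clock` are all hypothetical, and it closes after a single successful trial call.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SketchBreaker:
    """Simplified circuit breaker (illustrative only)."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> State:
        if self.opened_at is None:
            return State.CLOSED
        # After the recovery timeout, allow trial calls through.
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return State.HALF_OPEN
        return State.OPEN

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the circuit.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


now = [0.0]  # fake clock so the example runs instantly
breaker = SketchBreaker(failure_threshold=5, recovery_timeout=60.0, clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.state)  # State.OPEN

now[0] = 61.0
print(breaker.state)  # State.HALF_OPEN

breaker.record_success()
print(breaker.state)  # State.CLOSED
```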

Signal 4: Budget utilization (token-budget-py). Track reserved tokens versus committed tokens. If reservations regularly exceed committed tokens by more than 50%, your estimates are too generous. If committed tokens approach the session cap, the agent is running long.
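The two checks in Signal 4 are simple ratios. A sketch with plain functions follows; the function names are hypothetical and independent of the token-budget-py API.

```python
def estimate_overshoot(reserved: int, committed: int) -> float:
    """How far the reservation exceeded actual usage, as a fraction of actual usage."""
    if committed <= 0:
        raise ValueError("committed must be positive")
    return (reserved - committed) / committed


def session_utilization(total_committed: int, max_tokens: int) -> float:
    """Fraction of the session cap already consumed."""
    return total_committed / max_tokens


# Reserved 1,800 tokens but only used 1,000: the estimate was 80% too high.
print(estimate_overshoot(reserved=1800, committed=1000))

# 85,000 of a 100,000-token cap is committed: the run is getting long.
print(session_utilization(total_committed=85_000, max_tokens=100_000))
```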

Signal 5: Loop detection frequency (tool-loop-guard). Every time tool-loop-guard fires, the agent has repeated the same tool call with the same args more often than the guard allows. Log the frequency. More than two or three triggers per run points to a behavior problem, not a one-off.
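The idea behind repeat detection can be sketched with a sliding window over recent calls. `RepeatDetector` below is a hypothetical illustration, not the tool-loop-guard implementation: it returns a boolean and counts triggers, and it borrows the `max_repeats` and `window_turns` parameter names from the configuration used later in this post.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags a tool call repeated too often within a window of turns (illustrative)."""

    def __init__(self, max_repeats: int = 3, window_turns: int = 10):
        self.max_repeats = max_repeats
        self.window_turns = window_turns
        self.calls = deque()  # (turn, fingerprint) pairs, oldest first
        self.triggers = 0     # how many times the detector fired in this run

    def check(self, tool_name: str, args: dict, turn: int) -> bool:
        # Serialize args with sorted keys so key order does not matter.
        fingerprint = (tool_name, json.dumps(args, sort_keys=True, default=str))
        # Drop calls that have fallen out of the window.
        while self.calls and self.calls[0][0] <= turn - self.window_turns:
            self.calls.popleft()
        self.calls.append((turn, fingerprint))
        repeats = sum(1 for _, fp in self.calls if fp == fingerprint)
        if repeats > self.max_repeats:
            self.triggers += 1
            return True
        return False


detector = RepeatDetector(max_repeats=3, window_turns=10)
for turn in range(1, 6):
    looping = detector.check("search", {"q": "python packaging"}, turn)
    print(turn, looping)

print("triggers this run:", detector.triggers)
```

The `triggers` counter is the value worth emitting as a health metric at the end of each run.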

Signal 6: Tool error rate by kind (tool-error-classify). Group tool errors by ErrorKind (TIMEOUT, INVALID_ARGS, PERMISSION_DENIED, RATE_LIMITED, etc.). A sudden spike in RATE_LIMITED errors means your upstream is throttling. TIMEOUT spikes mean a dependency is slow.
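Grouping by kind and spotting a spike can be sketched with `collections.Counter`. The `ErrorKind` enum here is a local stand-in with assumed values, not the one exported by tool-error-classify, and the spike rule (at least 3x the previous window, with a minimum count) is an arbitrary example.

```python
from collections import Counter
from enum import Enum


class ErrorKind(Enum):  # local stand-in; the real enum's values may differ
    TIMEOUT = "timeout"
    INVALID_ARGS = "invalid_args"
    PERMISSION_DENIED = "permission_denied"
    RATE_LIMITED = "rate_limited"


def spiking_kinds(current: Counter, previous: Counter, factor: float = 3.0, min_count: int = 5) -> list[str]:
    """Error kinds whose count grew by at least `factor` versus the previous window."""
    return sorted(
        kind.value
        for kind, count in current.items()
        # min_count filters out noise; max(..., 1) avoids comparing against zero.
        if count >= min_count and count >= factor * max(previous.get(kind, 0), 1)
    )


previous_window = Counter({ErrorKind.TIMEOUT: 2, ErrorKind.RATE_LIMITED: 1})
current_window = Counter(
    {ErrorKind.TIMEOUT: 3, ErrorKind.RATE_LIMITED: 12, ErrorKind.INVALID_ARGS: 1}
)

print(spiking_kinds(current_window, previous_window))
```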

Main Code Example

import asyncio
import time
from agenttrace import Tracer, RunRecord
from cachebench import CacheBench
from llm_circuit_breaker_py import CircuitBreaker, CircuitState
from token_budget_py import TokenBudget
from tool_loop_guard import LoopGuard
from tool_error_classify import classify_error, ErrorKind
from agent_event_bus import EventBus, Event

# All six health signal sources
tracer = Tracer(tag="prod-agent")
bench = CacheBench(session_path="./health/cache.jsonl")
breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=60,
    half_open_calls=2,
)
budget = TokenBudget(max_tokens=100_000)
loop_guard = LoopGuard(
    max_repeats=3,
    window_turns=10,
)

# Event bus: downstream consumers subscribe to health events
bus = EventBus()

# External monitoring webhook (Datadog, PagerDuty, etc.)
MONITORING_WEBHOOK = "https://your-monitoring.example.com/events"


async def emit_health_event(metric: str, value: float, tags: dict):
    """Emit a health event to the bus and to external monitoring."""
    event = Event(metric=metric, value=value, tags=tags, ts=time.time())
    await bus.emit("health", event)

    # In production: push to your monitoring system with an async HTTP client, e.g.
    # async with httpx.AsyncClient() as client:
    #     await client.post(MONITORING_WEBHOOK, json=event.to_dict())
    print(f"[health] {metric}={value:.4f} tags={tags}")


async def observed_llm_call(
    run_id: str,
    messages: list[dict],
    model: str,
) -> object:
    """LLM call with circuit breaker and cache tracking."""
    if breaker.state == CircuitState.OPEN:
        # Emit alert: circuit is open
        await emit_health_event(
            metric="circuit_breaker_open",
            value=1.0,
            tags={"model": model, "run_id": run_id},
        )
        raise RuntimeError("Circuit breaker is open. Calls are paused.")

    with bench.session(feature="agent", model=model) as cache_session:
        # your_llm_client is a placeholder for your own async LLM call.
        # If it raises, the exception propagates and the breaker counts the failure.
        response = await breaker.call(your_llm_client, messages)
        cache_session.record(
            cache_read_tokens=getattr(response.usage, "cache_read_input_tokens", 0),
            cache_write_tokens=getattr(response.usage, "cache_creation_input_tokens", 0),
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
        )
        return response


async def observed_tool_call(
    run_id: str,
    turn: int,
    tool_name: str,
    args: dict,
) -> object:
    """Tool call with loop detection and error classification."""
    # Loop guard: raises if same tool+args called too many times recently
    try:
        loop_guard.check(tool_name=tool_name, args=args, turn=turn)
    except Exception:
        # Record the trigger, then re-raise so the caller decides how to recover
        await emit_health_event(
            metric="loop_detection_triggered",
            value=1.0,
            tags={"tool": tool_name, "run_id": run_id},
        )
        raise

    try:
        # dispatch_tool is a placeholder for your own async tool dispatcher
        return await dispatch_tool(tool_name, args)

    except Exception as raw_err:
        kind: ErrorKind = classify_error(raw_err)
        await emit_health_event(
            metric="tool_error",
            value=1.0,
            tags={
                "tool": tool_name,
                "error_kind": kind.value,
                "run_id": run_id,
            },
        )
        raise


async def run_agent(task: str, model: str = "claude-sonnet-4-6") -> str:
    run_id = tracer.start_run(metadata={"task": task, "model": model})
    messages = [{"role": "user", "content": task}]
    turn = 0
    # Running token totals, so the cost metric covers every LLM call in the run
    total_input_tokens = 0
    total_output_tokens = 0

    try:
        for _ in range(20):
            turn += 1

            # Check budget utilization before each call
            utilization = budget.committed / budget.max_tokens
            if utilization > 0.8:
                await emit_health_event(
                    metric="budget_utilization",
                    value=utilization,
                    tags={"run_id": run_id, "model": model},
                )

            response = await observed_llm_call(run_id, messages, model)
            total_input_tokens += response.usage.input_tokens
            total_output_tokens += response.usage.output_tokens
            messages.append({"role": "assistant", "content": response.content})

            if response.tool_calls:
                for tc in response.tool_calls:
                    estimated_tokens = int(len(str(tc.args)) * 1.3) + 200
                    budget.reserve(estimated_tokens)

                    raw = await observed_tool_call(run_id, turn, tc.name, tc.args)
                    # The estimate is committed as-is here; substitute a measured
                    # token count if your tool layer reports one.
                    budget.commit(estimated_tokens, estimated_tokens)

                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": str(raw),
                    })
            else:
                break

        return response.content

    finally:
        # Report both input and output tokens so cost_per_run is not undercounted
        run_record: RunRecord = tracer.end_run(
            run_id,
            input_tokens=total_input_tokens,
            output_tokens=total_output_tokens,
            model=model,
        )

        # Emit final health metrics for this run
        await emit_health_event(
            metric="cost_per_run",
            value=run_record.cost_usd,
            tags={"model": model},
        )

        cache_report = bench.report()
        for feature, metrics in cache_report.items():
            await emit_health_event(
                metric="cache_hit_ratio",
                value=metrics["hit_ratio"],
                tags={"feature": feature, "model": model},
            )

        # Breaker state as a numeric: 0=closed, 0.5=half-open, 1=open
        state_value = {
            CircuitState.CLOSED: 0.0,
            CircuitState.HALF_OPEN: 0.5,
            CircuitState.OPEN: 1.0,
        }[breaker.state]
        await emit_health_event(
            metric="circuit_breaker_state",
            value=state_value,
            tags={"model": model},
        )


async def main():
    await run_agent("Research the latest Python packaging standards.")


if __name__ == "__main__":
    asyncio.run(main())

Every health event goes to the EventBus. Subscribers get structured events with metric name, value, tags, and timestamp. You can subscribe in process or forward to an external webhook.
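The fan-out pattern itself is small. The sketch below uses a hypothetical `TinyBus` with the same `subscribe` and `emit` shape as the calls in this post; it is an illustrative stand-in, not the agent_event_bus implementation, and events are plain dicts rather than `Event` objects.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


class TinyBus:
    """Minimal in-process pub/sub. Illustrative stand-in, not the agent_event_bus API."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    async def emit(self, topic: str, event: dict) -> None:
        # Fan out to all subscribers concurrently. return_exceptions=True keeps
        # one failing subscriber from breaking the others or the agent loop.
        await asyncio.gather(
            *(handler(event) for handler in self._handlers[topic]),
            return_exceptions=True,
        )


log_lines: list[str] = []
alerts: list[dict] = []


async def write_log(event: dict) -> None:
    log_lines.append(f"{event['metric']}={event['value']}")


async def alert_on_open_breaker(event: dict) -> None:
    if event["metric"] == "circuit_breaker_state" and event["value"] >= 1.0:
        alerts.append(event)


async def demo() -> None:
    bus = TinyBus()
    bus.subscribe("health", write_log)
    bus.subscribe("health", alert_on_open_breaker)
    await bus.emit("health", {"metric": "cost_per_run", "value": 0.012, "tags": {}})
    await bus.emit("health", {"metric": "circuit_breaker_state", "value": 1.0, "tags": {}})


asyncio.run(demo())
print(log_lines)
print(len(alerts))
```

Swallowing subscriber exceptions is a deliberate trade-off in this sketch: a broken alerting hook should not take the agent down, but in practice you would want to log those failures somewhere.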

What This Does NOT Do

This does not auto-remediate. The breaker does recover on its own: it transitions to half-open automatically after recovery_timeout seconds, and closes again once the trial calls succeed. But the root cause (provider issue, bad prompt) is yours to fix.

It does not aggregate across agent processes. Each process has its own circuit breaker state and loop guard state. If you run 20 workers, worker 1's breaker opening does not affect worker 2's breaker. For fleet-level health, push events to a shared store.
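Once events from every worker land in a shared store, a fleet-level view is a small aggregation. The sketch below assumes each event carries a `worker_id` tag, which the code in this post does not add, and that events are available as plain dicts.

```python
def open_breaker_fraction(events: list[dict]) -> float:
    """Fraction of workers whose most recent circuit_breaker_state event is OPEN (1.0)."""
    latest: dict[str, dict] = {}
    for event in events:
        if event["metric"] != "circuit_breaker_state":
            continue
        worker = event["tags"]["worker_id"]  # assumed tag, not emitted by the code above
        # Keep only the newest state event per worker.
        if worker not in latest or event["ts"] > latest[worker]["ts"]:
            latest[worker] = event
    if not latest:
        return 0.0
    return sum(1 for e in latest.values() if e["value"] >= 1.0) / len(latest)


events = [
    {"metric": "circuit_breaker_state", "value": 1.0, "ts": 10, "tags": {"worker_id": "w1"}},
    {"metric": "circuit_breaker_state", "value": 0.0, "ts": 20, "tags": {"worker_id": "w1"}},
    {"metric": "circuit_breaker_state", "value": 1.0, "ts": 15, "tags": {"worker_id": "w2"}},
    {"metric": "circuit_breaker_state", "value": 0.0, "ts": 12, "tags": {"worker_id": "w3"}},
    {"metric": "circuit_breaker_state", "value": 0.5, "ts": 18, "tags": {"worker_id": "w4"}},
    {"metric": "cost_per_run", "value": 0.01, "ts": 19, "tags": {"worker_id": "w4"}},
]

print(open_breaker_fraction(events))  # one of four workers is currently open
```

One open breaker out of 20 workers is probably a local problem. Most of the fleet open at once points at the provider.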

It does not set alert thresholds for you. You decide what "cost_per_run > 0.05 USD" means for your system. The libraries emit the values. You configure the thresholds in your monitoring tool.
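If you want the thresholds in code rather than in a monitoring tool, a table of rules is enough to start. The numbers below are examples only, and `THRESHOLDS` and `breaches` are hypothetical names for this sketch.

```python
import operator

# Example thresholds only; derive real values from your own baselines.
THRESHOLDS = {
    "cost_per_run": (operator.gt, 0.05),           # USD per run
    "cache_hit_ratio": (operator.lt, 0.60),
    "budget_utilization": (operator.gt, 0.80),
    "circuit_breaker_state": (operator.ge, 1.0),   # 1.0 = OPEN
}


def breaches(metric: str, value: float) -> bool:
    """Return True if the value violates the configured rule for this metric."""
    rule = THRESHOLDS.get(metric)
    if rule is None:
        return False  # no rule configured means no alert
    compare, limit = rule
    return compare(value, limit)


print(breaches("cost_per_run", 0.07))
print(breaches("cache_hit_ratio", 0.75))
```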

It does not track human feedback. If users rate agent outputs, that signal is not captured here. Combine these technical health signals with user feedback pipelines for a complete picture.

Design Reasoning

Six separate libraries instead of one health framework. Each signal has a different owner. The circuit breaker cares about failure counts. The loop guard cares about recent call patterns. The cache bench cares about token cache headers. Unifying them into one object couples unrelated concerns.

The EventBus is the integration point. Each library produces events. The bus fans them out to subscribers. You can add a subscriber that writes to Prometheus, another that POSTs to PagerDuty, another that writes to a local log file. The agent code does not change.

Health events at run-end AND inline. Cost per run is only meaningful after the run completes. But budget utilization and circuit breaker state are meaningful inline, before each call. Both emission points matter.

Numeric encoding for categorical state. Circuit breaker state is an enum, but monitoring systems prefer numbers. 0.0 = CLOSED, 0.5 = HALF_OPEN, 1.0 = OPEN makes it trivial to alert when the value exceeds 0.0.

When This Applies

Production agents that run without supervision. Background workers, scheduled jobs, multi-user services. Any agent that needs someone to know when it is struggling.

Teams that already have a monitoring stack (Datadog, Prometheus, Grafana, PagerDuty). This pattern feeds structured events from the agent into whatever you already use. You do not need a new dashboard tool.

High-volume agent deployments where individual run inspection is impractical. When you have 10,000 runs per day, you need aggregate metrics. The event stream aggregates naturally.

This does NOT fit agents you are actively developing and debugging locally. For local debugging, use the five observability layers from the previous post in this series. Health monitoring is for deployed production systems.

Quick-Start Snippet

pip install agenttrace cachebench llm-circuit-breaker-py token-budget-py tool-loop-guard tool-error-classify agent-event-bus

from llm_circuit_breaker_py import CircuitBreaker
from tool_loop_guard import LoopGuard
from agent_event_bus import EventBus

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)
loop_guard = LoopGuard(max_repeats=3, window_turns=10)
bus = EventBus()

# Wrap your LLM calls with breaker.call()
# Check loop_guard.check() before each tool dispatch
# Subscribe to bus.subscribe("health", handler) for external reporting

Siblings

Library	Health signal	What to alert on
`agenttrace`	Cost per run	Run cost > 3x rolling average
`cachebench`	Cache hit ratio	Hit ratio drops below 60%
`llm-circuit-breaker-py`	Breaker state	State transitions to OPEN
`token-budget-py`	Budget utilization	Utilization > 80% of session cap
`tool-loop-guard`	Loop trigger frequency	More than 2 triggers per run
`tool-error-classify`	Tool error kind	Spike in RATE_LIMITED or TIMEOUT
`agent-event-bus`	Event routing	Backbone for all health events

What's Next

The missing piece after health monitoring is health prediction. If cost per run has been increasing by 5% per day for the past week and your alert threshold sits about 16% above today's value, the trend will hit that threshold in roughly three days. Emitting current values is reactive. Building a forecast from the time series is proactive.
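The simplest forecast assumes the daily growth rate keeps compounding. Under that assumption, the time to reach a threshold is log(threshold / current) / log(1 + growth). The function name and the dollar figures below are illustrative.

```python
import math


def days_until_threshold(current: float, threshold: float, daily_growth: float) -> float:
    """Days until a value growing by daily_growth per day (compounding) reaches threshold."""
    if current >= threshold:
        return 0.0
    if daily_growth <= 0:
        return math.inf  # a flat or falling trend never reaches the threshold
    return math.log(threshold / current) / math.log(1 + daily_growth)


# Cost per run is $0.0432 today, rising 5% per day, and the alert fires at $0.05.
print(round(days_until_threshold(0.0432, 0.05, 0.05), 1))  # roughly 3 days
```

A constant growth rate is a strong assumption. Treat the result as an early warning to investigate, not as a precise date.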

For the circuit breaker specifically, consider tracking why it opens. The llm-circuit-breaker-py breaker knows the exception type that tripped it. Log that alongside the state change event. Over time, you learn whether most outages are timeouts, 429s, or 500s from the provider. That changes your remediation strategy.

For teams with SLAs, set the alert thresholds from SLA math first. If your SLA requires 95% of agent runs to complete under 30 seconds and under $0.01, your health alerts should fire when either metric drifts toward that boundary, not after it is already broken.
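One way to turn the SLA into an alert is to compute compliance over a recent window and warn while there is still headroom. The 2-point warning margin and the function names are assumptions for this sketch; the 95%, 30 second, and $0.01 figures come from the example above.

```python
def sla_compliance(runs: list[tuple[float, float]], max_seconds: float = 30.0, max_cost_usd: float = 0.01) -> float:
    """Fraction of runs that met both SLA limits. Each run is (seconds, cost_usd)."""
    if not runs:
        return 1.0
    ok = sum(1 for seconds, cost in runs if seconds < max_seconds and cost < max_cost_usd)
    return ok / len(runs)


def sla_status(compliance: float, target: float = 0.95, margin: float = 0.02) -> str:
    """Warn while compliance is still above the target but inside the safety margin."""
    if compliance < target:
        return "breached"
    if compliance < target + margin:
        return "warning"
    return "ok"


# 96 runs inside both limits, 4 runs that took too long.
recent_runs = [(12.0, 0.004)] * 96 + [(41.0, 0.004)] * 4
compliance = sla_compliance(recent_runs)
print(compliance, sla_status(compliance))
```

Here compliance is 96%, which still meets the 95% SLA but sits inside the margin, so the status is a warning rather than a breach.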

The six signals in this post cover the mechanical health of the agent loop. Pair them with product metrics (task completion rate, user satisfaction) for the full operational picture.
