Mukunda Rao Katta

Posted on May 25

Setting Up Agent Observability in 30 Minutes

#hermeschallenge #ai #python #agents

Production agents fail silently. The user says "it didn't work" and you have nothing to look at. No tool call log. No decision trace. No cost record. Just a black box.

This is a 30-minute setup that gives you four things: a snapshot of every tool call, a cost and latency record per run, a live event stream for external monitoring, and a decision log explaining why the agent made each choice.

Each piece is one pip install. They compose around a standard agent loop.

The Four Layers

agentsnap captures what tools were called, with what args, and what they returned.
agenttrace records cost, latency, and input/output tokens per run.
agent-event-bus publishes events so external systems can subscribe without polling.
agent-decision-log records each decision the agent makes at each step and, when the model exposes it, the reasoning behind that decision.

You don't need all four. Start with one and add the rest when you need them.

Layer 1: agentsnap (10 minutes)

agentsnap wraps your tool calls. Every invocation is recorded with the function name, args, return value, and latency.

from agentsnap import Snap, snapshot

snap = Snap(store_path="~/.myagent/snaps.jsonl")

# Decorate your tool functions
@snapshot(snap)
def search_web(query: str) -> str:
    # your actual implementation
    return fetch_search_results(query)

@snapshot(snap)
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

Now call them normally inside your agent loop. Every call writes a structured record:

{
  "ts": "2026-05-24T14:32:01Z",
  "fn": "search_web",
  "args": {"query": "LLM cost per token 2026"},
  "result_preview": "Results from DuckDuckGo: ...",
  "latency_ms": 312,
  "ok": true
}

To review the last run:

from agentsnap import Snap
snap = Snap(store_path="~/.myagent/snaps.jsonl")
for record in snap.load(last_n=50):
    status = "OK" if record["ok"] else "FAIL"
    print(f"[{status}] {record['fn']}({record['args']}) -> {record['latency_ms']}ms")

When a user says "it returned wrong results", you open the snap log and see exactly what the tool was called with and what it returned. The guesswork is gone.
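Because the snap log is plain JSONL, you can also triage it with the standard library alone. The following is a minimal sketch that assumes each line has the `fn`, `latency_ms`, and `ok` fields shown in the record above. The helper names `parse_jsonl` and `summarize` and the `slow_ms` threshold are hypothetical and not part of agentsnap.

```python
import json
from collections import Counter
from typing import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records: list[dict], slow_ms: int = 1000) -> dict:
    """Count failed calls per tool and list calls at or above slow_ms."""
    failures = Counter(r["fn"] for r in records if not r["ok"])
    slow_calls = [
        (r["fn"], r["latency_ms"]) for r in records if r["latency_ms"] >= slow_ms
    ]
    return {"failures": dict(failures), "slow_calls": slow_calls}
```

An open file is an iterable of lines, so `summarize(parse_jsonl(open(path)))` works on the log directly. Expand `~` yourself with `os.path.expanduser` before opening the path.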

Layer 2: agenttrace (10 minutes)

agenttrace tracks the full run: how many tokens, how much it cost, how long it took.

from agenttrace import Tracer
import anthropic

client = anthropic.Anthropic()
tracer = Tracer(store_path="~/.myagent/traces.jsonl")

# tools, handle_tools, and extract_text are your own definitions (not shown here)
def run_agent(user_input: str, session_id: str):
    messages = [{"role": "user", "content": user_input}]

    with tracer.trace(tags={"session_id": session_id}) as span:
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                tools=tools,
                messages=messages,
            )

            span.record(
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
            )

            # Stop on anything other than a tool request (end_turn, max_tokens, ...)
            if response.stop_reason != "tool_use":
                break

            # handle tool calls, update messages
            tool_results = handle_tools(response)
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    # span closes here and writes the full record
    return extract_text(response)

The trace() context manager accumulates token totals across every LLM call in the run. When the context closes, it writes one record with the totals and the wall-clock time for the full run.
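To make the accumulate-then-write behavior concrete, here is a minimal standard-library sketch of a context manager that works this way. It is not agenttrace's implementation: `SpanTotals`, the `sink` list, and the field names are all assumptions chosen for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class SpanTotals:
    """Running totals for one agent run (hypothetical shape)."""
    tags: dict = field(default_factory=dict)
    input_tokens: int = 0
    output_tokens: int = 0
    llm_calls: int = 0
    duration_s: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Called once per LLM call; adds to the totals for the whole run.
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.llm_calls += 1


@contextmanager
def trace(tags: dict, sink: list):
    span = SpanTotals(tags=tags)
    start = time.perf_counter()
    try:
        yield span
    finally:
        # Runs on normal exit, early return, and exceptions,
        # so failed runs still produce a record.
        span.duration_s = time.perf_counter() - start
        sink.append(span)
```

Writing the record in `finally` is the important design choice here: a run that raises halfway through is usually the one you most want a trace for.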

# After a few runs, check the summary
from agenttrace import Tracer
tracer = Tracer(store_path="~/.myagent/traces.jsonl")
report = tracer.report()
print(report)
# Runs: 47
# Total cost: $0.82
# Mean latency: 4.2s
# P95 latency: 11.1s
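If you want to check numbers like these yourself, the arithmetic is short. This is a sketch that assumes each run record carries `cost_usd` and `latency_s` keys (hypothetical field names, since the trace file format is not shown here) and uses the nearest-rank definition of a percentile. agenttrace may compute P95 differently.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate per-run records into run count, total cost, mean and P95 latency."""
    latencies = [r["latency_s"] for r in runs]
    return {
        "runs": len(runs),
        "total_cost_usd": round(sum(r["cost_usd"] for r in runs), 4),
        "mean_latency_s": mean(latencies),
        "p95_latency_s": percentile(latencies, 95),
    }
```

A P95 far above the mean, as in the report above, usually points to a small number of runs with many tool-call round trips rather than uniformly slow calls.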

Layer 3: agent-event-bus (5 minutes)

agent-event-bus is an in-process pub/sub bus. You emit events from inside your agent loop, and subscribers registered elsewhere in the same process receive them.

Subscribers can then forward those events to Prometheus, Grafana, Slack alerts, or any other monitoring system, so the agent loop never depends on those systems directly.

from agent_event_bus import EventBus

bus = EventBus()

# Subscribe external systems
@bus.subscribe("agent.tool_called")
def on_tool_call(event):
    metrics.increment("tool_calls_total", tags={"fn": event["fn"]})

@bus.subscribe("agent.run_complete")
def on_run_complete(event):
    metrics.histogram("run_cost_usd", event["cost_usd"])
    if event["cost_usd"] > 0.50:
        slack_alert(f"Expensive run: ${event['cost_usd']:.3f} for session {event['session_id']}")

# Emit from inside your agent loop
def run_agent(user_input: str, session_id: str):
    # ...
    bus.emit("agent.tool_called", {"fn": tool_name, "args": tool_args})
    # ...
    bus.emit("agent.run_complete", {
        "session_id": session_id,
        "cost_usd": total_cost,
        "latency_ms": elapsed_ms,
    })

The bus is synchronous by default. If a subscriber is slow, it blocks the agent. Use EventBus(async_dispatch=True) for non-blocking dispatch where subscribers run in background threads.
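The difference between the two modes is easier to see in code. Below is a minimal standard-library sketch of a bus with both dispatch styles. It is not agent-event-bus's implementation: `TinyBus`, `flush`, and the single worker thread are assumptions made for illustration, and only the `async_dispatch` flag name comes from the library as described above.

```python
import queue
import threading
from collections import defaultdict
from typing import Callable


class TinyBus:
    """Illustrative in-process pub/sub with optional background dispatch."""

    def __init__(self, async_dispatch: bool = False):
        self._subs: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._async = async_dispatch
        self._queue: queue.Queue = queue.Queue()
        if async_dispatch:
            # Daemon thread: it will not keep the process alive at exit.
            threading.Thread(target=self._worker, daemon=True).start()

    def subscribe(self, topic: str):
        def register(handler):
            self._subs[topic].append(handler)
            return handler
        return register

    def emit(self, topic: str, event: dict) -> None:
        if self._async:
            self._queue.put((topic, event))  # returns immediately
        else:
            self._deliver(topic, event)      # blocks until every handler returns

    def flush(self) -> None:
        """Block until all queued events have been delivered."""
        self._queue.join()

    def _deliver(self, topic: str, event: dict) -> None:
        for handler in list(self._subs.get(topic, ())):
            try:
                handler(event)
            except Exception as exc:
                # A failing subscriber should not take the agent down with it.
                print(f"subscriber {handler.__name__} failed: {exc!r}")

    def _worker(self) -> None:
        while True:
            topic, event = self._queue.get()
            try:
                self._deliver(topic, event)
            finally:
                self._queue.task_done()
```

With background dispatch, events still queued when the process exits are lost unless you drain the queue first, which is what `flush` is for in this sketch. Check how the real library handles shutdown before relying on it for alerts.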

The key benefit is decoupling. Your agent code emits events. Your monitoring code subscribes. Neither knows about the other directly. You can add or remove subscribers without touching the agent loop.

Layer 4: agent-decision-log (5 minutes)

The hardest part of agent debugging is understanding why the agent made a choice. Why did it call search_web instead of using information already in context? Why did it generate that answer when the tool returned something different?

agent-decision-log adds a WHY layer to your traces.

from agent_decision_log import DecisionLog

dlog = DecisionLog(store_path="~/.myagent/decisions.jsonl")

def run_agent(user_input: str, session_id: str):
    messages = [{"role": "user", "content": user_input}]

    step = 0
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "tool_use":
            for block in response.content:
                if block.type == "tool_use":
                    dlog.record(
                        session_id=session_id,
                        step=step,
                        decision="tool_call",
                        tool=block.name,
                        args=block.input,
                        # log loop state alongside the call; this does not capture the model's own reasoning
                        context={"message_count": len(messages)},
                    )
        else:
            dlog.record(
                session_id=session_id,
                step=step,
                decision="final_answer",
                context={"stop_reason": response.stop_reason},
            )
            break

        step += 1
        # handle tool calls and append the results to messages (as in Layer 2)

You can also log the model's own reasoning. If you're using extended thinking, capture the thinking block:

for block in response.content:
    if block.type == "thinking":
        dlog.record(
            session_id=session_id,
            step=step,
            decision="thinking",
            reasoning=block.thinking,
        )

The decision log turns a black box into a sequence of annotated steps. When a user asks "why did it do that?", you load the decision log for that session and walk through the steps.
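Walking through a session can be done with the standard library. This sketch assumes the log is JSONL and that each line stores the keyword arguments passed to `dlog.record` (`session_id`, `step`, `decision`, and so on) as top-level keys. That file layout is an assumption, and `load_session` and `format_step` are hypothetical helpers, not part of agent-decision-log.

```python
import json
from typing import Iterable


def load_session(lines: Iterable[str], session_id: str) -> list[dict]:
    """Return one session's records in step order."""
    records = [json.loads(line) for line in lines if line.strip()]
    steps = [r for r in records if r.get("session_id") == session_id]
    # sorted() is stable, so records within the same step keep their log order.
    return sorted(steps, key=lambda r: r.get("step", 0))


def format_step(record: dict) -> str:
    """Render one record as a single readable line."""
    if record["decision"] == "tool_call":
        detail = f"{record['tool']}({record.get('args', {})})"
    elif record["decision"] == "thinking":
        detail = record.get("reasoning", "")[:80]  # truncate long reasoning
    else:
        detail = str(record.get("context", ""))
    return f"step {record['step']}: {record['decision']} {detail}".rstrip()
```

Pass an open file as `lines`, then print `format_step(r)` for each record to get a step-by-step transcript of the session.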

Composing All Four

Here is the full setup wired together:

from agentsnap import Snap, snapshot
from agenttrace import Tracer
from agent_event_bus import EventBus
from agent_decision_log import DecisionLog
import anthropic
import uuid

client = anthropic.Anthropic()
snap = Snap(store_path="~/.myagent/snaps.jsonl")
tracer = Tracer(store_path="~/.myagent/traces.jsonl")
bus = EventBus()
dlog = DecisionLog(store_path="~/.myagent/decisions.jsonl")

@snapshot(snap)
def search_web(query: str) -> str:
    return fetch_search_results(query)

tools = [
    {"name": "search_web", "description": "...", "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}},
]

def run_agent(user_input: str) -> str:
    session_id = str(uuid.uuid4())
    messages = [{"role": "user", "content": user_input}]
    step = 0
    total_cost = 0.0

    with tracer.trace(tags={"session_id": session_id}) as span:
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                tools=tools,
                messages=messages,
            )

            # compute_cost: your own helper that prices a usage object in USD
            cost = compute_cost(response.usage)
            total_cost += cost
            span.record(
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                cost_usd=cost,
            )

            if response.stop_reason == "tool_use":
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        dlog.record(session_id=session_id, step=step, decision="tool_call", tool=block.name, args=block.input)
                        bus.emit("agent.tool_called", {"session_id": session_id, "fn": block.name})
                        # Only search_web is registered here; dispatch on block.name once you add more tools
                        result = search_web(**block.input)
                        tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})

                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})
                step += 1

            else:
                dlog.record(session_id=session_id, step=step, decision="final_answer")
                bus.emit("agent.run_complete", {"session_id": session_id, "cost_usd": total_cost})
                return next((b.text for b in response.content if hasattr(b, "text")), "")
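The loop above calls `compute_cost`, which you supply yourself. One possible shape is sketched below. The prices are placeholders, not real rates, so substitute the figures from your provider's pricing page. The sketch also ignores any prompt-caching token counts the usage object may carry.

```python
from types import SimpleNamespace

# PLACEHOLDER prices in USD per million tokens. These are NOT real rates.
PRICE_PER_MTOK = {"input": 1.00, "output": 5.00}


def compute_cost(usage, prices: dict = PRICE_PER_MTOK) -> float:
    """Price one response from its input_tokens and output_tokens counts."""
    input_cost = usage.input_tokens * prices["input"] / 1_000_000
    output_cost = usage.output_tokens * prices["output"] / 1_000_000
    return input_cost + output_cost


# Stand-in for response.usage so the helper can be tried without an API call.
example_usage = SimpleNamespace(input_tokens=1_000_000, output_tokens=200_000)
example_cost = compute_cost(example_usage)
```

Keeping prices in a dict keyed by token type makes it straightforward to extend this to one table per model if your agent uses more than one.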

What This Does NOT Do

This setup is local-file-based by default. It does not push to Datadog, Honeycomb, or any SaaS observability platform out of the box. You wire that up in your event bus subscribers.

agent-event-bus is in-process only. Events do not survive process restarts. If you need durable event delivery across services, use a real queue (Redis, RabbitMQ, SQS) and emit to it from your subscribers.

The decision log captures what the agent did, not a ground-truth explanation of why. If you want the model's own reasoning, you need extended thinking enabled.

When This Applies

This setup is for any agent you're running in production where "it's not working" is not enough information to debug. It adds 5-10 lines of code per layer and produces structured files you can inspect without a running service.

Quick Start

pip install agentsnap agenttrace agent-event-bus agent-decision-log

Related Libraries

| Library | What It Does | Language |
| --- | --- | --- |
| `agentsnap` | Capture tool call args and results per invocation | Python |
| `agenttrace` | Cost + latency tracing per agent run with tags | Python |
| `agent-event-bus` | In-process pub/sub for agent events | Python |
| `agent-decision-log` | WHY-layer decision log per step | Python |
| `agenttap` | Wire-level prompt introspection | Python |
| `agent-replay-trace` | Step-through replay of JSONL agent traces | Python |

What's Next

Once you have these four layers running, the next step is replay. agent-replay-trace lets you load a JSONL trace and step through it interactively. You can see the exact input, the tool call, and the result at each step. Useful for reproducing bugs and writing regression tests.

For anomaly detection over your trace data, driftvane can flag when cost or latency distribution shifts outside normal bounds across runs.
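To give a feel for what a shift "outside normal bounds" can mean, here is a deliberately simple sketch: compare the mean of recent runs against a baseline, measured in baseline standard deviations. This is not driftvane's method, which is not described here. The function name and the default threshold of 3.0 are assumptions.

```python
from statistics import mean, stdev


def drifted(baseline: list[float], recent: list[float], threshold: float = 3.0) -> bool:
    """True when the recent mean is more than `threshold` baseline
    standard deviations away from the baseline mean."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least 2 baseline values and 1 recent value")
    spread = stdev(baseline)
    if spread == 0:
        # Constant baseline: any change in the mean counts as drift.
        return mean(recent) != mean(baseline)
    return abs(mean(recent) - mean(baseline)) / spread > threshold
```

Feed it per-run latencies or costs from your trace file. A mean-based check like this misses changes in the tail, so treat it as a starting point rather than a replacement for a dedicated tool.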
