Mukunda Rao Katta

Posted on May 25

Five Observability Layers Every Production Agent Needs

#hermeschallenge #ai #python #agents

"The Agent Did Something Wrong"

That sentence starts most production incidents. It is also almost useless for debugging.

Which wrong thing? At which step? With what input? Costing how much? Based on which source?

An agent is a loop. It calls tools. It reads results. It decides. It calls tools again. At any step, something can go sideways. Without instrumentation, you are guessing.

Observability for LLM agents is not a single concern. It is five different questions, each answered by a different layer:

CALLS: what tools were called, with what args, and what came back
COST: what did each run cost in tokens and dollars
WHY: why did the agent make each decision
WHERE: where did each fact in the final answer come from
WIRE: what exact JSON bytes went to and came back from the API

This post shows what each layer covers and how to compose all five.

Main Code Example

import asyncio
from agentsnap import Snap, SnapStore          # CALLS layer
from agenttrace import Tracer, RunRecord       # COST layer
from agent_decision_log import DecisionLog     # WHY layer
from agent_citation import CitationTracker     # WHERE layer
from agenttap import WireTap                   # WIRE layer

# Initialize all five layers
snap_store = SnapStore(path="./runs/snaps")       # CALLS: structured snapshots
tracer = Tracer(tag="research-agent")             # COST: per-run cost records
decision_log = DecisionLog(path="./runs/why")     # WHY: decision JSONL
citations = CitationTracker()                     # WHERE: fact provenance
wire_tap = WireTap(path="./runs/wire")            # WIRE: raw API payloads


async def observed_llm_call(
    messages: list[dict],
    run_id: str,
) -> object:
    """LLM call with all five observability layers active."""
    # WIRE: capture request before sending
    wire_tap.capture_request(run_id=run_id, payload=messages)

    response = await your_llm_client(messages)

    # WIRE: capture response immediately after
    wire_tap.capture_response(run_id=run_id, payload=response.raw)

    # COST: record token usage
    tracer.record(
        run_id=run_id,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        model=response.model,
    )

    return response


async def observed_tool_call(
    run_id: str,
    tool_name: str,
    args: dict,
) -> object:
    """Tool dispatch with CALLS layer capture."""
    snap = Snap(run_id=run_id, tool=tool_name, args=args)

    try:
        result = await dispatch_tool(tool_name, args)
        snap.result = result
        snap.ok = True
    except Exception as e:
        snap.error = str(e)
        snap.ok = False
        raise
    finally:
        snap_store.save(snap)  # always write the snap, success or failure

    # WHERE: if this tool returned a source, register it
    if isinstance(result, dict) and "source_url" in result:
        citations.register(
            run_id=run_id,
            fact_key=tool_name,
            source=result["source_url"],
            snippet=str(result.get("content", ""))[:200],
        )

    return result


async def run_agent(task: str) -> str:
    run_id = tracer.start_run(task=task)
    messages = [{"role": "user", "content": task}]

    try:
        for _ in range(20):  # max 20 turns
            response = await observed_llm_call(messages, run_id)
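            # Many chat APIs also expect the assistant message to carry its
            # tool_calls so later "tool" messages can be matched by id; adapt
            # this dict to your client's message format.
            messages.append({"role": "assistant", "content": response.content})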

            if response.tool_calls:
                for tc in response.tool_calls:
                    # WHY: log the decision before dispatching, so the record
                    # survives even if the tool call raises
                    decision_log.record(
                        run_id=run_id,
                        step=tc.name,
                        reasoning=response.content,  # agent's text from this turn
                        chosen_action=tc.name,
                        args=tc.args,
                    )

                    raw = await observed_tool_call(run_id, tc.name, tc.args)

                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": str(raw),
                    })
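            else:
                # Agent is done
                break
        else:
            # The loop ran out of turns without a final answer; fail loudly
            # instead of returning the text of an unfinished turn
            raise RuntimeError("agent exceeded 20 turns without finishing")

        final_answer = response.content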

        # WHERE: attach citations to the final answer
        cited = citations.annotate(run_id=run_id, text=final_answer)

        return cited

    finally:
        # COST: finalize run record regardless of success/failure
        run_record: RunRecord = tracer.end_run(run_id)
        print(f"Run {run_id}: ${run_record.cost_usd:.4f}, "
              f"{run_record.total_tokens:,} tokens, "
              f"{run_record.duration_s:.1f}s")


async def main():
    result = await run_agent("Research the latest Python async patterns.")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

Each layer writes to its own store, so after a run you have five sets of records you can inspect independently. You do not need all five open at once. To debug a cost spike, open the tracer records. When the agent gives a wrong answer, open the decision log. When you need the raw API call for a bug report, open the wire tap.
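The example leaves `your_llm_client` and `dispatch_tool` undefined. For a local dry run, stand-ins along these lines might be enough. Everything here is an assumption for illustration: the response shape (`content`, `tool_calls`, `usage`, `model`, `raw`) is inferred from how the main example reads it, and the tool name and URL are placeholders.

```python
import asyncio
from types import SimpleNamespace


async def your_llm_client(messages: list[dict]) -> SimpleNamespace:
    """Stub: asks for one tool call, then answers once a tool result is present."""
    has_tool_result = any(m["role"] == "tool" for m in messages)
    tool_calls = [] if has_tool_result else [
        SimpleNamespace(id="call-1", name="search", args={"query": "asyncio"})
    ]
    content = "Done." if has_tool_result else "I should search first."
    return SimpleNamespace(
        content=content,
        tool_calls=tool_calls,
        model="stub-model",  # placeholder model name
        usage=SimpleNamespace(input_tokens=10, output_tokens=5),
        raw={"content": content},
    )


async def dispatch_tool(tool_name: str, args: dict) -> dict:
    """Stub: returns a dict with the keys the WHERE layer looks for."""
    if tool_name != "search":
        raise ValueError(f"unknown tool: {tool_name}")
    return {
        "source_url": "https://example.com/doc",  # placeholder source
        "content": f"results for {args['query']}",
    }


first = asyncio.run(your_llm_client([{"role": "user", "content": "hi"}]))
print(first.tool_calls[0].name)
```

With stubs like these the loop should finish in two turns: one that requests the tool, one that answers.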

What Each Layer Actually Tells You

CALLS (agentsnap): "The search tool was called 8 times in this run. On turn 4, it was called with query='Python async' and returned 1,200 chars. On turn 7, it was called again with identical args." This is where you catch loops, redundant calls, and unexpected argument values.
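As a rough sketch of how redundant calls could be spotted once the CALLS records are loaded: the code below assumes each snapshot is available as a dict with `tool` and `args` keys. That shape and the helper name `find_repeated_calls` are assumptions for illustration, not agentsnap's documented format.

```python
import json
from collections import Counter


def find_repeated_calls(snaps: list[dict]) -> dict[tuple[str, str], int]:
    """Return (tool, canonical-args) pairs that appear more than once."""
    counts = Counter(
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
        (snap["tool"], json.dumps(snap["args"], sort_keys=True))
        for snap in snaps
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical snapshots, shaped like the Snap fields used in the main example
snaps = [
    {"tool": "search", "args": {"query": "Python async"}},
    {"tool": "docs", "args": {"page": "asyncio"}},
    {"tool": "search", "args": {"query": "Python async"}},
]
repeats = find_repeated_calls(snaps)
for (tool, args), n in repeats.items():
    print(f"{tool} called {n} times with args {args}")
```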

COST (agenttrace): "This run cost $0.0043. The expensive turn was turn 12, which used 3,400 input tokens because the context had grown large by then." This is where you see where the money goes.
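The arithmetic behind a per-turn cost figure is simple enough to sketch by hand. The prices below are placeholders rather than any provider's real rates, and `Price` and `turn_cost` are hypothetical names, not part of agenttrace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens. Placeholder values, not real rates."""
    input_per_mtok: float
    output_per_mtok: float


def turn_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Cost of one LLM call: each token count times its per-token price."""
    return (
        input_tokens / 1_000_000 * price.input_per_mtok
        + output_tokens / 1_000_000 * price.output_per_mtok
    )


price = Price(input_per_mtok=1.0, output_per_mtok=4.0)  # hypothetical

# (input_tokens, output_tokens) per turn; input grows as the context grows
turns = [(400, 120), (1_100, 150), (3_400, 200)]
costs = [turn_cost(i, o, price) for i, o in turns]
most_expensive = max(range(len(costs)), key=costs.__getitem__)
print(f"total ${sum(costs):.4f}, most expensive turn index {most_expensive}")
```

Because every turn resends the accumulated context, input tokens tend to dominate late in a run, which is why the last turn is the costly one here.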

WHY (agent-decision-log): "Before calling the summarize tool on turn 9, the agent's reasoning text was: 'I have gathered enough data. I will now summarize the top three findings.'" This is where you understand intent.

WHERE (agent-citation): "The phrase 'async context managers were introduced in Python 3.5' came from the source returned by the docs-search tool on turn 3, URL: docs.python.org/3/..." This is where you verify factual grounding.

WIRE (agenttap): "The exact request JSON that was sent to the API at 14:32:01.447 was: {...}." This is where you file bug reports, compare across model versions, or replay a broken call.

What This Does NOT Do

This setup does not aggregate across runs. Each layer writes per-run files. To see trends over 1,000 runs, you need to load those files into a database or analytics tool. The libraries give you the per-run data; aggregation is your responsibility.
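A minimal aggregation sketch using only the standard library. It assumes each run leaves a JSONL file whose lines carry `model` and `cost_usd` fields; the file layout and field names are assumptions for illustration, so check what your layers actually write before relying on them.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def load_records(directory: Path) -> list[dict]:
    """Read every non-empty line of every *.jsonl file as one JSON record."""
    records = []
    for path in sorted(directory.glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            records.extend(json.loads(line) for line in fh if line.strip())
    return records


def cost_by_model(records: list[dict]) -> dict[str, float]:
    """Sum cost_usd per model across all loaded records."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec["model"]] += rec["cost_usd"]
    return dict(totals)


# Demo against throwaway files with made-up values
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp)
    samples = [
        ("run-1", "model-a", 0.25),
        ("run-2", "model-a", 0.5),
        ("run-3", "model-b", 0.125),
    ]
    for run_id, model, cost in samples:
        line = json.dumps({"run_id": run_id, "model": model, "cost_usd": cost})
        (runs / f"{run_id}.jsonl").write_text(line + "\n", encoding="utf-8")
    totals = cost_by_model(load_records(runs))

print(totals)
```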

It does not add automatic alerting. If cost per run suddenly doubles, nothing here pages you. You need to read the RunRecord.cost_usd field and compare it to a threshold. The agent-event-bus library can emit events that feed into alerting.
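One way the threshold comparison might look, as a sketch: compare each new run's cost to a multiple of the median of recent runs. The function name, the factor of 2, and the sample costs are all illustrative assumptions.

```python
from statistics import median


def cost_alert(recent_costs: list[float], new_cost: float, factor: float = 2.0) -> bool:
    """True when new_cost exceeds `factor` times the median of recent runs."""
    if not recent_costs:
        return False  # no baseline yet, so nothing to compare against
    return new_cost > factor * median(recent_costs)


history = [0.004, 0.005, 0.0045, 0.0042]  # made-up cost_usd values
normal_run = cost_alert(history, 0.0048)
spike_run = cost_alert(history, 0.0110)
print(normal_run, spike_run)
```

A median baseline is less sensitive than a mean to one earlier outlier, so a single expensive run does not raise the bar for every later alert.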

It does not replay failed runs automatically. The wire tap captures the exact request. You can manually replay it with agenttap.replay(run_id). But automated replay on failure is a separate concern.

It does not handle multi-agent or multi-process tracing. If agent A spawns agent B, the run_id scoping is your job. Pass the parent run_id to child agents and prefix their records.
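A sketch of one possible prefixing scheme for child runs. The `/` separator and both helper names are assumptions; any convention works as long as the parent can be recovered from the child ID.

```python
import uuid


def child_run_id(parent_run_id: str, agent_name: str) -> str:
    """Build an ID like 'parent/agent-name-1a2b3c4d' so records group under the parent."""
    return f"{parent_run_id}/{agent_name}-{uuid.uuid4().hex[:8]}"


def root_run_id(run_id: str) -> str:
    """Recover the top-level run from any descendant ID."""
    return run_id.split("/", 1)[0]


parent = "run-42"
child = child_run_id(parent, "summarizer")
grandchild = child_run_id(child, "fact-checker")
print(grandchild)
print(root_run_id(grandchild))
```

If the run ID is also used as a file name, pick a separator that is safe on your filesystem instead of `/`.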

Design Reasoning

Why five separate libraries instead of one unified tracing framework? The reason is simple: you rarely need all five at the same time.

A cost spike investigation needs COST and CALLS. A factual accuracy complaint needs WHERE and WHY. A provider bug report needs WIRE. If everything is in one large trace object, you sift through noise to find the relevant signal.

Separate libraries also deploy independently. Maybe you add WHY layer logging only in staging. Maybe WIRE is too sensitive to keep in production logs (it contains the full prompt). You can enable and disable each layer without touching the others.
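Per-environment toggling could be driven by a single setting. The sketch below reads a hypothetical `OBS_LAYERS` environment variable; the variable name and the layer keys are assumptions, and none of this is built into the five libraries.

```python
import os
from typing import Mapping, Optional

ALL_LAYERS = {"calls", "cost", "why", "where", "wire"}


def enabled_layers(env: Optional[Mapping[str, str]] = None) -> set:
    """Parse OBS_LAYERS, e.g. 'calls,cost'. An unset variable means all layers."""
    env = os.environ if env is None else env
    raw = env.get("OBS_LAYERS")
    if raw is None:
        return set(ALL_LAYERS)
    requested = {name.strip().lower() for name in raw.split(",") if name.strip()}
    unknown = requested - ALL_LAYERS
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return requested


# Example: production keeps everything except WHY and the sensitive WIRE layer
production = enabled_layers({"OBS_LAYERS": "calls,cost,where"})
if "wire" not in production:
    print("raw payload capture is off")
```

Each wrapper can then check membership, for example `if "wire" in layers:`, before calling its layer.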

All five write to files, not to a running service. That means no network dependency during the agent run. If your logging service is down, the agent still runs. The data is local, durable, and inspectable with a text editor.
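To make the file-based approach concrete, here is a minimal sketch of an append-only JSONL writer. It is not the libraries' implementation; the directory layout, the `ts` field, and the helper name `append_record` are assumptions.

```python
import json
import tempfile
import time
from pathlib import Path


def append_record(directory: Path, run_id: str, record: dict) -> Path:
    """Append one JSON object as a line to <directory>/<run_id>.jsonl."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{run_id}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        # One object per line keeps the file readable in any text editor
        fh.write(json.dumps({"ts": time.time(), **record}) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    why_dir = Path(tmp) / "why"
    log_path = append_record(why_dir, "run-1", {"step": 1, "action": "search"})
    append_record(why_dir, "run-1", {"step": 2, "action": "summarize"})
    lines = log_path.read_text(encoding="utf-8").splitlines()

print(len(lines), json.loads(lines[1])["action"])
```

Appending a line per event means a crash mid-run still leaves every earlier record intact on disk.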

When This Applies

Production agents where something goes wrong and you need to know exactly what happened. The five layers give you the full picture without requiring a browser-based APM tool.

Agents in regulated environments where you need an audit trail. The WHERE layer proves where each fact came from. The WHY layer proves what reasoning was used. Both are JSONL files you can keep as records.

Development and debugging where you want to understand agent behavior before shipping. Run the agent locally with all five layers. Read the decision log. Find the turn where it went off track.

This does NOT apply to agents that run in 2-3 turns. The overhead is not worth it for simple, fast workflows.

Quick-Start Snippet

pip install agentsnap agenttrace agent-decision-log agent-citation agenttap

from agentsnap import SnapStore
from agenttrace import Tracer
from agent_decision_log import DecisionLog
from agent_citation import CitationTracker
from agenttap import WireTap

store = SnapStore("./runs/snaps")
tracer = Tracer(tag="my-agent")
decisions = DecisionLog("./runs/why")
citations = CitationTracker()
wire = WireTap("./runs/wire")

# Wrap your LLM calls and tool dispatch with the pattern above.

All five initialized in five lines. Files go to ./runs/. Inspect with any JSONL viewer.

Siblings

Library	Layer	Primary question answered
`agentsnap`	CALLS	What tools were called, args, results
`agenttrace`	COST	Tokens, dollars, latency per run
`agent-decision-log`	WHY	What reasoning led to each action
`agent-citation`	WHERE	What source backs each claim
`agenttap`	WIRE	Exact bytes sent to and from the API
`agent-event-bus`	EVENTS	Emit structured events for downstream alerting

What's Next

The next step is aggregation. Load all RunRecord objects from agenttrace into a pandas DataFrame. Group by date, feature, model, or user. Plot cost over time. That turns per-run data into operational intelligence.

For the WHY layer, the decision log format is JSONL. You can feed it into a language model that summarizes agent reasoning patterns across 100 runs. "The agent frequently chooses the summarize tool when the context exceeds 10,000 tokens." That kind of meta-observation is hard to see from a single run.

For WIRE replays, agenttap stores the full request payload. You can replay it against a different model version to compare outputs. That is a lightweight way to A/B test model upgrades without building a formal eval pipeline.

All five layers produce independent, inspectable files. You can start with just COST and CALLS. Add the others when you need them. There is no requirement to run all five from day one.
