WonderLab

Posted on Jun 6

Agent Series (14): Agent Observability — Tracing Every Decision, Making the Black Box Transparent

#ai #llm #langchain #agents

The Agent Black Box

You send a request to your Agent. Six seconds later, you get an answer.

What happened during those six seconds?

How many times did the LLM think?
Which tools were called, with what arguments, returning what?
Where did the latency actually live?
If the answer was wrong, which step caused it?

A plain agent.invoke() tells you nothing. Agent observability solves this: it makes every decision and its cost visible.

Three Observability Levels

Development   →  Live Trace        (see what's happening)
Analysis      →  Latency Timeline  (see where time goes)
Production    →  Audit Log         (record each request, replayable)

All three patterns share one foundation: LangChain's BaseCallbackHandler.

The Core: AgentTracer

BaseCallbackHandler provides hooks for the full lifecycle of LLM calls and tool invocations:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

class AgentTracer(BaseCallbackHandler):
    def __init__(self, verbose: bool = True, trace_id: str = "") -> None:
        super().__init__()
        self.verbose = verbose
        self.trace_id = trace_id or str(uuid.uuid4())[:8]
        self.steps: list[StepRecord] = []
        self._llm_t0: float = 0.0
        self._tool_t0: float = 0.0
        self._tool_name: str = ""

    # Chat models fire on_chat_model_start, not on_llm_start
    def on_chat_model_start(self, serialized, messages, **kwargs) -> None:
        self._llm_t0 = time.time()
        if self.verbose:
            print(f"  [LLM →] reasoning...")

    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        t1 = time.time()
        # Tool-calling turns have empty content; only the final turn has text
        output = _extract_text(response)
        self.steps.append(StepRecord(
            step_type="llm", name="LLM",
            output_preview=output,
            start_time=self._llm_t0, end_time=t1,
        ))
        if self.verbose:
            print(f"  [LLM ←] {t1 - self._llm_t0:.0f}ms  |  {output[:70]}")

    def on_tool_start(self, serialized, input_str: str, **kwargs) -> None:
        self._tool_t0 = time.time()
        self._tool_name = serialized.get("name", "tool")
        if self.verbose:
            print(f"  [TOOL→] {self._tool_name}({str(input_str)[:60]})")

    def on_tool_end(self, output: str, **kwargs) -> None:
        t1 = time.time()
        self.steps.append(StepRecord(
            step_type="tool", name=self._tool_name,
            output_preview=str(output)[:80],
            start_time=self._tool_t0, end_time=t1,
        ))
        if self.verbose:
            print(f"  [TOOL←] {str(output)[:80]}  [{(t1-self._tool_t0)*1000:.0f}ms]")

One important detail: chat models fire on_chat_model_start, not on_llm_start. Implement both for full model-type coverage.

Pass the tracer via config when invoking:

tracer = AgentTracer(verbose=True)
result = agent.invoke(
    {"messages": [HumanMessage(query)]},
    config={"callbacks": [tracer]},
)

Demo 1: Live Trace

Query: "What is the weather in Beijing and Shanghai? Calculate the temperature difference."

Real output:

  [LLM →] reasoning...
  [LLM ←] 1544ms  |                              ← tool-calling turn, content empty
  [TOOL→] get_weather({'city': 'Beijing'})
  [TOOL←] {"city": "Beijing", "temp": 25, "condition": "sunny"}  [1ms]
  [LLM →] reasoning...
  [LLM ←] 1236ms  |                              ← tool-calling turn, content empty
  [TOOL→] get_weather({'city': 'Shanghai'})
  [TOOL←] {"city": "Shanghai", "temp": 22, "condition": "cloudy"}  [1ms]
  [LLM →] reasoning...
  [LLM ←] 2895ms  |                              ← tool-calling turn, content empty
  [TOOL→] calculator({'expression': '25.0-22.0'})
  [TOOL←] 25.0-22.0 = 3.0  [1ms]
  [LLM →] reasoning...
  [LLM ←] 2465ms  |  The temperature in Beijing is 25°C and Shanghai is 22°C...

Trace summary: 7 steps  (4 LLM calls, 3 tool calls)

Why is output_preview empty for the middle LLM steps?

In a ReAct Agent, every time the LLM decides to call a tool, it outputs tool_calls JSON — not text. That JSON is consumed by the LangGraph framework and never appears in the content text field. Only the final LLM turn (when the model decides to stop calling tools and answer directly) produces text content.

This is exactly the value of a trace: without it you only know "it took 8 seconds." With it you know: 4 LLM calls and 3 tool calls, and which reasoning step took the longest.

Demo 2: Step Timeline

Query: "Tell me the WonderBot Pro price and calculate 299 * 12 for annual cost."

Run silently, then analyze tracer.steps to build an ASCII timeline:

total_ms  = sum(s.duration_ms for s in tracer.steps)
bar_scale = 40 / total_ms

for i, step in enumerate(tracer.steps, 1):
    bar   = "█" * max(int(step.duration_ms * bar_scale), 1)
    label = "LLM reasoning" if step.step_type == "llm" else f"tool: {step.name}"
    print(f"  Step {i}  {label:<25} [{step.duration_ms:>6.0f}ms]  {bar}")

Real output:

Step-by-step breakdown:

  Step 1  LLM reasoning             [  2409ms]  ██████████████
  Step 2  tool: get_product_info    [     1ms]  █
  Step 3  LLM reasoning             [  2069ms]  ████████████
  Step 4  tool: calculator          [     1ms]  █
  Step 5  LLM reasoning             [  1977ms]  ████████████

  ────────────────────────────────────────────────────────────
  Total : 6457ms
  LLM   : 6455ms  (100.0% of wall time)
  Tools :    2ms  (0.0% of wall time)

LLM = 100%. Tools = 0%.

This is the most important insight for Agent performance optimization: tool calls are essentially free (2ms). All latency comes from the LLM. When you want to speed up an Agent, reduce the number of LLM calls — fewer reasoning steps, merged requests, cached responses. Optimizing tool implementations accomplishes almost nothing.

Demo 3: Structured Audit Log

Query: "What's the weather in Shenzhen and how much does WonderBot Basic cost?"

After the run, serialize tracer.steps to JSON:

def build_audit_log(tracer: AgentTracer, query: str, answer: str) -> dict:
    steps_log = []
    for s in tracer.steps:
        entry = {"type": s.step_type, "duration_ms": round(s.duration_ms, 1)}
        if s.step_type == "tool":
            entry["tool"]   = s.name
            entry["input"]  = s.input_preview
            entry["output"] = s.output_preview
        else:
            entry["output_preview"] = s.output_preview
        steps_log.append(entry)
    ...

Real output:

{
  "trace_id": "6c8e4b64",
  "query": "What's the weather in Shenzhen and how much does WonderBot Basic cost?",
  "answer": "...The cost of WonderBot Basic is not available.",
  "steps": [
    {"type": "llm",  "duration_ms": 593.0,  "output_preview": ""},
    {"type": "tool", "duration_ms": 1.1,    "tool": "get_weather",
     "input": "{'city': 'Shenzhen'}", "output": "..."},
    {"type": "llm",  "duration_ms": 1164.9, "output_preview": ""},
    {"type": "tool", "duration_ms": 1.0,    "tool": "get_product_info",
     "input": "{'product_name': 'Basic'}", "output": "Product 'Basic' not found..."},
    {"type": "llm",  "duration_ms": 1190.6, "output_preview": "The weather in Shenzhen..."}
  ],
  "summary": {
    "step_count": 5, "tool_call_count": 2,
    "total_ms": 2950.6, "llm_ms": 2948.5, "tool_ms": 2.0
  }
}

Look at step 4: get_product_info received {'product_name': 'Basic'} — the LLM passed "Basic" without the "wonderbot" prefix, so the lookup failed. The audit log captured the raw tool arguments. Without it, all you'd see in the final answer is "not available," with no idea why. With the log, you can see exactly which parameter the LLM chose to pass.

This is why audit logs record tool inputs, not just tool names.

Pattern Comparison

Pattern         Use case                     Perf impact   Output
──────────────────────────────────────────────────────────────────────
Live trace      Development, debugging       Yes (I/O)     Console event stream
Timeline        Performance analysis         No (post)     ASCII chart
Audit log       Production, compliance       No (post)     JSON per request

All three share one AgentTracer. Set verbose=True for live output during development, verbose=False for silent recording in production.

Design Checklist

Tracer Implementation

[ ] Implement on_chat_model_start, not just on_llm_start (safer: implement both)
[ ] In on_tool_start, save the tool name and input; use those saved values in on_tool_end
[ ] Use start_time / end_time instead of computing delta in on_end — supports nested calls
[ ] Make StepRecord a dataclass for easy serialization and downstream analysis

Audit Log

[ ] Generate a unique trace_id per request (first 8 chars of a UUID is enough)
[ ] Record raw tool input arguments, not just tool name — that's where bugs hide
[ ] Expect empty output_preview for intermediate LLM steps: tool-calling turns have no text content
[ ] In production, write audit logs to a database or logging system, not just stdout

Performance Analysis

[ ] Always separate llm_ms from tool_ms in your metrics
[ ] Optimization priority: fewer LLM calls >> faster tool execution
[ ] Establish p50/p95 latency baselines to detect regressions after Agent changes

Summary

Five core takeaways:

LangChain callbacks are the foundation of Agent observability: four methods on BaseCallbackHandler cover the full LLM and tool lifecycle
Chat models fire on_chat_model_start: easy-to-miss gotcha — on_llm_start never fires for ChatOpenAI
Intermediate LLM steps have empty content: tool-calling turns produce tool_calls JSON, not text — this is normal behavior
LLM = 99-100% of Agent latency: tools are in the 1-2ms range; optimize by reducing LLM calls, not tool implementations
Audit log value = tool input parameters: the final answer alone isn't enough — the raw arguments the LLM chose to pass are what you need for debugging

Up next: Advanced Agent Memory — short-term memory, long-term memory, summary compression, and how to give an Agent context that persists across sessions.

References

LangChain Callbacks documentation
LangGraph documentation
Full demo code for this series: agent-13-observability

Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.

Find more useful knowledge and interesting products on my Homepage

DEV Community

Agent Series (14): Agent Observability — Tracing Every Decision, Making the Black Box Transparent

The Agent Black Box

Three Observability Levels

The Core: AgentTracer

Demo 1: Live Trace

Demo 2: Step Timeline

Demo 3: Structured Audit Log

Pattern Comparison

Design Checklist

Summary

References

Top comments (0)