The Agent Black Box
You send a request to your Agent. Six seconds later, you get an answer.
What happened during those six seconds?
- How many times did the LLM think?
- Which tools were called, with what arguments, returning what?
- Where did the latency actually live?
- If the answer was wrong, which step caused it?
A plain agent.invoke() tells you nothing. Agent observability solves this: it makes every decision and its cost visible.
Three Observability Levels
Development → Live Trace (see what's happening)
Analysis → Latency Timeline (see where time goes)
Production → Audit Log (record each request, replayable)
All three patterns share one foundation: LangChain's BaseCallbackHandler.
The Core: AgentTracer
BaseCallbackHandler provides hooks for the full lifecycle of LLM calls and tool invocations:
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
class AgentTracer(BaseCallbackHandler):
def __init__(self, verbose: bool = True, trace_id: str = "") -> None:
super().__init__()
self.verbose = verbose
self.trace_id = trace_id or str(uuid.uuid4())[:8]
self.steps: list[StepRecord] = []
self._llm_t0: float = 0.0
self._tool_t0: float = 0.0
self._tool_name: str = ""
# Chat models fire on_chat_model_start, not on_llm_start
def on_chat_model_start(self, serialized, messages, **kwargs) -> None:
self._llm_t0 = time.time()
if self.verbose:
print(f" [LLM →] reasoning...")
def on_llm_end(self, response: LLMResult, **kwargs) -> None:
t1 = time.time()
# Tool-calling turns have empty content; only the final turn has text
output = _extract_text(response)
self.steps.append(StepRecord(
step_type="llm", name="LLM",
output_preview=output,
start_time=self._llm_t0, end_time=t1,
))
if self.verbose:
print(f" [LLM ←] {t1 - self._llm_t0:.0f}ms | {output[:70]}")
def on_tool_start(self, serialized, input_str: str, **kwargs) -> None:
self._tool_t0 = time.time()
self._tool_name = serialized.get("name", "tool")
if self.verbose:
print(f" [TOOL→] {self._tool_name}({str(input_str)[:60]})")
def on_tool_end(self, output: str, **kwargs) -> None:
t1 = time.time()
self.steps.append(StepRecord(
step_type="tool", name=self._tool_name,
output_preview=str(output)[:80],
start_time=self._tool_t0, end_time=t1,
))
if self.verbose:
print(f" [TOOL←] {str(output)[:80]} [{(t1-self._tool_t0)*1000:.0f}ms]")
One important detail: chat models fire on_chat_model_start, not on_llm_start. Implement both for full model-type coverage.
Pass the tracer via config when invoking:
tracer = AgentTracer(verbose=True)
result = agent.invoke(
{"messages": [HumanMessage(query)]},
config={"callbacks": [tracer]},
)
Demo 1: Live Trace
Query: "What is the weather in Beijing and Shanghai? Calculate the temperature difference."
Real output:
[LLM →] reasoning...
[LLM ←] 1544ms | ← tool-calling turn, content empty
[TOOL→] get_weather({'city': 'Beijing'})
[TOOL←] {"city": "Beijing", "temp": 25, "condition": "sunny"} [1ms]
[LLM →] reasoning...
[LLM ←] 1236ms | ← tool-calling turn, content empty
[TOOL→] get_weather({'city': 'Shanghai'})
[TOOL←] {"city": "Shanghai", "temp": 22, "condition": "cloudy"} [1ms]
[LLM →] reasoning...
[LLM ←] 2895ms | ← tool-calling turn, content empty
[TOOL→] calculator({'expression': '25.0-22.0'})
[TOOL←] 25.0-22.0 = 3.0 [1ms]
[LLM →] reasoning...
[LLM ←] 2465ms | The temperature in Beijing is 25°C and Shanghai is 22°C...
Trace summary: 7 steps (4 LLM calls, 3 tool calls)
Why is output_preview empty for the middle LLM steps?
In a ReAct Agent, every time the LLM decides to call a tool, it outputs tool_calls JSON — not text. That JSON is consumed by the LangGraph framework and never appears in the content text field. Only the final LLM turn (when the model decides to stop calling tools and answer directly) produces text content.
This is exactly the value of a trace: without it you only know "it took 8 seconds." With it you know: 4 LLM calls and 3 tool calls, and which reasoning step took the longest.
Demo 2: Step Timeline
Query: "Tell me the WonderBot Pro price and calculate 299 * 12 for annual cost."
Run silently, then analyze tracer.steps to build an ASCII timeline:
total_ms = sum(s.duration_ms for s in tracer.steps)
bar_scale = 40 / total_ms
for i, step in enumerate(tracer.steps, 1):
bar = "█" * max(int(step.duration_ms * bar_scale), 1)
label = "LLM reasoning" if step.step_type == "llm" else f"tool: {step.name}"
print(f" Step {i} {label:<25} [{step.duration_ms:>6.0f}ms] {bar}")
Real output:
Step-by-step breakdown:
Step 1 LLM reasoning [ 2409ms] ██████████████
Step 2 tool: get_product_info [ 1ms] █
Step 3 LLM reasoning [ 2069ms] ████████████
Step 4 tool: calculator [ 1ms] █
Step 5 LLM reasoning [ 1977ms] ████████████
────────────────────────────────────────────────────────────
Total : 6457ms
LLM : 6455ms (100.0% of wall time)
Tools : 2ms (0.0% of wall time)
LLM = 100%. Tools = 0%.
This is the most important insight for Agent performance optimization: tool calls are essentially free (2ms). All latency comes from the LLM. When you want to speed up an Agent, reduce the number of LLM calls — fewer reasoning steps, merged requests, cached responses. Optimizing tool implementations accomplishes almost nothing.
Demo 3: Structured Audit Log
Query: "What's the weather in Shenzhen and how much does WonderBot Basic cost?"
After the run, serialize tracer.steps to JSON:
def build_audit_log(tracer: AgentTracer, query: str, answer: str) -> dict:
steps_log = []
for s in tracer.steps:
entry = {"type": s.step_type, "duration_ms": round(s.duration_ms, 1)}
if s.step_type == "tool":
entry["tool"] = s.name
entry["input"] = s.input_preview
entry["output"] = s.output_preview
else:
entry["output_preview"] = s.output_preview
steps_log.append(entry)
...
Real output:
{
"trace_id": "6c8e4b64",
"query": "What's the weather in Shenzhen and how much does WonderBot Basic cost?",
"answer": "...The cost of WonderBot Basic is not available.",
"steps": [
{"type": "llm", "duration_ms": 593.0, "output_preview": ""},
{"type": "tool", "duration_ms": 1.1, "tool": "get_weather",
"input": "{'city': 'Shenzhen'}", "output": "..."},
{"type": "llm", "duration_ms": 1164.9, "output_preview": ""},
{"type": "tool", "duration_ms": 1.0, "tool": "get_product_info",
"input": "{'product_name': 'Basic'}", "output": "Product 'Basic' not found..."},
{"type": "llm", "duration_ms": 1190.6, "output_preview": "The weather in Shenzhen..."}
],
"summary": {
"step_count": 5, "tool_call_count": 2,
"total_ms": 2950.6, "llm_ms": 2948.5, "tool_ms": 2.0
}
}
Look at step 4: get_product_info received {'product_name': 'Basic'} — the LLM passed "Basic" without the "wonderbot" prefix, so the lookup failed. The audit log captured the raw tool arguments. Without it, all you'd see in the final answer is "not available," with no idea why. With the log, you can see exactly which parameter the LLM chose to pass.
This is why audit logs record tool inputs, not just tool names.
Pattern Comparison
Pattern Use case Perf impact Output
──────────────────────────────────────────────────────────────────────
Live trace Development, debugging Yes (I/O) Console event stream
Timeline Performance analysis No (post) ASCII chart
Audit log Production, compliance No (post) JSON per request
All three share one AgentTracer. Set verbose=True for live output during development, verbose=False for silent recording in production.
Design Checklist
Tracer Implementation
- [ ] Implement
on_chat_model_start, not juston_llm_start(safer: implement both) - [ ] In
on_tool_start, save the tool name and input; use those saved values inon_tool_end - [ ] Use
start_time/end_timeinstead of computing delta inon_end— supports nested calls - [ ] Make
StepRecorda dataclass for easy serialization and downstream analysis
Audit Log
- [ ] Generate a unique
trace_idper request (first 8 chars of a UUID is enough) - [ ] Record raw tool input arguments, not just tool name — that's where bugs hide
- [ ] Expect empty
output_previewfor intermediate LLM steps: tool-calling turns have no text content - [ ] In production, write audit logs to a database or logging system, not just stdout
Performance Analysis
- [ ] Always separate
llm_msfromtool_msin your metrics - [ ] Optimization priority: fewer LLM calls >> faster tool execution
- [ ] Establish p50/p95 latency baselines to detect regressions after Agent changes
Summary
Five core takeaways:
-
LangChain callbacks are the foundation of Agent observability: four methods on
BaseCallbackHandlercover the full LLM and tool lifecycle -
Chat models fire
on_chat_model_start: easy-to-miss gotcha —on_llm_startnever fires forChatOpenAI -
Intermediate LLM steps have empty content: tool-calling turns produce
tool_callsJSON, not text — this is normal behavior - LLM = 99-100% of Agent latency: tools are in the 1-2ms range; optimize by reducing LLM calls, not tool implementations
- Audit log value = tool input parameters: the final answer alone isn't enough — the raw arguments the LLM chose to pass are what you need for debugging
Up next: Advanced Agent Memory — short-term memory, long-term memory, summary compression, and how to give an Agent context that persists across sessions.
References
- LangChain Callbacks documentation
- LangGraph documentation
- Full demo code for this series: agent-13-observability
Check out PrimeSkills — a curated marketplace of AI agents and skills that have been validated in real-world, enterprise-grade workflows. No fluff, just what actually works.
Find more useful knowledge and interesting products on my Homepage
Top comments (0)