The incident started with a boring support automation task.
Take a user request, search a private document index, summarize the answer, and hand the result to a reviewer. Nothing heroic. The kind of Python agent you build when the demo is over and the real workflow begins.
Then one run got stuck in a retry loop.
It did not burn $200 before I caught it. The actual test run was cheaper. The problem was the projection: same bad loop, same document search, same model calls, left inside the overnight batch. The estimate landed close to $200 for one avoidable failure.
The answer it produced looked polished enough to pass a sleepy review. The trace behind it was not polished at all. The agent had called the right tool with the wrong input, retried against stale context, summarized old results, and kept paying for each turn.
That is when I stopped treating the agent like a chat feature.
I started treating it like a system that needs a black box.
Not a dashboard. Not a full observability stack. Not another hosted service.
Just one local file that can answer:
- What did the agent try?
- Which tool did it call?
- What input did the tool receive?
- Did the tool fail?
- How long did it take?
- Did the run cross a cost or turn limit?
- Can I query the run after everything is over?
We will build that black box in plain Python, then use DuckDB to inspect it like a tiny crash database.
Before And After
Before the fix, debugging looked like this:
The final answer is wrong.
The model probably hallucinated.
Maybe the search tool returned bad data.
Maybe the retry loop reused an old message.
Maybe the cost spike came from the model call.
That is not debugging. That is guessing with syntax highlighting.
After the fix, debugging looked like this:
Turn 1 called search_docs with the wrong query.
The tool timed out after 147.82 ms.
The retry used stale context.
The guard stopped the run at $0.0124.
DuckDB shows one tool_error and one guard_stop.
Same bug. Very different day.
The Shape Of The Problem
A normal Python script usually fails in one place.
An agent fails across a chain.
User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer
If you only log the final answer, you have a diary entry.
If you record the chain, you have evidence.
The simplest useful format is JSONL. One event per line.
{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}
JSONL is boring in exactly the right way. It appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.
A Small Recorder That Does Real Work
Here is the recorder.
It does four things:
- gives every run a unique id
- writes append-only JSONL events
- measures tool duration
- sanitizes obvious secrets before writing anything to disk
from __future__ import annotations
import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4
SECRET_KEYS = re.compile(
r"(api[_-]?key|token|password|secret|authorization|cookie)",
re.IGNORECASE,
)
@dataclass
class Event:
run_id: str
event_id: str
type: str
timestamp: float
data: dict[str, Any] = field(default_factory=dict)
def sanitize(value: Any) -> Any:
if isinstance(value, dict):
cleaned = {}
for key, item in value.items():
if SECRET_KEYS.search(str(key)):
cleaned[key] = "[redacted]"
else:
cleaned[key] = sanitize(item)
return cleaned
if isinstance(value, list):
return [sanitize(item) for item in value]
return value
class AgentBlackBox:
def __init__(self, path: str | Path, run_id: str | None = None) -> None:
self.path = Path(path)
self.run_id = run_id or uuid4().hex
self.path.parent.mkdir(parents=True, exist_ok=True)
def record(self, event_type: str, **data: Any) -> None:
event = Event(
run_id=self.run_id,
event_id=uuid4().hex,
type=event_type,
timestamp=time.time(),
data=sanitize(data),
)
with self.path.open("a", encoding="utf-8") as file:
file.write(json.dumps(asdict(event), default=str) + "\n")
@contextmanager
def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
started = time.perf_counter()
self.record("tool_start", tool=name, input=tool_input)
try:
yield
except Exception as exc:
self.record(
"tool_error",
tool=name,
error_type=type(exc).__name__,
error=str(exc),
traceback=traceback.format_exc(limit=6),
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
raise
else:
self.record(
"tool_end",
tool=name,
ok=True,
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
The sanitize() function is not perfect. It is a seatbelt, not a vault.
Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.
Wrap One Tool First
Start with one tool. Do not instrument everything on day one.
import random
import time
def search_docs(query: str, api_key: str) -> list[str]:
time.sleep(random.uniform(0.05, 0.2))
if "timeout" in query:
raise TimeoutError("Document search timed out")
return [
"JSONL works well for append-only traces.",
"Context managers are useful around tool calls.",
"DuckDB can query JSON files without a server.",
]
Now record the call:
box = AgentBlackBox("traces/run.jsonl")
query = "python agent trace format"
with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
docs = search_docs(query=query, api_key="sk-not-a-real-key")
box.record("tool_result", tool="search_docs", result_count=len(docs))
Open traces/run.jsonl and the key is redacted.
{"tool":"search_docs","input":{"query":"python agent trace format","api_key":"[redacted]"}}
That tiny detail matters. Debugging should not create a second incident.
Add A Cheap Run Guard
Most runaway agent stories start with a loop that looked harmless.
So the black box should not only record what happened. It should record when it refused to continue.
class RunStopped(RuntimeError):
pass
def stop_if_needed(
box: AgentBlackBox,
*,
turn: int,
max_turns: int,
spent_usd: float,
max_usd: float,
) -> None:
box.record(
"guard_check",
turn=turn,
max_turns=max_turns,
spent_usd=round(spent_usd, 6),
max_usd=round(max_usd, 6),
)
if turn > max_turns:
box.record("guard_stop", reason="max_turns", turn=turn)
raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
if spent_usd > max_usd:
box.record("guard_stop", reason="budget", spent_usd=spent_usd)
raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")
This is not exact billing. Use your provider response for real token counts when you have them.
The goal here is a local tripwire. You want the run to leave a clear reason when it stops.
A Tiny Agent Loop
This fake loop keeps the moving parts small.
Replace the pretend model section with your real model call.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
return input_tokens * 0.0000005 + output_tokens * 0.0000015
def run_agent(question: str) -> str:
box = AgentBlackBox("traces/run.jsonl")
messages = [{"role": "user", "content": question}]
spent_usd = 0.0
max_turns = 3
max_usd = 0.01
box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
for turn in range(1, max_turns + 1):
stop_if_needed(
box,
turn=turn,
max_turns=max_turns,
spent_usd=spent_usd,
max_usd=max_usd,
)
box.record("turn_start", turn=turn, message_count=len(messages))
# Pretend the model picked this tool input.
query = question if turn == 1 else "python jsonl duckdb traces"
with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
docs = search_docs(query=query, api_key="sk-not-a-real-key")
messages.append({"role": "tool", "content": "\n".join(docs)})
turn_cost = estimate_cost(
input_tokens=sum(len(message["content"].split()) for message in messages),
output_tokens=120,
)
spent_usd += turn_cost
box.record(
"turn_end",
turn=turn,
message_count=len(messages),
turn_cost_usd=round(turn_cost, 6),
total_cost_usd=round(spent_usd, 6),
)
answer = "Record every tool call as JSONL, then query failures after the run."
box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
return answer
Run it once with a normal question.
print(run_agent("How should I debug Python agent tools?"))
Then run it with a bad one.
print(run_agent("timeout during document search"))
The second run should fail, but now it fails with a trail.
To force a budget stop for testing, temporarily set max_usd = 0.0001. The next guard check will write a guard_stop event instead of letting the loop continue quietly.
Query The Crash With DuckDB
This is the part that makes JSONL feel less like logging and more like a debugging tool.
Install DuckDB:
pip install duckdb
Then query the trace:
import duckdb
def query_trace(path: str = "traces/run.jsonl") -> None:
con = duckdb.connect()
con.sql(
f"""
create or replace view events as
select *
from read_json_auto('{path}');
"""
)
print("Event counts")
con.sql(
"""
select type, count(*) as events
from events
group by type
order by events desc;
"""
).show()
print("Tool errors")
con.sql(
"""
select
data.tool as tool,
data.error_type as error_type,
data.error as error,
data.duration_ms as duration_ms
from events
where type = 'tool_error';
"""
).show()
print("Slow tools")
con.sql(
"""
select
data.tool as tool,
data.duration_ms as duration_ms
from events
where type = 'tool_end'
order by data.duration_ms desc
limit 5;
"""
).show()
Now run:
query_trace()
The payoff should look something like this:
Event counts
+-------------+--------+
| type | events |
+-------------+--------+
| guard_check | 4 |
| turn_start | 3 |
| tool_start | 3 |
| tool_end | 2 |
| tool_error | 1 |
| guard_stop | 1 |
+-------------+--------+
And the crash row is now a query result, not a mystery:
Tool errors
+-------------+--------------+---------------------------+-------------+
| tool | error_type | error | duration_ms |
+-------------+--------------+---------------------------+-------------+
| search_docs | TimeoutError | Document search timed out | 147.82 |
+-------------+--------------+---------------------------+-------------+
You can answer questions that normal print logs make annoying:
- Which tools failed most often?
- Which tool was slowest?
- Which turn crossed the budget?
- Did the same input fail repeatedly?
- Did the guard stop the run, or did the tool crash first?
That is the upgrade.
Not "I have logs."
"I can interrogate the run."
What I Would Record In A Real Project
For a demo, the trace above is enough.
For a real project, I would add these fields:
modelproviderprompt_hashtool_schema_versioninput_tokensoutput_tokensfinish_reasonretry_countuser_id_hashenvironment
I would not record these by default:
- raw access tokens
- private documents
- full customer prompts
- full tool responses with sensitive data
- cookies or request headers
The boring security rule is simple:
Record enough to debug behavior. Do not record enough to harm someone.
The Pattern In One Sentence
Every agent run should produce a local, append-only event stream that is safe to keep, easy to query, and useful after the process crashes.
That sentence is less exciting than a new prompt trick.
It is also more likely to save your weekend.
Full File
Here is the complete example in one place.
from __future__ import annotations
import json
import random
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4
SECRET_KEYS = re.compile(
r"(api[_-]?key|token|password|secret|authorization|cookie)",
re.IGNORECASE,
)
@dataclass
class Event:
run_id: str
event_id: str
type: str
timestamp: float
data: dict[str, Any] = field(default_factory=dict)
def sanitize(value: Any) -> Any:
if isinstance(value, dict):
return {
key: "[redacted]" if SECRET_KEYS.search(str(key)) else sanitize(item)
for key, item in value.items()
}
if isinstance(value, list):
return [sanitize(item) for item in value]
return value
class AgentBlackBox:
def __init__(self, path: str | Path, run_id: str | None = None) -> None:
self.path = Path(path)
self.run_id = run_id or uuid4().hex
self.path.parent.mkdir(parents=True, exist_ok=True)
def record(self, event_type: str, **data: Any) -> None:
event = Event(
run_id=self.run_id,
event_id=uuid4().hex,
type=event_type,
timestamp=time.time(),
data=sanitize(data),
)
with self.path.open("a", encoding="utf-8") as file:
file.write(json.dumps(asdict(event), default=str) + "\n")
@contextmanager
def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
started = time.perf_counter()
self.record("tool_start", tool=name, input=tool_input)
try:
yield
except Exception as exc:
self.record(
"tool_error",
tool=name,
error_type=type(exc).__name__,
error=str(exc),
traceback=traceback.format_exc(limit=6),
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
raise
else:
self.record(
"tool_end",
tool=name,
ok=True,
duration_ms=round((time.perf_counter() - started) * 1000, 2),
)
class RunStopped(RuntimeError):
pass
def stop_if_needed(
box: AgentBlackBox,
*,
turn: int,
max_turns: int,
spent_usd: float,
max_usd: float,
) -> None:
box.record(
"guard_check",
turn=turn,
max_turns=max_turns,
spent_usd=round(spent_usd, 6),
max_usd=round(max_usd, 6),
)
if turn > max_turns:
box.record("guard_stop", reason="max_turns", turn=turn)
raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")
if spent_usd > max_usd:
box.record("guard_stop", reason="budget", spent_usd=spent_usd)
raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")
def search_docs(query: str, api_key: str) -> list[str]:
time.sleep(random.uniform(0.05, 0.2))
if "timeout" in query:
raise TimeoutError("Document search timed out")
return [
"JSONL works well for append-only traces.",
"Context managers are useful around tool calls.",
"DuckDB can query JSON files without a server.",
]
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
return input_tokens * 0.0000005 + output_tokens * 0.0000015
def run_agent(question: str) -> str:
box = AgentBlackBox("traces/run.jsonl")
messages = [{"role": "user", "content": question}]
spent_usd = 0.0
max_turns = 3
max_usd = 0.01
box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
for turn in range(1, max_turns + 1):
stop_if_needed(
box,
turn=turn,
max_turns=max_turns,
spent_usd=spent_usd,
max_usd=max_usd,
)
box.record("turn_start", turn=turn, message_count=len(messages))
query = question if turn == 1 else "python jsonl duckdb traces"
with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
docs = search_docs(query=query, api_key="sk-not-a-real-key")
messages.append({"role": "tool", "content": "\n".join(docs)})
turn_cost = estimate_cost(
input_tokens=sum(len(message["content"].split()) for message in messages),
output_tokens=120,
)
spent_usd += turn_cost
box.record(
"turn_end",
turn=turn,
message_count=len(messages),
turn_cost_usd=round(turn_cost, 6),
total_cost_usd=round(spent_usd, 6),
)
answer = "Record every tool call as JSONL, then query failures after the run."
box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
return answer
if __name__ == "__main__":
print(run_agent("How should I debug Python agent tools?"))
There is one line in that full file worth staring at:
box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
That line changes the posture of the program.
The run is no longer a private conversation with a model. It is a recorded execution with a trace you can inspect, query, and improve.
That is the difference between a demo and something you can trust.
What would you add next: prompt hashes, token counts, screenshots, checkpoints, or replayable tool fixtures?



Top comments (0)