DEV Community

Cover image for I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB
S M Tahosin
S M Tahosin

Posted on

I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB

The incident started with a boring support automation task.

Take a user request, search a private document index, summarize the answer, and hand the result to a reviewer. Nothing heroic. The kind of Python agent you build when the demo is over and the real workflow begins.

Then one run got stuck in a retry loop.

It did not burn $200 before I caught it. The actual test run was cheaper. The problem was the projection: same bad loop, same document search, same model calls, left inside the overnight batch. The estimate landed close to $200 for one avoidable failure.

The answer it produced looked polished enough to pass a sleepy review. The trace behind it was not polished at all. The agent had called the right tool with the wrong input, retried against stale context, summarized old results, and kept paying for each turn.

That is when I stopped treating the agent like a chat feature.

I started treating it like a system that needs a black box.

Not a dashboard. Not a full observability stack. Not another hosted service.

Just one local file that can answer:

  • What did the agent try?
  • Which tool did it call?
  • What input did the tool receive?
  • Did the tool fail?
  • How long did it take?
  • Did the run cross a cost or turn limit?
  • Can I query the run after everything is over?

We will build that black box in plain Python, then use DuckDB to inspect it like a tiny crash database.

Before And After

Before the fix, debugging looked like this:

The final answer is wrong.
The model probably hallucinated.
Maybe the search tool returned bad data.
Maybe the retry loop reused an old message.
Maybe the cost spike came from the model call.
Enter fullscreen mode Exit fullscreen mode

That is not debugging. That is guessing with syntax highlighting.

After the fix, debugging looked like this:

Turn 1 called search_docs with the wrong query.
The tool timed out after 147.82 ms.
The retry used stale context.
The guard stopped the run at $0.0124.
DuckDB shows one tool_error and one guard_stop.
Enter fullscreen mode Exit fullscreen mode

Same bug. Very different day.

The Shape Of The Problem

A normal Python script usually fails in one place.

An agent fails across a chain.

User Request -> Model Decision -> Tool Call -> Tool Result -> Next Turn -> Final Answer
Enter fullscreen mode Exit fullscreen mode

Agent run flow diagram

If you only log the final answer, you have a diary entry.

If you record the chain, you have evidence.

The simplest useful format is JSONL. One event per line.

{"type":"tool_start","tool":"search_docs","input":{"query":"rate limits"}}
{"type":"tool_end","tool":"search_docs","duration_ms":83.4,"ok":true}
{"type":"turn_end","turn":2,"total_cost_usd":0.0041}
Enter fullscreen mode Exit fullscreen mode

JSONL is boring in exactly the right way. It appends cleanly, survives crashes better than one large JSON document, and can be searched with normal tools.

JSONL trace from a failed run

A Small Recorder That Does Real Work

Here is the recorder.

It does four things:

  • gives every run a unique id
  • writes append-only JSONL events
  • measures tool duration
  • sanitizes obvious secrets before writing anything to disk
from __future__ import annotations

import json
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4


SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            if SECRET_KEYS.search(str(key)):
                cleaned[key] = "[redacted]"
            else:
                cleaned[key] = sanitize(item)
        return cleaned

    if isinstance(value, list):
        return [sanitize(item) for item in value]

    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )

        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)

        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
Enter fullscreen mode Exit fullscreen mode

The sanitize() function is not perfect. It is a seatbelt, not a vault.

Still, it prevents the most embarrassing version of this pattern: building a helpful debug trace that quietly stores API keys.

Wrap One Tool First

Start with one tool. Do not instrument everything on day one.

import random
import time


def search_docs(query: str, api_key: str) -> list[str]:
    time.sleep(random.uniform(0.05, 0.2))

    if "timeout" in query:
        raise TimeoutError("Document search timed out")

    return [
        "JSONL works well for append-only traces.",
        "Context managers are useful around tool calls.",
        "DuckDB can query JSON files without a server.",
    ]
Enter fullscreen mode Exit fullscreen mode

Now record the call:

box = AgentBlackBox("traces/run.jsonl")

query = "python agent trace format"

with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
    docs = search_docs(query=query, api_key="sk-not-a-real-key")

box.record("tool_result", tool="search_docs", result_count=len(docs))
Enter fullscreen mode Exit fullscreen mode

Open traces/run.jsonl and the key is redacted.

{"tool":"search_docs","input":{"query":"python agent trace format","api_key":"[redacted]"}}
Enter fullscreen mode Exit fullscreen mode

That tiny detail matters. Debugging should not create a second incident.

Add A Cheap Run Guard

Most runaway agent stories start with a loop that looked harmless.

So the black box should not only record what happened. It should record when it refused to continue.

class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )

    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")

    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")
Enter fullscreen mode Exit fullscreen mode

This is not exact billing. Use your provider response for real token counts when you have them.

The goal here is a local tripwire. You want the run to leave a clear reason when it stops.

A Tiny Agent Loop

This fake loop keeps the moving parts small.

Replace the pretend model section with your real model call.

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01

    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

    for turn in range(1, max_turns + 1):
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )

        box.record("turn_start", turn=turn, message_count=len(messages))

        # Pretend the model picked this tool input.
        query = question if turn == 1 else "python jsonl duckdb traces"

        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")

        messages.append({"role": "tool", "content": "\n".join(docs)})

        turn_cost = estimate_cost(
            input_tokens=sum(len(message["content"].split()) for message in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost

        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer
Enter fullscreen mode Exit fullscreen mode

Run it once with a normal question.

print(run_agent("How should I debug Python agent tools?"))
Enter fullscreen mode Exit fullscreen mode

Then run it with a bad one.

print(run_agent("timeout during document search"))
Enter fullscreen mode Exit fullscreen mode

The second run should fail, but now it fails with a trail.

To force a budget stop for testing, temporarily set max_usd = 0.0001. The next guard check will write a guard_stop event instead of letting the loop continue quietly.

Query The Crash With DuckDB

This is the part that makes JSONL feel less like logging and more like a debugging tool.

Install DuckDB:

pip install duckdb
Enter fullscreen mode Exit fullscreen mode

Then query the trace:

import duckdb


def query_trace(path: str = "traces/run.jsonl") -> None:
    con = duckdb.connect()

    con.sql(
        f"""
        create or replace view events as
        select *
        from read_json_auto('{path}');
        """
    )

    print("Event counts")
    con.sql(
        """
        select type, count(*) as events
        from events
        group by type
        order by events desc;
        """
    ).show()

    print("Tool errors")
    con.sql(
        """
        select
            data.tool as tool,
            data.error_type as error_type,
            data.error as error,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_error';
        """
    ).show()

    print("Slow tools")
    con.sql(
        """
        select
            data.tool as tool,
            data.duration_ms as duration_ms
        from events
        where type = 'tool_end'
        order by data.duration_ms desc
        limit 5;
        """
    ).show()
Enter fullscreen mode Exit fullscreen mode

Now run:

query_trace()
Enter fullscreen mode Exit fullscreen mode

The payoff should look something like this:

DuckDB query output for an agent crash

Event counts
+-------------+--------+
| type        | events |
+-------------+--------+
| guard_check |      4 |
| turn_start  |      3 |
| tool_start  |      3 |
| tool_end    |      2 |
| tool_error  |      1 |
| guard_stop  |      1 |
+-------------+--------+
Enter fullscreen mode Exit fullscreen mode

And the crash row is now a query result, not a mystery:

Tool errors
+-------------+--------------+---------------------------+-------------+
| tool        | error_type   | error                     | duration_ms |
+-------------+--------------+---------------------------+-------------+
| search_docs | TimeoutError | Document search timed out |      147.82 |
+-------------+--------------+---------------------------+-------------+
Enter fullscreen mode Exit fullscreen mode

You can answer questions that normal print logs make annoying:

  • Which tools failed most often?
  • Which tool was slowest?
  • Which turn crossed the budget?
  • Did the same input fail repeatedly?
  • Did the guard stop the run, or did the tool crash first?

That is the upgrade.

Not "I have logs."

"I can interrogate the run."

What I Would Record In A Real Project

For a demo, the trace above is enough.

For a real project, I would add these fields:

  • model
  • provider
  • prompt_hash
  • tool_schema_version
  • input_tokens
  • output_tokens
  • finish_reason
  • retry_count
  • user_id_hash
  • environment

I would not record these by default:

  • raw access tokens
  • private documents
  • full customer prompts
  • full tool responses with sensitive data
  • cookies or request headers

The boring security rule is simple:

Record enough to debug behavior. Do not record enough to harm someone.

The Pattern In One Sentence

Every agent run should produce a local, append-only event stream that is safe to keep, easy to query, and useful after the process crashes.

That sentence is less exciting than a new prompt trick.

It is also more likely to save your weekend.

Full File

Here is the complete example in one place.

from __future__ import annotations

import json
import random
import re
import time
import traceback
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Iterator
from uuid import uuid4


SECRET_KEYS = re.compile(
    r"(api[_-]?key|token|password|secret|authorization|cookie)",
    re.IGNORECASE,
)


@dataclass
class Event:
    run_id: str
    event_id: str
    type: str
    timestamp: float
    data: dict[str, Any] = field(default_factory=dict)


def sanitize(value: Any) -> Any:
    if isinstance(value, dict):
        return {
            key: "[redacted]" if SECRET_KEYS.search(str(key)) else sanitize(item)
            for key, item in value.items()
        }

    if isinstance(value, list):
        return [sanitize(item) for item in value]

    return value


class AgentBlackBox:
    def __init__(self, path: str | Path, run_id: str | None = None) -> None:
        self.path = Path(path)
        self.run_id = run_id or uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, **data: Any) -> None:
        event = Event(
            run_id=self.run_id,
            event_id=uuid4().hex,
            type=event_type,
            timestamp=time.time(),
            data=sanitize(data),
        )

        with self.path.open("a", encoding="utf-8") as file:
            file.write(json.dumps(asdict(event), default=str) + "\n")

    @contextmanager
    def tool(self, name: str, **tool_input: Any) -> Iterator[None]:
        started = time.perf_counter()
        self.record("tool_start", tool=name, input=tool_input)

        try:
            yield
        except Exception as exc:
            self.record(
                "tool_error",
                tool=name,
                error_type=type(exc).__name__,
                error=str(exc),
                traceback=traceback.format_exc(limit=6),
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )
            raise
        else:
            self.record(
                "tool_end",
                tool=name,
                ok=True,
                duration_ms=round((time.perf_counter() - started) * 1000, 2),
            )


class RunStopped(RuntimeError):
    pass


def stop_if_needed(
    box: AgentBlackBox,
    *,
    turn: int,
    max_turns: int,
    spent_usd: float,
    max_usd: float,
) -> None:
    box.record(
        "guard_check",
        turn=turn,
        max_turns=max_turns,
        spent_usd=round(spent_usd, 6),
        max_usd=round(max_usd, 6),
    )

    if turn > max_turns:
        box.record("guard_stop", reason="max_turns", turn=turn)
        raise RunStopped(f"Stopped at turn {turn}. Max turns is {max_turns}.")

    if spent_usd > max_usd:
        box.record("guard_stop", reason="budget", spent_usd=spent_usd)
        raise RunStopped(f"Stopped at ${spent_usd:.4f}. Budget is ${max_usd:.4f}.")


def search_docs(query: str, api_key: str) -> list[str]:
    time.sleep(random.uniform(0.05, 0.2))

    if "timeout" in query:
        raise TimeoutError("Document search timed out")

    return [
        "JSONL works well for append-only traces.",
        "Context managers are useful around tool calls.",
        "DuckDB can query JSON files without a server.",
    ]


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 0.0000005 + output_tokens * 0.0000015


def run_agent(question: str) -> str:
    box = AgentBlackBox("traces/run.jsonl")
    messages = [{"role": "user", "content": question}]
    spent_usd = 0.0
    max_turns = 3
    max_usd = 0.01

    box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)

    for turn in range(1, max_turns + 1):
        stop_if_needed(
            box,
            turn=turn,
            max_turns=max_turns,
            spent_usd=spent_usd,
            max_usd=max_usd,
        )

        box.record("turn_start", turn=turn, message_count=len(messages))

        query = question if turn == 1 else "python jsonl duckdb traces"

        with box.tool("search_docs", query=query, api_key="sk-not-a-real-key"):
            docs = search_docs(query=query, api_key="sk-not-a-real-key")

        messages.append({"role": "tool", "content": "\n".join(docs)})

        turn_cost = estimate_cost(
            input_tokens=sum(len(message["content"].split()) for message in messages),
            output_tokens=120,
        )
        spent_usd += turn_cost

        box.record(
            "turn_end",
            turn=turn,
            message_count=len(messages),
            turn_cost_usd=round(turn_cost, 6),
            total_cost_usd=round(spent_usd, 6),
        )

    answer = "Record every tool call as JSONL, then query failures after the run."
    box.record("run_end", answer=answer, total_cost_usd=round(spent_usd, 6))
    return answer


if __name__ == "__main__":
    print(run_agent("How should I debug Python agent tools?"))
Enter fullscreen mode Exit fullscreen mode

There is one line in that full file worth staring at:

box.record("run_start", question=question, max_turns=max_turns, max_usd=max_usd)
Enter fullscreen mode Exit fullscreen mode

That line changes the posture of the program.

The run is no longer a private conversation with a model. It is a recorded execution with a trace you can inspect, query, and improve.

That is the difference between a demo and something you can trust.

What would you add next: prompt hashes, token counts, screenshots, checkpoints, or replayable tool fixtures?

Top comments (0)