DEV Community

I Added a 71-Line Black Box to My Python Agent, Then Queried the $200 Crash With DuckDB

S M Tahosin on May 31, 2026

The incident started with a boring support automation task. Take a user request, search a private document index, summarize the answer, and hand t...
 
Mudassir Khan

the "called the right tool with the wrong input, retried against stale context" description is the failure pattern hardest to catch. wrong input reuse is invisible in the final output.

worth adding: hash the tool input on each tool_start. if the same tool fires with an identical input hash on consecutive turns, that is a retry loop signal before the guard triggers. caught this in a document QA agent — same query string across 4 turns, model summarizing the same chunk each time with full confidence.
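A minimal sketch of that check, assuming each tool call reaches the guard as a tool name plus a JSON-serializable argument dict. The names `input_hash` and `RepeatDetector` are hypothetical and not part of the article's recorder, and SHA-256 is used only because any stable hash would do.

```python
import hashlib
import json


def input_hash(tool_name: str, args: dict) -> str:
    """Short, stable fingerprint of a tool call's input."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class RepeatDetector:
    """Counts how many consecutive tool_start events reused the same input hash."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.last_hash = None
        self.streak = 0

    def on_tool_start(self, tool_name: str, args: dict) -> bool:
        """Return True when the call looks like a retry loop."""
        h = input_hash(tool_name, args)
        self.streak = self.streak + 1 if h == self.last_hash else 1
        self.last_hash = h
        return self.streak > self.max_repeats


# Four identical calls in a row: the third and fourth exceed max_repeats.
detector = RepeatDetector(max_repeats=2)
flags = [detector.on_tool_start("search_docs", {"query": "refund policy"}) for _ in range(4)]
```

The threshold is a judgment call: a low `max_repeats` flags loops early but may also flag legitimate polling.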

does the guard check input repetition, or is it purely turn count and spend?

S M Tahosin

That's a great suggestion. I like the idea of tracking input hashes because, as you said, retry loops are often invisible if you're only looking at the final output.

In the version from the article, the guard is intentionally simple and only watches turn count and cost. It doesn't currently check for repeated inputs or repeated tool patterns. But I can definitely see value in adding that layer, especially for catching "same action, same context, same result" loops before they become expensive.

The document QA example is a perfect illustration of the kind of failure that a basic budget guard won't catch early enough.

Mudassir Khan

yeah, simple and inspectable is the right call for v1. shipping the basic guard first makes sense.

the hash approach is cheap — short md5 of serialized args, stored in agent_state alongside the turn counter. two lines. the real win: makes DuckDB queries interesting. you can group by input_hash, count turns, and surface repeated failure patterns across sessions not just within one run.
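A sketch of the cross-session grouping, written with the standard library so it runs without DuckDB installed; the same question maps onto a GROUP BY over input_hash in SQL. The field names `session_id`, `turn`, `event`, and `input_hash` are assumptions here and may not match the article's schema.

```python
import json
from collections import defaultdict

# Hypothetical JSONL trace lines; a real trace would be read from the log file.
TRACE_LINES = [
    '{"session_id": "s1", "turn": 1, "event": "tool_start", "input_hash": "a1f3"}',
    '{"session_id": "s1", "turn": 2, "event": "tool_start", "input_hash": "a1f3"}',
    '{"session_id": "s1", "turn": 3, "event": "tool_start", "input_hash": "9c2e"}',
    '{"session_id": "s2", "turn": 1, "event": "tool_start", "input_hash": "a1f3"}',
    '{"session_id": "s2", "turn": 1, "event": "tool_end", "input_hash": "a1f3"}',
]


def repeated_inputs(lines, min_turns=2):
    """Group tool_start events by input_hash and keep hashes seen on several turns."""
    turns = defaultdict(int)
    sessions = defaultdict(set)
    for line in lines:
        event = json.loads(line)
        if event.get("event") != "tool_start":
            continue
        h = event["input_hash"]
        turns[h] += 1
        sessions[h].add(event["session_id"])
    rows = [
        {"input_hash": h, "turns": n, "sessions": len(sessions[h])}
        for h, n in turns.items()
        if n >= min_turns
    ]
    # Most-repeated inputs first.
    return sorted(rows, key=lambda r: r["turns"], reverse=True)


report = repeated_inputs(TRACE_LINES)
```

A hash that shows up in more than one session points at a failure pattern rather than a one-off bad run.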

have you queried across multiple agent runs yet, or just within a single session so far?

Self-Correcting Systems

This is a strong pattern.

The part I like most is that the trace is not just for debugging exceptions. It lets you
reconstruct the decision path after the fact: what the agent saw, what tool it called,
what input it used, what failed, and where the guard stopped it.

That matters because agent failures are rarely single-point failures. They are usually
chain failures: stale context → wrong tool input → retry → old result summarized
confidently → cost leak.

I’ve been testing a neighboring problem around agent memory: relevant context is not
always authoritative context. So one thing I would want in a black-box trace is not only
“what memory/context was used?” but also “what made that context allowed to govern the
next action?”

For example, I’d add fields like:

  • context_source
  • context_status (active, stale, superseded, provisional)
  • action_type (read, write, execute)
  • governing_rule
  • verification_required

Then after a crash, DuckDB could answer questions like:

  • Did the agent act from stale context?
  • Did a provisional memory govern an execute action?
  • Did a verify-first rule get skipped?
  • Which tool calls happened after the budget or confidence guard should have stopped the run?

That would connect observability with authority, not just observability with failure.
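One way those fields might look on a trace event, as a sketch. The field names follow the list above; the `verified` flag, the `AuthorityEvent` class, and the flagging rule in `risky_actions` are assumptions added for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AuthorityEvent:
    """One trace event carrying the proposed authority fields."""
    turn: int
    tool_name: str
    action_type: str             # "read", "write", or "execute"
    context_source: str
    context_status: str          # "active", "stale", "superseded", or "provisional"
    governing_rule: str
    verification_required: bool
    verified: bool = False       # assumed extra field: did the check actually run?


def to_jsonl(event: AuthorityEvent) -> str:
    """Serialize one event as a single JSONL line."""
    return json.dumps(asdict(event), sort_keys=True)


def risky_actions(events):
    """Flag writes/executes governed by non-active context, and skipped verifications."""
    flagged = []
    for e in events:
        weak_context = e.context_status != "active" and e.action_type in ("write", "execute")
        skipped_check = e.verification_required and not e.verified
        if weak_context or skipped_check:
            flagged.append(e)
    return flagged


events = [
    AuthorityEvent(1, "search_docs", "read", "index", "active", "read-any", False),
    AuthorityEvent(2, "update_ticket", "write", "memory", "provisional", "write-needs-active", True),
    AuthorityEvent(3, "send_email", "execute", "memory", "active", "verify-first", True),
    AuthorityEvent(4, "search_docs", "read", "cache", "stale", "read-any", False),
]
flagged_turns = [e.turn for e in risky_actions(events)]
```

Turn 2 is flagged because provisional memory governed a write, and turn 3 because a verify-first rule was skipped; the stale read on turn 4 passes under this particular rule, which a stricter policy might change.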

Really useful article. The “query the run after everything is over” framing is exactly
the right direction.

S M Tahosin

I really like the distinction you're making between observability and authority.

One thing that became clear while building this was that many agent failures don't start where they become visible. By the time you see the bad tool call or the budget overrun, the actual mistake may have happened several steps earlier when the agent accepted a piece of context that it shouldn't have trusted.

Your proposed fields are interesting because they move the trace from "what happened?" toward "why was this allowed to happen?" That's a much harder question, and probably the one that matters most as agents start relying more heavily on memory and long-running state.

The idea of tracking context status and governing rules especially stands out to me. Being able to ask "which actions were influenced by stale or provisional context?" would expose an entire class of failures that basic logging completely misses.

I also like your example queries. They feel very similar to the transition from debugging software failures to auditing decision systems. At that point the trace becomes more than a reliability tool. It becomes a way to inspect authority flow through the run.

Definitely gave me a few ideas for a future version of the black box. Thanks for the thoughtful comment.

Self-Correcting Systems

Exactly, “where it becomes visible” and “where it became allowed” are two different
points in the run.

That distinction is the part I keep circling back to. A trace that only records the final
bad tool call can tell you what broke, but it may not tell you which memory, rule,
assumption, or stale context gave the agent permission to move in that direction.

That is where observability starts becoming authority inspection.

The useful trace fields are not only:

  • tool called
  • input used
  • duration
  • error
  • cost

but also:

  • which context influenced this action
  • what status that context had
  • what rule governed the tool call
  • whether a higher-authority rule was skipped
  • whether the action should have verified before executing

That would let you query failures backward from the action into the authority path.

Something like:

“Show me every write action influenced by provisional context.”

or:

“Show me tool calls where stale memory appeared in the decision path.”

That is the kind of black box I think agents need next. Not just a record of execution,
but a record of why execution was permitted.

Your article already has the right foundation for that because JSONL gives the run a
durable spine. Once authority metadata gets attached to those events, the trace becomes
much more than debugging. It becomes a decision audit.

S M Tahosin

That's a really interesting way to frame it: not just "what happened?" but "what gave the agent permission to do it?"

The more I think about it, the more I agree that authority metadata could reveal an entire class of failures that normal traces miss. A bad action is often just the end of a much longer chain of accepted assumptions.

I especially like the idea of querying authority paths the same way we query execution paths. That starts moving the black box from debugging toward decision auditing, which feels like a natural next step for more capable agents.

Self-Correcting Systems

Yes, that “accepted assumptions” phrase is exactly the thing.

A lot of agent failures do not begin at the visible action. The bad tool call is just
where the chain finally becomes observable.

Before that, the system may have already accepted:

  • this memory is current
  • this note can govern
  • this policy applies to this scope
  • this tool is allowed under this context
  • this action does not need verification

If none of that authority path is recorded, the trace can tell us what happened but not
why the system believed it was permitted.

That is why I like the idea of treating authority as first-class trace data.

Execution path:

agent called tool X with input Y

Authority path:

tool X was allowed because memory A was active, policy B governed the action, gate C
passed, and no higher-authority rule blocked it

Once that exists, you can ask much better post-run questions:

  • which actions were governed by stale context?
  • which writes happened without a live source check?
  • which tool calls relied on provisional memory?
  • which policy admitted the action?
  • which authority layer was skipped?

That is the part that starts turning a black box into a decision audit.

The run trace should not only preserve what the agent did. It should preserve what the
agent thought it was allowed to do.

S M Tahosin

I think you're getting at something really important here. We spend a lot of time tracing actions, but much less time tracing the assumptions that authorized those actions.

The distinction between execution history and authority history is becoming more interesting to me the more I think about it. If an agent can explain not only what it did, but which memory, policy, or verification path allowed it to do it, post-run analysis becomes much more powerful.

At that point, we're not just debugging behavior. We're auditing decisions.

NOVAInetwork

"That is not debugging. That is guessing with syntax highlighting" is the line that lands. The whole post is the working version of that distinction.

The DuckDB-over-JSONL move is the right shape for the single-process case because it inverts the typical observability tradeoff: most teams pay for a hosted stack to get queryability, but for one agent in one process, append-only JSONL plus a free SQL engine gets you 80% of the forensic value without a vendor. The 71-line constraint is what makes it shippable instead of yet another half-built observability platform.

One extension worth considering: the schema you've got captures WHAT happened (tool_start, tool_end, tool_error, guard_check) but not WHY the agent chose that tool with that input. The model's reasoning chain (which memory was retrieved, which policy was checked, which prior turn the decision was conditioned on) is the layer below your current trace. Most "the agent hallucinated" post-mortems hit a wall at exactly that gap: you can see the call, you can't see the deliberation.

Adding a tool_selection event before tool_start, with the retrieved context hash and the policy snapshot the agent was operating under, gives you a deliberation trace alongside the execution trace. Still 71 lines of recorder code; the schema does the work.
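A rough sketch of what emitting that extra event could look like. The `Recorder` class is an in-memory stand-in rather than the article's 71-line recorder, and `digest`, `call_tool`, and the field names `context_hash` and `policy_snapshot` are hypothetical.

```python
import hashlib
import json
import time


class Recorder:
    """Tiny in-memory stand-in for an append-only JSONL recorder."""

    def __init__(self):
        self.lines = []

    def emit(self, event: str, **fields) -> dict:
        record = {"ts": time.time(), "event": event, **fields}
        self.lines.append(json.dumps(record, sort_keys=True, default=str))
        return record


def digest(value) -> str:
    """Short content hash of any JSON-serializable value."""
    raw = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def call_tool(rec, turn, tool_name, args, retrieved_context, policy, tool_fn):
    # Deliberation trace: what the agent had in hand when it picked the tool.
    rec.emit("tool_selection", turn=turn, tool_name=tool_name,
             context_hash=digest(retrieved_context), policy_snapshot=policy)
    # Execution trace: the call itself and how it ended.
    rec.emit("tool_start", turn=turn, tool_name=tool_name, input=args)
    try:
        result = tool_fn(**args)
    except Exception as exc:
        rec.emit("tool_error", turn=turn, tool_name=tool_name, error=repr(exc))
        raise
    rec.emit("tool_end", turn=turn, tool_name=tool_name)
    return result


rec = Recorder()
out = call_tool(rec, 1, "search_docs", {"query": "refund"},
                retrieved_context=["chunk-17"], policy={"max_turns": 8},
                tool_fn=lambda query: f"results for {query}")
event_names = [json.loads(line)["event"] for line in rec.lines]
```

Hashing the retrieved context keeps the trace small while still letting two turns be compared for "same context, same decision".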

The provenance question gets harder when you cross process boundaries: multi-agent coordination, retries that span sessions, model versions changing under you. That's where the local-file model starts to break and you need either a content-addressed event store or something stronger. Different problem though. For the single-agent case you're describing, the JSONL+DuckDB pattern is correct.

Building toward the cross-process version on the protocol side at NOVAI. Same forensic question, different trust assumptions when there's no single process to own the log file. The local case you're solving is the right starting point.

Good post. The constraint is the contribution.

S M Tahosin

Thanks. I really like the distinction you're making between execution traces and deliberation traces.

You're right that many investigations eventually hit the "I can see what happened, but not why it was chosen" wall. A tool_selection event with context or policy metadata would be a natural extension of the current approach without fundamentally changing the design.

And I agree about the scope. The article is very much focused on the single-agent, single-process case. Once you move into multi-agent systems and cross-process coordination, provenance becomes a much harder problem. But as you said, the local case feels like the right place to start before tackling the distributed one.

Jake Sullivan

Really strong piece. What stands out is that you are not just logging failures, you are preserving the decision trail that caused them. That is the difference between guessing at a bad output and actually isolating where stale context, a wrong tool input, or a retry loop changed the run. The DuckDB part is especially good because it turns debugging into analysis, not archaeology. This is exactly the kind of pattern more agent systems should adopt early.

S M Tahosin

Thanks, Jake. "Debugging into analysis, not archaeology" is a great way to describe it.

That was exactly the goal. Once the decision trail is preserved, you're no longer trying to reconstruct the run from memory or assumptions. You can simply query what actually happened.

Emma Sofia

Really strong pattern here. The part that stands out is not the 71 lines, it is the shift in mental model: once every run becomes an append-only event stream, debugging stops being guesswork and turns into a queryable history. I also like that redaction and guard stops are treated as first-class events, because that is what makes observability feel trustworthy instead of decorative. DuckDB is a sharp choice for this too since it keeps the whole workflow local, cheap, and easy to inspect without adding a heavy stack. This feels like a very practical baseline for anyone shipping tool-using agents, especially before the failures start costing real money.

S M Tahosin

Thanks, Emma. I really like your point about observability being trustworthy instead of decorative.

That was one of the reasons I treated things like guard stops and redaction as events rather than side notes. If the goal is to understand what actually happened during a run, those decisions should be part of the record too. And yes, keeping everything local with DuckDB was a deliberate choice. I wanted something simple enough to adopt before the failures become expensive.

Emma Sofia

The "part of the record" idea is what clicked for me too. Once guard stops, redactions, and tool decisions are all queryable events, you can start asking much richer questions about agent behavior instead of reconstructing runs from logs after the fact.

Abdullah Shahin

Flattening the critical fields (tool_name, turn_id, parent_event_id, latency_ms, tokens_in/out) into top-level columns at write time saves a lot of json_extract gymnastics in DuckDB later. First cross-day groupby is when you notice.
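A sketch of flattening at write time, assuming the recorder nests tool details under a `payload` key (that nesting is a guess about the schema, not something the article confirms). The hypothetical `flatten_event` promotes the listed fields and leaves everything else nested.

```python
import json

# Fields promoted to top-level columns; the list mirrors the comment above.
TOP_LEVEL = ("tool_name", "turn_id", "parent_event_id", "latency_ms", "tokens_in", "tokens_out")


def flatten_event(event: dict) -> dict:
    """Copy selected nested fields to the top level; keep the rest under 'payload'."""
    nested = dict(event.get("payload", {}))
    flat = {k: v for k, v in event.items() if k != "payload"}
    for key in TOP_LEVEL:
        # Missing fields become None so every row has the same columns.
        flat[key] = nested.pop(key, flat.get(key))
    flat["payload"] = nested
    return flat


raw = {
    "event": "tool_end",
    "payload": {
        "tool_name": "search_docs",
        "turn_id": 3,
        "latency_ms": 412,
        "tokens_in": 900,
        "tokens_out": 120,
        "result_preview": "...",
    },
}
flat = flatten_event(raw)
line = json.dumps(flat, sort_keys=True)  # one JSONL line, ready to append
```

Writing None for absent fields keeps the column set stable across event types, which tends to make schema inference over the file more predictable.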

Loop detection is where this gets messy. Same tool_name with near-identical args can be either a real retry or actual progress when upstream context changed. A cheap hack that works: hash (tool_name, normalized_args, context_digest) per call, count collisions per turn window. False-positives on legitimate polling drop a lot.
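A sketch of that fingerprint-and-window idea. The normalization rule (lowercasing and collapsing whitespace), the window measured in calls rather than turns, and the names `fingerprint` and `WindowedLoopCounter` are all assumptions.

```python
import hashlib
import json
from collections import Counter, deque


def normalize_args(args: dict) -> dict:
    """Lowercase and collapse whitespace in string values so near-identical args match."""
    return {
        k: (" ".join(v.lower().split()) if isinstance(v, str) else v)
        for k, v in args.items()
    }


def fingerprint(tool_name: str, args: dict, context_digest: str) -> str:
    """Hash of (tool_name, normalized_args, context_digest)."""
    payload = json.dumps([tool_name, normalize_args(args), context_digest],
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class WindowedLoopCounter:
    """Counts fingerprint collisions within the last `window` calls."""

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)

    def observe(self, tool_name: str, args: dict, context_digest: str) -> int:
        fp = fingerprint(tool_name, args, context_digest)
        self.recent.append(fp)
        # Collisions = earlier calls still in the window with the same fingerprint.
        return Counter(self.recent)[fp] - 1


counter = WindowedLoopCounter(window=5)
collisions = [
    counter.observe("search_docs", {"query": "Refund  policy"}, "ctx-1"),
    counter.observe("search_docs", {"query": "refund policy"}, "ctx-1"),
    counter.observe("search_docs", {"query": "refund policy"}, "ctx-2"),
]
```

The third call uses the same query but a different context digest, so it counts as progress instead of a retry.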

Also, sanitizing tool inputs is the obvious case, but tool outputs are where most agent traces leak secrets. The function-result branch is the one that catches people.
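A sketch of redacting on the output side as well. The patterns and key names below are illustrative only and will miss many real secret formats; `redact` and `record_tool_end` are hypothetical helpers, not the article's sanitize function.

```python
import re

# Illustrative patterns only; a real deployment needs patterns for its own secret formats.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password)\b\s*[:=]\s*\S+"),
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email-like strings
]
SENSITIVE_KEYS = {"authorization", "api_key", "password", "token"}


def redact(value):
    """Recursively redact strings, dict values, and list items."""
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    return value


def record_tool_end(emit, tool_name, result):
    # The function-result branch: sanitize the output before it reaches the trace.
    emit({"event": "tool_end", "tool_name": tool_name, "output": redact(result)})


events = []
record_tool_end(events.append, "fetch_user", {
    "email": "ana@example.com",
    "token": "abc123",
    "note": "password: hunter2 ok",
})
```

Redacting by key name and by pattern catches different leaks: the first covers structured results, the second covers secrets buried in free text.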

S M Tahosin

Those are great points, Abdullah.

I especially agree about tool outputs. Most people think about sanitizing inputs, but outputs are often where sensitive data quietly ends up in traces.

The context_digest idea is interesting too. One thing I ran into was that a simple retry count doesn't tell you whether the agent is stuck or actually making progress. Factoring context into the fingerprint seems like a practical way to separate the two without adding much complexity.

You've definitely given me a few ideas for a future iteration of the black box.

Felix

This is such a creative approach — using DuckDB as a debugging query layer is something I haven't seen before. The $200 crash point is painfully relatable. One pattern I've found helpful is logging the full request/response for every LLM call (model, prompt, tokens, latency, error) to a SQLite db. It turns "mysterious crash" into "I can see exactly which model+prompt combo caused it." Nice to see someone pushing the debugging workflow forward!
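For reference, a small sketch of that SQLite pattern using Python's built-in sqlite3 module. The table name `llm_calls`, its columns, and the helper names are assumptions chosen to match the fields listed above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model TEXT NOT NULL,
    prompt TEXT NOT NULL,
    response TEXT,
    tokens_in INTEGER,
    tokens_out INTEGER,
    latency_ms REAL,
    error TEXT
)
"""


def open_log(path=":memory:"):
    """Open (or create) the call log; ':memory:' keeps this example off disk."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_call(conn, model, prompt, response=None, tokens_in=0, tokens_out=0,
             latency_ms=0.0, error=None):
    """Insert one request/response record."""
    conn.execute(
        "INSERT INTO llm_calls (model, prompt, response, tokens_in, tokens_out, latency_ms, error) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (model, prompt, response, tokens_in, tokens_out, latency_ms, error),
    )
    conn.commit()


def errors_by_model(conn):
    """Count failed calls per model, most failures first."""
    return conn.execute(
        "SELECT model, COUNT(*) FROM llm_calls WHERE error IS NOT NULL "
        "GROUP BY model ORDER BY COUNT(*) DESC"
    ).fetchall()


conn = open_log()
log_call(conn, "model-a", "summarize the ticket", response="ok", tokens_in=800, tokens_out=90, latency_ms=640.0)
log_call(conn, "model-b", "summarize the ticket", error="timeout", latency_ms=30000.0)
log_call(conn, "model-b", "classify the ticket", error="rate limit")
failures = errors_by_model(conn)
```

Full prompts and responses can contain sensitive data, so a log like this likely needs the same redaction step as any other trace.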

S M Tahosin

Thanks, Felix. The $200 crash was definitely the moment that convinced me I needed something more than traditional logs. 😅

I like your SQLite approach too. Being able to trace issues back to a specific model, prompt, and response combination is incredibly valuable. In the end, I think the common theme is making agent behavior inspectable instead of trying to debug from the final output alone.

Elsie Nora

The way you integrated a compact “black box” into your Python agent and then leveraged DuckDB for querying a large crash dataset is really interesting. I appreciate how you balanced minimal code complexity with practical functionality, especially using only 71 lines to achieve what would usually require a more extensive pipeline. One point I found particularly clever was treating the crash dataset as an analytical layer rather than just raw logs, which opens opportunities for near real-time insights. It would be interesting to see how this approach scales when the dataset grows beyond the 200 records—do you think performance will hold, or would you consider chunking or indexing strategies?

S M Tahosin

I really like how you described it as an analytical layer rather than just logs. That was exactly the mindset behind using DuckDB.

As for scale, I think DuckDB would comfortably handle far more than what I showed in the article. If traces grew significantly, I'd probably look at partitioning or archiving older events first, while keeping the event structure unchanged. The nice part is that the tracing approach stays simple even as the storage strategy evolves.

Valentin Monteiro

The 71-line constraint is clever, but the column I'd add to that trace is cost. Knowing which tool call consumed how many tokens per step turns a debugging tool into a budget tool. The $200 crash gets a root cause and a price tag per decision.
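A sketch of turning token counts into a per-tool price tag. The prices below are made-up placeholders, not any provider's real rates, and the event fields `tokens_in` and `tokens_out` are assumed to be present on `tool_end` events.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; real prices depend on provider and model.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def event_cost(tokens_in: int, tokens_out: int, prices=PRICE_PER_1K) -> float:
    """Estimated cost of one step from its token counts."""
    return tokens_in / 1000 * prices["input"] + tokens_out / 1000 * prices["output"]


def cost_by_tool(events):
    """Sum the estimated cost of tool_end events per tool name."""
    totals = defaultdict(float)
    for e in events:
        if e["event"] == "tool_end":
            totals[e["tool_name"]] += event_cost(e["tokens_in"], e["tokens_out"])
    return dict(totals)


events = [
    {"event": "tool_end", "tool_name": "search_docs", "tokens_in": 2000, "tokens_out": 0},
    {"event": "tool_end", "tool_name": "summarize", "tokens_in": 1000, "tokens_out": 1000},
    {"event": "tool_end", "tool_name": "search_docs", "tokens_in": 2000, "tokens_out": 0},
    {"event": "guard_check", "tool_name": None, "tokens_in": 0, "tokens_out": 0},
]
totals = cost_by_tool(events)
```

Storing the computed cost on each event at write time would make the same breakdown a one-line aggregate at query time.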

Harjot Singh

A 71-line black box that lets you query the crash with DuckDB afterward is a lovely example of the highest-ROI move in agent reliability: making the run inspectable after the fact. Agents fail in ways logs don't capture well. The interesting question is never just what threw, it's what the state was when it went wrong, and structured, queryable event capture turns a vague "it broke" into "select what happened around the failure."

The DuckDB angle is the clever bit, because it means the trace isn't just readable, it's analyzable: you can aggregate across many runs (which tool fails most, where tokens get burned, what precedes the bad outputs) instead of squinting at one log at a time, which is exactly how you go from anecdote to pattern.

The thing I like most is the 71 lines. Observability for agents has a reputation for needing a heavyweight platform, but a tiny structured event log you own often beats a vendor dashboard because you can query it however the incident demands. Capture structured events cheaply, then let SQL ask the questions you didn't anticipate. That make-the-run-queryable instinct is core to how I think about agent debugging in Moonshift.

Are you logging one event per tool call, or finer-grained, capturing the model's inputs/outputs at each step so you can reconstruct the decision too?

S M Tahosin

That's a great way to put it. The shift from "what failed?" to "what was happening when it failed?" was exactly what pushed me toward building the black box in the first place.

I also agree with your point about moving from anecdotes to patterns. Looking at a single failed run is useful, but being able to ask questions across many runs is where things get interesting. That's where DuckDB ended up providing far more value than I expected.

For the current version, I'm logging more than just tool calls. Each run captures lifecycle events, tool starts and ends, errors, timing, guard checks, and the associated inputs and outputs after sanitization. The goal is to reconstruct enough of the execution path to understand not only what the agent did, but why it ended up there.

What I'm not fully capturing yet is a richer view of the model's internal decision process between steps. That's probably the next layer I want to explore because, as you mentioned, the really interesting failures often happen before the tool error appears.

I'd be curious to hear how you're approaching this in Moonshift. Are you storing the reasoning trail as structured events too, or focusing primarily on tool and state transitions?

Jordan Miles

What stood out to me is that the black box is not really an observability layer, it's a change in how we model agent failures. Most teams still treat an agent as a prompt that occasionally calls tools, so when something goes wrong they inspect the final output. Your approach treats the entire run as an execution history that can be queried later.

I also like the decision to use JSONL + DuckDB instead of introducing a heavier telemetry stack. There is a sweet spot between print debugging and full distributed tracing, and many agent projects probably live there. The append only design means the trace survives the very failures you're trying to investigate.

One thing I'd be curious about: have you considered recording parent/child relationships between events? As agents become multi-agent systems or start spawning parallel tool calls, reconstructing causality becomes harder than identifying individual failures. A simple event graph could make the DuckDB queries even more powerful without adding much complexity.
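A sketch of the lineage idea, assuming each event carries an `event_id` and an optional `parent_event_id` (both field names are assumptions). The hypothetical `ancestors` helper walks from a failing event back to the start of its chain.

```python
def ancestors(events, event_id):
    """Walk parent_event_id links from an event back to the root of its chain."""
    by_id = {e["event_id"]: e for e in events}
    chain = []
    seen = set()
    current = by_id.get(event_id)
    while current is not None and current["event_id"] not in seen:
        seen.add(current["event_id"])  # guard against accidental cycles
        chain.append(current)
        current = by_id.get(current.get("parent_event_id"))
    return chain


# Two tool calls spawned in parallel from the same run; only one of them failed.
events = [
    {"event_id": "e1", "parent_event_id": None, "event": "run_start"},
    {"event_id": "e2", "parent_event_id": "e1", "event": "tool_start", "tool_name": "search_docs"},
    {"event_id": "e3", "parent_event_id": "e1", "event": "tool_start", "tool_name": "fetch_user"},
    {"event_id": "e4", "parent_event_id": "e3", "event": "tool_error", "tool_name": "fetch_user"},
]
failure_chain = [e["event_id"] for e in ancestors(events, "e4")]
```

With parallel calls, the chain excludes the sibling branch (`e2` here), which is the information a flat, time-ordered log cannot provide on its own.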

The bigger lesson here is that cost overruns and hallucinations are often symptoms, not root causes. Once you can reconstruct the execution path, the conversation shifts from "the model did something weird" to "this exact decision chain produced this outcome." That is a much more useful place to debug from.

S M Tahosin

Thanks, Jordan. I think you captured the core idea perfectly. The goal was to stop treating failures as isolated outputs and start treating them as outcomes of a traceable execution path.

I also like your point about parent/child relationships. The current version is intentionally simple, but event lineage becomes much more important once you introduce parallel tools or multiple agents. That's definitely an interesting direction for extending the black box without losing its lightweight nature.

And I completely agree with the last point. In many cases, the bad output is just the final symptom. The real value comes from being able to trace back and find the exact decision that set the run on the wrong path.

Ethan Park

The biggest win is not the 71 lines, it is the shift from postmortem guessing to a queryable execution record. Once every tool call, guard check, and failure is encoded as an event, debugging stops being “the model probably drifted” and becomes a concrete investigation into where the chain broke. I also like that you kept it local and lightweight instead of reaching for a heavy observability stack. That makes the pattern much easier to adopt in real projects, especially for agents where the expensive part is usually not the bug itself, but the time spent reconstructing the path that led to it.

S M Tahosin

Exactly. The goal wasn't really the 71 lines, it was making the run inspectable. Once the execution becomes queryable, debugging shifts from theories to evidence.

I also wanted something lightweight enough that people could adopt without adding another platform to their stack. Thanks for the thoughtful insight.

Kevin Lee

The interesting angle here is that this feels less like logging and more like observability for agents. Most people inspect the final output when something breaks, but the real issue is often hidden in tool selection, retries, or context flow. Using DuckDB to query failures makes debugging feel much more structured. Curious if you’ve experimented with tracking the “reason” behind tool choices too, because wrong decisions can be harder to catch than obvious crashes.

S M Tahosin

That's exactly how I see it too. A lot of agent failures don't show up as crashes, they show up as reasonable-looking decisions that were based on the wrong context or tool choice.

I haven't experimented much with recording the reasoning behind tool selection yet, but I think that's a really interesting direction. In many cases, understanding why a tool was chosen could be more valuable than knowing that it failed.