My agent made a $400 mistake. I had no idea why. So I built a WHY layer.


It was a billing run. The agent was supposed to classify a list of line items, group them by category, and write a summary CSV. Simple enough. We had run it a dozen times.

On the thirteenth run it did something different. It called the summarize tool before classifying half the items. The output was wrong. The summary was missing a whole category. A human caught it three days later when reconciling against the source data. The error had propagated into two downstream reports.

I went back to look at the trace. I had logs from agentsnap for the tool calls: what functions were invoked, in what order, with what arguments. I had cost and latency from agenttrace. I knew exactly what happened in the mechanical sense.

I had no idea why.

Did the model decide the partial classification was good enough? Did it miss the remaining items? Did some intermediate result look like a termination condition? Did it pick between summarize-now and classify-more and choose wrong? I could not tell. The trace was a sequence of calls with no record of the decisions behind them.

agent-decision-log is what I built so that next time I have a paper trail.

The shape of the fix

At every branch point in the loop, before you act, you log the decision.

from agent_decision_log import DecisionLog, Decision, Outcome, RationaleQuality, InMemorySink

sink = InMemorySink()
log = DecisionLog(sink=sink)

# The agent is choosing what to do next.
decision = Decision(
    step="classify_or_summarize",
    options=["classify_remaining", "summarize_now", "abort"],
    chosen="classify_remaining",
    rationale="14 items remain unclassified; summarizing now would miss them",
    confidence=0.92,
    quality=RationaleQuality.EXPLICIT,
)
log.record(decision)

# ... later, after you know how it turned out ...
log.resolve(decision, outcome=Outcome.SUCCESS, notes="All 14 items classified before summarizing")

If the model chose summarize_now instead, that record is in the log. If the rationale was thin or missing, RationaleQuality.IMPLICIT or RationaleQuality.INFERRED signals that. When the run is over, you replay the decision log and see every fork.

for entry in sink.entries():
    print(entry.step, entry.chosen, entry.outcome, entry.quality)

You can also write to disk with JsonlSink:

from agent_decision_log import JsonlSink

with JsonlSink("run-20260524.jsonl") as sink:
    log = DecisionLog(sink=sink)
    ...  # run the agent, calling log.record() at each branch point

Each line is a JSON object. You can grep it, load it in pandas, or pipe it into any observability tool you already have.
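As a rough illustration of that, here is a standard-library-only sketch that reads a JSONL file back and pulls out the promoted records. The field names (step, chosen, promoted) are an assumption that the file mirrors the Decision attributes shown above; check a real file before relying on them. The sample file is written to a temporary directory so the sketch runs on its own.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records. Field names are assumed to mirror the Decision
# attributes; the real file written by JsonlSink may differ.
sample_records = [
    {"step": "classify_or_summarize", "chosen": "classify_remaining", "promoted": False},
    {"step": "tool_selection", "chosen": "web_fetch", "promoted": True},
]


def load_decisions(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Write a throwaway file so the example is self-contained.
tmp_path = Path(tempfile.mkdtemp()) / "run-example.jsonl"
tmp_path.write_text(
    "\n".join(json.dumps(r) for r in sample_records) + "\n", encoding="utf-8"
)

decisions = load_decisions(tmp_path)
promoted_steps = [d["step"] for d in decisions if d.get("promoted")]
print(promoted_steps)
```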

What it does NOT do

It does not make decisions for you. The library records what your code already decided. The judgment stays with your agent.
It does not replay the agent. It captures the decision record from one run. Replaying that run against a different model is a different problem.
It does not integrate with any specific LLM SDK. You call log.record() yourself. Works with the Anthropic SDK, the OpenAI SDK, or raw HTTP, because it has no dependency on any of them.
It does not score rationale quality automatically. The RationaleQuality enum is for you to set based on how confident you are that the recorded rationale actually drove the choice.

Inside the lib: one design choice worth showing

The interesting case is when the model invokes an option that was never listed as a candidate.

You listed ["classify_remaining", "summarize_now", "abort"]. The model called a tool you did not anticipate. Or it combined two options. Or it invented a new path.

A strict library would reject that. "Option not in the candidate set, raise an error." That sounds reasonable until you realize that the hallucinated branch is the most interesting data point you have. The model went off-script. You want that in the log, not silently dropped.

The library promotes it. If chosen is not in options, the library appends the chosen value to the options list and marks the decision with promoted=True. The record is preserved. The sink gets it. You can query for all promoted decisions across a run and that is your list of places the model improvised.

from agent_decision_log import DecisionLog, Decision, InMemorySink

sink = InMemorySink()
log = DecisionLog(sink=sink)

d = Decision(
    step="tool_selection",
    options=["search", "summarize"],
    chosen="web_fetch",       # not in options
    rationale="model decided to fetch a page directly",
)
log.record(d)

promoted = [e for e in sink.entries() if e.promoted]
print(f"{len(promoted)} hallucinated-option promotions in this run")

In a billing run like mine, a promoted decision at classify_or_summarize, with chosen="summarize_now" when the only listed candidate was "classify_remaining", would have been exactly the signal I needed. One grep, root cause found.
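The promotion rule itself is small enough to sketch without the library. The class below is an illustrative stand-in, not the library's Decision: the name SketchDecision is hypothetical, and copying the options list before appending is a choice made for this sketch, not something the library is known to do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchDecision:
    """Illustrative stand-in for the promotion rule, not the library's class."""

    step: str
    options: List[str]
    chosen: str
    rationale: str = ""
    promoted: bool = field(default=False, init=False)

    def __post_init__(self):
        # Copy so the caller's list is left untouched (a choice of this sketch).
        self.options = list(self.options)
        if self.chosen not in self.options:
            # Keep the off-script choice instead of rejecting it, and flag it.
            self.options.append(self.chosen)
            self.promoted = True


off_script = SketchDecision(
    step="tool_selection", options=["search", "summarize"], chosen="web_fetch"
)
on_script = SketchDecision(
    step="tool_selection", options=["search", "summarize"], chosen="search"
)
print(off_script.options, off_script.promoted)
print(on_script.options, on_script.promoted)
```

The point of the design is that an unexpected choice becomes a flagged record rather than an exception, so the run keeps going and the evidence survives.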

When this is useful

Post-incident forensics. When a run went wrong and you need to know which fork caused it, not just what calls were made.
Prompt tuning. You have two system prompts. You want to compare how often each one leads to the right branch at each step. The decision log gives you branch statistics without manual inspection.
Compliance and audit requirements. Some regulated environments need a written rationale for every consequential decision. The log is that record.
Debugging loops that seem to run fine but produce wrong output. Call traces look clean. The decision log reveals that a step was chosen with quality=INFERRED every time, meaning the rationale was weak and the choice was fragile.
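The last case can be checked mechanically. This is a minimal sketch, assuming decision records are available as dicts with step and quality keys and that quality is stored as the enum member's name; both are assumptions about the serialized form, and the helper name always_inferred_steps is hypothetical.

```python
from collections import Counter

# Hypothetical records gathered from several runs.
records = [
    {"step": "classify_or_summarize", "chosen": "classify_remaining", "quality": "EXPLICIT"},
    {"step": "classify_or_summarize", "chosen": "summarize_now", "quality": "INFERRED"},
    {"step": "classify_or_summarize", "chosen": "classify_remaining", "quality": "EXPLICIT"},
    {"step": "tool_selection", "chosen": "search", "quality": "INFERRED"},
    {"step": "tool_selection", "chosen": "search", "quality": "INFERRED"},
]


def always_inferred_steps(records):
    """Return the steps where every recorded decision had quality INFERRED."""
    totals = Counter()
    inferred = Counter()
    for r in records:
        totals[r["step"]] += 1
        if r["quality"] == "INFERRED":
            inferred[r["step"]] += 1
    return sorted(step for step in totals if inferred[step] == totals[step])


fragile = always_inferred_steps(records)
print(fragile)
```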

When this is NOT what you want

You want automatic rationale extraction from LLM output. This library does not parse model reasoning tokens or chain-of-thought text. If you want to extract rationale from the model's own output, you need a separate extraction step before calling log.record().
You need distributed tracing with span IDs and parent-child relationships across services. The log is per-run and flat. For distributed agent systems, pair it with an observability layer that handles trace propagation.
You want it to catch bad decisions in real time and halt the agent. The library is observational. It records; it does not intervene.
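For the first case, the separate extraction step might look something like the sketch below. It assumes you have prompted the model to end its reply with CHOICE: and RATIONALE: lines; that convention, the regexes, and the function name are all hypothetical and not part of the library. The returned values are what you would then pass into Decision yourself.

```python
import re
from typing import Optional, Tuple

# Hypothetical output convention: the prompt asks the model to finish with
# "CHOICE: <option>" and "RATIONALE: <one sentence>" on their own lines.
_CHOICE_RE = re.compile(r"^CHOICE:[ \t]*(\S+)[ \t]*$", re.MULTILINE)
_RATIONALE_RE = re.compile(r"^RATIONALE:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_choice_and_rationale(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Pull the chosen option and stated rationale out of raw model text.

    Either element is None when its marker line is missing or empty.
    """
    choice = _CHOICE_RE.search(text)
    rationale = _RATIONALE_RE.search(text)
    return (
        choice.group(1) if choice else None,
        rationale.group(1) if rationale else None,
    )


model_output = """I still see unclassified rows.
CHOICE: classify_remaining
RATIONALE: 14 items remain unclassified; summarizing now would miss them
"""
chosen, rationale = extract_choice_and_rationale(model_output)
print(chosen)
print(rationale)
```

A missing rationale comes back as None, which is the situation where you would record the decision with a weaker RationaleQuality instead of EXPLICIT.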

Install

pip install agent-decision-log

Zero dependencies. Python 3.9+.

Repo: https://github.com/MukundaKatta/agent-decision-log

Sibling libraries

Lib	Boundary	Repo
agent-decision-log	WHY each decision was made	https://github.com/MukundaKatta/agent-decision-log
agentsnap	WHAT tool calls happened	https://github.com/MukundaKatta/agentsnap
agenttrace	COST and LATENCY per run	https://github.com/MukundaKatta/agenttrace
agent-citation	WHERE each output claim came from	https://github.com/MukundaKatta/agent-citation
agent-replay-trace	Load and step through JSONL traces	https://github.com/MukundaKatta/agent-replay-trace

What's next

Two things. First, a DecisionDiff helper that takes two JSONL files from runs against different prompts or models and produces a branch-by-branch comparison. Which steps diverged? Which options were chosen more often? What was the outcome distribution per step?

Second, a promoted_only() convenience query on InMemorySink that returns just the hallucinated-option records from a run. Right now you filter with a list comprehension. That should be a single method call.
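Until the first of those exists, the branch-by-branch comparison can be approximated by hand. This is a sketch only: diff_runs and choice_counts are hypothetical names, not the planned DecisionDiff API, and the records are assumed to be dicts with step and chosen keys loaded from the two JSONL files.

```python
from collections import Counter, defaultdict


def choice_counts(records):
    """Count how often each option was chosen at each step."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["step"]][r["chosen"]] += 1
    return counts


def diff_runs(run_a, run_b):
    """Return the steps whose choice distribution differs between two runs."""
    a, b = choice_counts(run_a), choice_counts(run_b)
    diverged = {}
    for step in sorted(set(a) | set(b)):
        ca, cb = a.get(step, Counter()), b.get(step, Counter())
        if ca != cb:
            diverged[step] = {"a": dict(ca), "b": dict(cb)}
    return diverged


# Hypothetical records from two runs with different prompts.
run_a = [
    {"step": "classify_or_summarize", "chosen": "classify_remaining"},
    {"step": "tool_selection", "chosen": "search"},
]
run_b = [
    {"step": "classify_or_summarize", "chosen": "summarize_now"},
    {"step": "tool_selection", "chosen": "search"},
]

diff = diff_runs(run_a, run_b)
print(diff)
```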

The billing agent now has decision logging at every branch. The next time it goes off-script, I will know which fork it took, what options it was given, what it said the reason was, and whether the option it chose was even on the list.