The WHY Layer: Log Why Your Agent Made Each Decision


Your agent gave a bad recommendation. You look at the step log. You can see what it did: it called three tools, got results, and produced a response. What you cannot see is why it made each intermediate decision. Why did it choose tool B over tool A? Why did it conclude X from the data? Why did it not ask for clarification?

The WHAT layer (agent-step-log) captures actions. The WHY layer is separate: agent-decision-log captures the reasoning behind each decision point.

The Shape of the Fix

from agent_decision_log import DecisionLog

dlog = DecisionLog(path="./logs/decisions.jsonl")

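# call_llm, messages, run_id, and step_counter come from your own agent loop.

# Option 1: after each LLM turn, take the model's visible text as its reasoning
response = call_llm(messages)
decision_text = response.content[0].text

# Option 2: prompt the model to explain its decision explicitly
reasoning_response = call_llm([
    *messages,
    {"role": "assistant", "content": response.content},
    {"role": "user", "content": "Before taking the next step, briefly explain: (1) what you decided to do, (2) why."}
])
explanation = reasoning_response.content[0].text

# Record the decision. The strings below are illustrative; in practice,
# derive them from decision_text or explanation.
dlog.record(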
    run_id=run_id,
    step=step_counter,
    decision="search for Q3 financial data",
    why="The user asked about Q3 performance but no financial data is in context yet. Need to retrieve it.",
    action_taken="call_tool(search_documents, query='Q3 revenue')",
    confidence=0.9,
)

Every significant decision point has a structured log entry: what was decided, why, what action followed, and how confident the agent was.
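For reference, here is roughly what that entry looks like once serialized to a JSONL line. This is a sketch based on the record schema shown under "Inside the Library"; the run_id and step values are placeholders, not output from a real run.

```python
import json
import time

# Field names follow the DecisionLog.record() schema shown later in this post.
# "run-001" and step 1 are placeholder values for illustration only.
entry = {
    "run_id": "run-001",
    "step": 1,
    "decision": "search for Q3 financial data",
    "why": "The user asked about Q3 performance but no financial data is in context yet. Need to retrieve it.",
    "action_taken": "call_tool(search_documents, query='Q3 revenue')",
    "confidence": 0.9,
    "alternatives_considered": [],
    "metadata": {},
    "ts": time.time(),
}

line = json.dumps(entry)     # one record per line in decisions.jsonl
restored = json.loads(line)  # reading the line back yields the same dict
print(restored["decision"], restored["confidence"])
```

Because each record is a single self-contained JSON object, the file can be tailed, grepped, or loaded line by line without a database.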

What It Does NOT Do

agent-decision-log does not automatically extract decisions from LLM responses. You record decisions explicitly. The library provides the JSONL store and the record schema; you call record() at the right moments in your agent loop.
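Since extraction is left to you, one option is to parse the model's reply to the "(1) what you decided to do, (2) why" prompt shown earlier. The helper below is a hypothetical sketch, not part of the library; it assumes the model actually uses the "(1)" and "(2)" markers and falls back to the raw text when it does not.

```python
import re

def split_explanation(text: str) -> tuple[str, str]:
    """Split a reply of the form '(1) ... (2) ...' into (decision, why).

    Falls back to (text, "") when the numbered markers are missing,
    so the caller can still log the raw explanation.
    """
    match = re.search(r"\(1\)\s*(.*?)\s*\(2\)\s*(.*)", text, flags=re.DOTALL)
    if not match:
        return text.strip(), ""
    return match.group(1).strip(), match.group(2).strip()

# Hand-written example reply, standing in for a real model response
reply = "(1) Search the document store for Q3 revenue. (2) No financial data is in context yet."
decision, why = split_explanation(reply)
print(decision)
print(why)
```

The two returned strings can then be passed as the decision and why arguments to record().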

It does not evaluate whether the reasoning is correct. It records what the model said the reasoning was. Whether the stated reasoning is truthful, accurate, or complete is a separate evaluation question.

It does not treat every LLM response as a decision. Many responses involve no explicit choice; they simply answer a question. Record decisions only at steps where the agent chooses one path over alternatives: which tool to call, whether to ask for clarification, whether to stop or continue.

Inside the Library

Each decision record has a consistent schema:

import json
import time
import threading
from pathlib import Path

class DecisionLog:
    def __init__(self, path: str):
        self._path = Path(path)
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()

    def record(
        self,
        run_id: str,
        step: int,
        decision: str,
        why: str,
        action_taken: str | None = None,
        confidence: float | None = None,
        alternatives_considered: list[str] | None = None,
        metadata: dict | None = None,
    ) -> None:
        entry = {
            "run_id": run_id,
            "step": step,
            "decision": decision,
            "why": why,
            "action_taken": action_taken,
            "confidence": confidence,
            "alternatives_considered": alternatives_considered or [],
            "metadata": metadata or {},
            "ts": time.time(),
        }

        with self._lock:
            with self._path.open("a") as f:
                f.write(json.dumps(entry) + "\n")

    def load_for_run(self, run_id: str) -> list[dict]:
        if not self._path.exists():
            return []

        entries = []
        for line in self._path.read_text().splitlines():
            if line.strip():
                entry = json.loads(line)
                if entry["run_id"] == run_id:
                    entries.append(entry)

        return sorted(entries, key=lambda e: (e["step"], e["ts"]))

    def load_all(self) -> list[dict]:
        if not self._path.exists():
            return []

        return [
            json.loads(line)
            for line in self._path.read_text().splitlines()
            if line.strip()
        ]

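The alternatives_considered field matters for auditing: when the agent decided to call tool A, what other options did it weigh? If the agent skipped a tool that would have answered the question correctly, the alternatives show whether it considered that tool and rejected it or never considered it at all. This only works if you populate the field when you call record(); it defaults to an empty list.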
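As an illustration of that kind of audit, the sketch below scans decision records for runs in which a given tool was neither chosen nor listed as an alternative. The function name and the sample records are hypothetical; the records are plain dicts in the shape that load_all() returns.

```python
def runs_that_never_considered(entries: list[dict], tool_name: str) -> list[str]:
    """Return run_ids where tool_name appears in neither a decision nor its alternatives."""
    seen: dict[str, bool] = {}
    for e in entries:
        mentioned = tool_name in e["decision"] or any(
            tool_name in alt for alt in e.get("alternatives_considered", [])
        )
        seen[e["run_id"]] = seen.get(e["run_id"], False) or mentioned
    return sorted(run for run, hit in seen.items() if not hit)

# Hand-written sample records standing in for dlog.load_all()
entries = [
    {"run_id": "run-a", "decision": "call tool 'web_search'",
     "alternatives_considered": ["search_documents"]},
    {"run_id": "run-b", "decision": "call tool 'web_search'",
     "alternatives_considered": []},
    {"run_id": "run-b", "decision": "end conversation",
     "alternatives_considered": []},
]
missing = runs_that_never_considered(entries, "search_documents")
print(missing)
```

Substring matching on tool names is a simplification; a stricter version could store the tool name in metadata and compare it exactly.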

When to Use It

Use it for agents in regulated or audited environments. Legal, medical, and financial agents need decision audit trails. Why did the agent recommend this course of action? The decision log answers that question.

Use it for debugging bad agent behavior. When a user complains about a wrong recommendation, load the decision log for that run. The chain of reasoning from input to output is visible: each step, each why, each alternative that was or was not considered.
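A debugging session might start by rendering one run's records as a readable trace. The formatter below is a hypothetical helper, not a library function; the sample list stands in for the result of dlog.load_for_run(run_id).

```python
def format_trace(entries: list[dict]) -> str:
    """Render one run's decision records as a step-by-step trace."""
    lines = []
    for e in sorted(entries, key=lambda e: (e["step"], e["ts"])):
        lines.append(f"step {e['step']}: {e['decision']}")
        lines.append(f"  why: {e['why']}")
        if e.get("alternatives_considered"):
            lines.append(f"  alternatives: {', '.join(e['alternatives_considered'])}")
        if e.get("confidence") is not None:
            lines.append(f"  confidence: {e['confidence']:.2f}")
    return "\n".join(lines)

# Hand-written sample records standing in for dlog.load_for_run(run_id)
sample = [
    {"step": 2, "ts": 2.0, "decision": "end conversation", "why": "Task complete",
     "alternatives_considered": [], "confidence": None},
    {"step": 1, "ts": 1.0, "decision": "call tool 'search_documents'",
     "why": "No financial data in context",
     "alternatives_considered": ["ask user for the report"], "confidence": 0.9},
]
trace = format_trace(sample)
print(trace)
```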

Use it for agent quality improvement. Analyze decision logs across many runs to find patterns: which decisions have low confidence? Where does the agent frequently consider alternatives but choose the wrong one? These patterns guide prompt improvements.
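One way to look for such patterns is to average the recorded confidence per decision and flag the low ones. This is a minimal sketch over plain dicts in the load_all() shape; the function name and the 0.6 threshold are arbitrary choices for illustration.

```python
from collections import defaultdict
from statistics import mean

def low_confidence_decisions(entries: list[dict], threshold: float = 0.6) -> dict[str, float]:
    """Map each decision whose mean confidence is below threshold to that mean."""
    by_decision: dict[str, list[float]] = defaultdict(list)
    for e in entries:
        # Records logged without a confidence value are skipped
        if e.get("confidence") is not None:
            by_decision[e["decision"]].append(e["confidence"])
    averages = {d: mean(values) for d, values in by_decision.items()}
    return {d: avg for d, avg in averages.items() if avg < threshold}

# Hand-written sample records standing in for dlog.load_all()
sample = [
    {"decision": "call tool 'search_documents'", "confidence": 0.9},
    {"decision": "call tool 'summarize'", "confidence": 0.4},
    {"decision": "call tool 'summarize'", "confidence": 0.5},
    {"decision": "end conversation", "confidence": None},
]
flagged = low_confidence_decisions(sample)
print(flagged)
```

The result is only as meaningful as the confidence values you log; hard-coded confidences will produce hard-coded averages.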

Skip it for simple question-answering agents with no intermediate decisions. If the agent receives a question and produces an answer in one LLM call with no tool use and no branching, there is no decision to log.

Install

pip install git+https://github.com/MukundaKatta/agent-decision-log

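# Or from PyPI
pip install agent-decision-log

A minimal agent loop with decision logging follows. call_llm, get_tool_calls, execute_tool, build_tool_result, and extract_text are helpers you supply.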

from agent_decision_log import DecisionLog

dlog = DecisionLog(path="./logs/decisions.jsonl")

def run_with_decision_logging(task: str, run_id: str) -> str:
    messages = [{"role": "user", "content": task}]
    step = 0

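    while True:
        response = call_llm(messages)
        step += 1

        # Append the assistant turn first so each tool result follows
        # the tool call that requested it
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "tool_use":
            for block in get_tool_calls(response):
                # Log the tool-selection decision. The why and confidence
                # values are placeholders; replace them with the model's
                # stated reasoning and confidence when you have them.
                dlog.record(
                    run_id=run_id,
                    step=step,
                    decision=f"call tool '{block.name}'",
                    why=f"Tool called with args: {block.input}",
                    action_taken=f"execute_tool({block.name})",
                    confidence=0.85,
                )

                result = execute_tool(block.name, block.input)
                messages.append(build_tool_result(block.id, result))

        elif response.stop_reason == "end_turn":
            dlog.record(
                run_id=run_id,
                step=step,
                decision="end conversation",
                why="Model determined the task is complete (stop_reason=end_turn)",
                action_taken="return response",
            )
            return extract_text(response)

        else:
            # Any other stop reason (for example a token limit): log it and
            # stop rather than looping indefinitely
            dlog.record(
                run_id=run_id,
                step=step,
                decision="stop on unexpected stop reason",
                why=f"stop_reason={response.stop_reason}",
                action_taken="return response",
            )
            return extract_text(response)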

Sibling Libraries

Library	What it solves
`agent-step-log`	WHAT layer: record actions and token counts
`agent-citation`	WHERE layer: record source attribution
`agent-debug-replay`	Navigate step + decision logs together
`agent-run-id`	Correlation IDs to link decision records to runs
`agent-event-bus`	Publish decisions as events for real-time subscribers

The agent observability stack: agent-run-id for correlation, agent-step-log for what happened, agent-decision-log for why it happened, agent-citation for where information came from, agent-debug-replay for navigation.

What's Next

Decision quality scoring: dlog.score_run(run_id, rubric) that takes a rubric and scores each decision for clarity, completeness, and logical consistency. Useful for identifying which decisions in a run were well-reasoned vs. ad hoc.

Structured decision extraction: instead of asking the agent to write free-form "why" text, provide a structured prompt template that extracts decision, alternatives_considered, and confidence from the model's response in JSON. Structured extraction is more consistent and easier to analyze.
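Until that lands, a rough version can be built by hand. The sketch below is an assumption about how it might look, not the planned API: DECISION_PROMPT and parse_decision are hypothetical names, and the parser assumes the model's reply contains exactly one JSON object.

```python
import json

# Hypothetical prompt; append it as a user message before asking for the explanation
DECISION_PROMPT = (
    "Before taking the next step, reply with only a JSON object with keys "
    '"decision" (string), "why" (string), "alternatives_considered" '
    '(list of strings), and "confidence" (number between 0 and 1).'
)

def parse_decision(raw: str) -> dict:
    """Extract and validate the decision fields from a model reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(raw[start : end + 1])

    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        confidence = None  # missing or non-numeric: record no confidence
    else:
        confidence = min(1.0, max(0.0, float(confidence)))  # clamp to [0, 1]

    return {
        "decision": str(data["decision"]),
        "why": str(data["why"]),
        "alternatives_considered": [str(a) for a in data.get("alternatives_considered", [])],
        "confidence": confidence,
    }

# Hand-written reply standing in for a real model response
raw = 'Sure. {"decision": "call search", "why": "no data yet", "alternatives_considered": ["ask user"], "confidence": 1.4}'
parsed = parse_decision(raw)
print(parsed)
```

The returned keys match record() parameters, so the result can be passed along as dlog.record(run_id=run_id, step=step, **parsed).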

Decision graph: dlog.to_graph(run_id) that builds a directed graph of decisions and their outcomes. Useful for visualizing how the agent's reasoning path led to the final result, and for identifying where diverging paths could have led to different outcomes.
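As a rough idea of the simplest possible version, the sketch below links each decision to the next one in step order and labels the edge with the action taken. It is a hypothetical stand-in written as a free function over plain dicts; the eventual dlog.to_graph may look quite different, and a linear chain like this does not yet capture branching.

```python
from __future__ import annotations

def to_graph(entries: list[dict]) -> dict[str, list[tuple[str, str | None]]]:
    """Build an adjacency list linking each decision to the next in step order.

    Nodes are 'step:decision' labels. Each edge is (next_node, action_taken),
    where action_taken is the action recorded on the source decision.
    """
    ordered = sorted(entries, key=lambda e: (e["step"], e["ts"]))
    labels = [f"{e['step']}:{e['decision']}" for e in ordered]
    graph: dict[str, list[tuple[str, str | None]]] = {label: [] for label in labels}
    # zip stops at the shorter sequence, so the last node gets no outgoing edge
    for current, nxt, entry in zip(labels, labels[1:], ordered):
        graph[current].append((nxt, entry.get("action_taken")))
    return graph

# Hand-written sample records standing in for dlog.load_for_run(run_id)
sample = [
    {"step": 1, "ts": 1.0, "decision": "search", "action_taken": "execute_tool(search_documents)"},
    {"step": 2, "ts": 2.0, "decision": "summarize", "action_taken": "execute_tool(summarize)"},
    {"step": 3, "ts": 3.0, "decision": "end conversation", "action_taken": "return response"},
]
graph = to_graph(sample)
for node, edges in graph.items():
    print(node, "->", edges)
```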

Built as part of the agent-stack family: composable Python primitives for production LLM agents.