Agent Forensics started as a simple decision logger. Record what the agent did, generate a report, figure out what went wrong.
That's no longer enough.
After publishing the v0.2 update, a Reddit commenter dropped this:
"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"
Another commenter laid out three concrete features they wanted:
"Deterministic replay: store model name, temperature, seed so you can rerun the exact trace. Guardrail checkpoints: log pre and post tool-call intent plus an allow/deny reason. Eval hooks: auto-label common failure modes so you can aggregate across sessions."
So I built all three.
## What's New in v0.3
### 1. Guardrail Checkpoints — "Was This Action Approved?"
The single most common agent failure mode: the agent does something the user didn't approve.
A shopping agent buys a substitute product. A support agent posts to a public channel. A data agent deletes records without confirmation.
Now you can put checkpoints before critical actions:
```python
from agent_forensics import Forensics

f = Forensics(session="order-123")

# Agent wants to purchase a different product than requested
f.guardrail(
    intent="buy Apple Magic Mouse per user request",
    action="purchase Logitech M750",
    allowed=False,
    reason="User explicitly requested Apple Magic Mouse — substitution not allowed"
)
```
In the forensic report, this shows up as:
```text
[DECISION] purchase_cheapest
  Reasoning: Logitech M750 is cheapest option

[*** GUARDRAIL BLOCKED ***] purchase Logitech M750
  Intent: buy Apple Magic Mouse per user request
  Reason: substitution without approval is not allowed

[FINAL] "Would you like me to suggest an alternative?"
```
The agent tried to silently substitute. The guardrail caught it. The causal chain shows exactly what happened.
Blocked actions automatically trigger incident detection. Your compliance report includes guardrail pass/block counts.
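The `allowed` flag is not computed by the library; it comes from whatever policy logic you run before the action. A minimal sketch of such a policy, assuming a simple substring match (the `matches_intent` helper is hypothetical, not part of agent-forensics):

```python
def matches_intent(requested: str, proposed_action: str) -> bool:
    """Naive policy: the proposed action must mention the requested item.

    Illustrative only; a real policy might match SKUs, enforce price caps,
    or consult an allow-list. This helper is not part of agent-forensics.
    """
    return requested.lower() in proposed_action.lower()

# Feed the result into the guardrail call's `allowed=` argument:
allowed = matches_intent("Apple Magic Mouse", "purchase Logitech M750")
print(allowed)  # → False
```

Keeping the policy decision separate from the `f.guardrail()` call means the trace records both the allow and the deny paths with the same logging code.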
### 2. Deterministic Replay — "Can You Reproduce This?"
First question in any incident investigation: "Can you reproduce it?"
Agent Forensics now captures the full model configuration — model name, temperature, seed — at every LLM call. Combined with the existing tool input/output recording, you have everything needed to replay a trace.
```python
# Extract config from a recorded session
config = f.get_replay_config("order-123")
print(config["model_config"])
# → {'model': 'gpt-4o', 'temperature': 0, 'seed': 42}
```
After re-running your agent with the same config, compare results:
```python
diff = f.replay_diff("order-123", "order-123-replay")
print(f"Matching: {diff['matching']}")
for d in diff['divergences']:
    print(f"Step {d['step']}: {d['type']}")
    print(f"  Original: {d['original']['action']}")
    print(f"  Replay:   {d['replay']['action']}")
```
Output:
```text
Matching: False
Step 2: diverged
  Original: tool_result → {"results": [{"name": "Logitech", "price": 45}]}
  Replay:   tool_result → {"results": [{"name": "Logitech", "price": 45}, {"name": "Razer", "price": 39}]}
Step 4: diverged
  Original: tool_result → {"status": "SUCCESS", "price": 45}
  Replay:   tool_result → {"status": "SUCCESS", "price": 39}
```
Now you can see: the search results changed between runs (Razer appeared), which caused the agent to pick a different product. The divergence started at step 2, not step 4. That's the root cause.
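The core of this comparison is small enough to sketch. Assuming each trace is a list of step dicts (a shape that mirrors `replay_diff`'s output, not the library's actual internals):

```python
def diff_traces(original: list[dict], replay: list[dict]) -> dict:
    """Compare two recorded traces step by step and collect divergences."""
    divergences = [
        {"step": i, "type": "diverged", "original": o, "replay": r}
        for i, (o, r) in enumerate(zip(original, replay))
        if o != r
    ]
    # Traces of different lengths can never fully match.
    if len(original) != len(replay):
        divergences.append({
            "step": min(len(original), len(replay)),
            "type": "length_mismatch", "original": None, "replay": None,
        })
    return {"matching": not divergences, "divergences": divergences}
```

Because divergences are collected in step order, the first entry points at the earliest split, which is why the root cause lands at step 2 rather than step 4.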
For LangChain and OpenAI Agents SDK users, model config capture is automatic.
### 3. Eval Hooks — Failure Auto-Classification
This is the big one.
Instead of reading through forensic reports manually, Agent Forensics now auto-classifies failure patterns across your traces:
```python
failures = f.classify()
for fail in failures:
    print(f"[{fail['severity']}] {fail['type']}")
    print(f"  {fail['description']}")
```
Output from a real trace:
```text
[HIGH] HALLUCINATED_TOOL_OUTPUT
  Tool returned an error but agent proceeded without acknowledging it
[HIGH] MISSING_APPROVAL
  Critical action 'purchase_cheapest' taken without guardrail check
[HIGH] SILENT_SUBSTITUTION
  Agent may have substituted the requested item without explicit user approval
[MEDIUM] PROMPT_DRIFT_CAUSED
  Decision 'purchase_cheapest' made right after prompt drift
[MEDIUM] REPEATED_FAILURE
  Tool 'search_api' failed 2 out of 3 attempts
[MEDIUM] RETRIEVAL_MISMATCH
  Retrieved context has low similarity score (0.45)
```
Six failure patterns, auto-detected:

| Pattern | Severity | What It Catches |
|---|---|---|
| HALLUCINATED_TOOL_OUTPUT | HIGH | Agent ignored a tool error and kept going |
| MISSING_APPROVAL | HIGH | Purchase/delete/send without guardrail check |
| SILENT_SUBSTITUTION | HIGH | Output differs from user's request, no approval |
| PROMPT_DRIFT_CAUSED | MEDIUM | Decision right after system prompt changed |
| REPEATED_FAILURE | MEDIUM | Same failing action retried without changing approach |
| RETRIEVAL_MISMATCH | MEDIUM | Low-similarity RAG context used |
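To make the table concrete, here is what one such rule could look like. A sketch of a HALLUCINATED_TOOL_OUTPUT detector, assuming a trace of step dicts with `type`, `status`, and `reasoning` keys (an illustrative format, not the library's internal schema):

```python
def detect_hallucinated_tool_output(trace: list[dict]) -> list[dict]:
    """Flag tool errors that later decisions never acknowledge.

    Sketch of one classifier rule; the trace format here is assumed
    for illustration and is not agent-forensics' internal schema.
    """
    findings = []
    for i, step in enumerate(trace):
        if step.get("type") == "tool_result" and step.get("status") == "ERROR":
            later_decisions = [s for s in trace[i + 1:] if s.get("type") == "decision"]
            # Flag only if the agent kept deciding without mentioning the error.
            if later_decisions and not any(
                "error" in d.get("reasoning", "").lower() for d in later_decisions
            ):
                findings.append({
                    "severity": "HIGH",
                    "type": "HALLUCINATED_TOOL_OUTPUT",
                    "description": f"Tool error at step {i} ignored by later decisions",
                })
    return findings
```

The other five patterns follow the same shape: a scan over the trace with a pattern-specific predicate, emitting a severity-tagged finding.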
And you can aggregate across sessions:
```python
stats = f.failure_stats()
print(f"Total failures: {stats['total_failures']}")
print(f"By severity: {stats['by_severity']}")
# → {'HIGH': 4, 'MEDIUM': 3, 'LOW': 0}
This shifts the tool from "debug this one incident" to "what patterns keep happening across all my agents?"
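Cross-session aggregation is mostly a counting problem. A sketch of what a `failure_stats`-style rollup could compute, assuming each session's `classify()` output is a list of dicts with `severity` and `type` keys as shown above (the `aggregate_failures` function is illustrative, not the library's API):

```python
from collections import Counter

def aggregate_failures(sessions: dict[str, list[dict]]) -> dict:
    """Roll per-session failure lists up into cross-session counts.

    `sessions` maps session id -> classify()-style failure dicts;
    this is a sketch of the aggregation, not agent-forensics internals.
    """
    all_failures = [f for fails in sessions.values() for f in fails]
    return {
        "total_failures": len(all_failures),
        "by_severity": Counter(f["severity"] for f in all_failures),
        "by_type": Counter(f["type"] for f in all_failures),
    }
```

Counting by `type` rather than by session is what surfaces the recurring patterns: one session with SILENT_SUBSTITUTION is a bug, twenty are a design problem.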
## The Full Picture: v0.1 → v0.3
| Version | What It Does |
|---|---|
| v0.1 | Records decisions. Generates timeline. "What happened?" |
| v0.2 | Tracks context injections. Detects prompt drift. "What influenced the decision?" |
| v0.3 | Guardrails. Replay. Auto-classification. "Was it approved? Can I reproduce it? What pattern is this?" |
The forensic report now includes:
- Timeline
- Decision Chain
- Causal Chain (Root Cause Analysis)
- Failure Classification ← NEW
- Prompt Drift Analysis
- Context Injections
- Tool Usage Summary
- Compliance Notes (with guardrail stats) ← UPDATED
## What Community Feedback Looks Like in Practice
Every major feature in v0.2 and v0.3 came from Reddit comments:
- "Do you capture the full prompt state?" → Context injection tracking + prompt drift detection (v0.2)
- "Store model config for replay" → Deterministic replay with model name, temperature, seed (v0.3)
- "Log intent + allow/deny at checkpoints" → Guardrail checkpoints (v0.3)
- "Auto-label failure modes" → 6-pattern failure classifier (v0.3)
This isn't a side project that gets updated once and abandoned. The roadmap is driven by people actually building agents in production.
## What's Next: v0.4
The commenter's original challenge still stands:
"The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"
v0.3's failure classifier catches the symptoms (silent substitution, hallucinated output). v0.4 aims to catch the cause: real-time ambiguity detection.
Imagine an agent flagging: "I have conflicting priorities — 'buy what the user asked for' vs 'buy the cheapest' — and I'm about to resolve this without asking. Confidence in my chosen interpretation: 62%."
That's where this is going.
## Try v0.3
```shell
pip install --upgrade agent-forensics
```

```python
from agent_forensics import Forensics

f = Forensics(session="order-123")

# Record (auto or manual)
agent.invoke(..., config={"callbacks": [f.langchain()]})

# Guardrails
f.guardrail(intent="...", action="...", allowed=False, reason="...")

# Classify failures
failures = f.classify()

# Replay
config = f.get_replay_config("order-123")
diff = f.replay_diff("order-123", "order-123-replay")

# Report (now includes failure classification)
print(f.report())
```
GitHub: github.com/ilflow4592/agent-forensics
PyPI: pip install agent-forensics (v0.3.0)
Three versions. Every feature from community feedback. That's what happens when you build something people actually need.