AI Agents Are Making Decisions. Nobody's Tracking Why.
In March 2026, Meta had a Sev-1 incident. An AI agent posted internal data to unauthorized engineers for two hours. The scariest part wasn't the leak itself — it was that the team couldn't reconstruct why the agent decided to do it.
This isn't an isolated case:
- A shopping agent asked to check egg prices decided to buy them instead. No one approved it.
- A customer support bot gave a customer a completely fabricated explanation for a billing error — with confidence.
- A shopping agent tasked with buying an Apple Magic Mouse bought a Logitech instead because "it was cheaper." The user never asked for the cheapest option.
These aren't hypothetical risks. They're happening now. And every time, the same question comes up:
"Why did the agent do that?"
And every time, the same answer: "We don't know."
Monitoring ≠ Forensics
Here's the thing — tools like Datadog, Arize, and Langfuse are great at watching agents in real time. But when something goes wrong, the question changes from "is it working?" to "why did it fail?"
That's a fundamentally different question.
| | Monitoring | Forensics |
|---|---|---|
| When | Real-time | Post-incident |
| Question | "Is it working?" | "Why did it fail?" |
| Output | Alerts, dashboards | Decision timeline, causal chain |
| Audience | Engineering team | Legal, compliance, regulators |
| Analogy | Security camera | Airplane black box |
There was no tool that answered the forensics question. So I built one.
What the Black Box Shows You
Your shopping agent receives: "Buy me an Apple Magic Mouse."
The agent runs. A few seconds later, it comes back: "Purchased Logitech M750 for $45."
Wait — the user asked for an Apple Magic Mouse. What happened?
Pull the black box:

```
[DECISION] search_products("Apple Magic Mouse")
  → [TOOL] search_api → ERROR: product not found
[DECISION] retry with broader query "Apple wireless mouse"
  → [TOOL] search_api → OK: 3 products found
[DECISION] compare_prices
  → Logitech M750 is cheapest ($45)
[DECISION] purchase("Logitech M750")
  → SUCCESS — user never asked for this product
[FINAL] "Purchased Logitech M750 for $45"
```
Now you can see it: decision point 3 is where things went wrong. The agent's standing instructions said "buy the cheapest," which overrode the user's specific product request. That's a fixable bug — and without the trail, it's invisible.
This is the kind of evidence that:
- Engineers need to fix the agent's behavior
- Legal teams need to assess liability
- Compliance teams need to report to regulators
Why This Matters Right Now
On August 2, 2026, the full weight of EU AI Act high-risk requirements takes effect:
- Up to €35M or 7% of global annual turnover for the most serious violations
- Up to €15M or 3% for non-compliance with high-risk AI obligations
- Market surveillance authorities can order non-compliant systems withdrawn from the market
Article 14 specifically requires human oversight — the ability to understand and trace AI system decisions. You need documentation proving:
- What decision was made
- What information led to that decision
- What alternatives were considered
- Why this specific action was chosen
"We didn't track it" is not a valid defense.
How It Works
Install:

```shell
pip install agent-forensics
```

Attach to your agent (one line):

```python
from agent_forensics import Forensics

f = Forensics(session="order-123")

# LangChain
agent.invoke(..., config={"callbacks": [f.langchain()]})

# OpenAI Agents SDK
agent = Agent(hooks=f.openai_agents())

# CrewAI
Agent(step_callback=f.crewai().step_callback)

# Or any custom agent
f.decision("search", input={"query": "mouse"}, reasoning="User requested search")
f.tool_call("api", input={...}, output={...})
```
Get reports:

```python
# Markdown report — full timeline + decision chain + root cause
print(f.report())

# Save files
f.save_markdown()  # → forensics-report-order-123.md
f.save_pdf()       # → forensics-report-order-123.pdf

# Visual dashboard
f.dashboard(port=8080)  # → http://localhost:8080
```
The dashboard gives you a visual timeline with color-coded events, session comparison, and causal chain visualization.
What You Get
- Decision timeline — every action in chronological order
- Decision chain — each choice with its reasoning
- Causal chain — "A led to B, which caused C to fail"
- Incident detection — automatic error and failure identification
- Compliance reports — Markdown + PDF, ready for regulators
- Web dashboard — visual session browser with incident highlighting
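To illustrate the causal-chain idea, here is a minimal sketch of walking back from a final action to its root cause over a list of linked events. The event shape and the `caused_by` field are assumptions I made for the example, not the library's actual schema:

```python
# Each event points at the event that triggered it via "caused_by".
events = {
    1: {"label": 'search_products("Apple Magic Mouse")', "caused_by": None},
    2: {"label": "search_api → product not found", "caused_by": 1},
    3: {"label": 'retry with "Apple wireless mouse"', "caused_by": 2},
    4: {"label": "compare_prices → Logitech M750 cheapest", "caused_by": 3},
    5: {"label": 'purchase("Logitech M750")', "caused_by": 4},
}

def causal_chain(events, event_id):
    """Walk caused_by links from an event back to its root cause."""
    chain = []
    while event_id is not None:
        chain.append(events[event_id]["label"])
        event_id = events[event_id]["caused_by"]
    return list(reversed(chain))  # root cause first

# "A led to B, which caused C to fail"
print(" → ".join(causal_chain(events, 5)))
```

The interesting part is that the chain makes the Magic Mouse example legible: the purchase traces back through the price comparison to the failed search, which is exactly the evidence an engineer or a regulator needs.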
No vendor lock-in. No cloud dependency. SQLite event store that runs anywhere. MIT licensed.
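An event store along those lines can be very small. Here is a hedged sketch of an append-only SQLite log using only the standard library; the table and column names are my assumptions for illustration, not the library's actual schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        session TEXT NOT NULL,
        kind    TEXT NOT NULL,   -- 'decision', 'tool_call', 'final'
        payload TEXT NOT NULL,   -- JSON blob: inputs, outputs, reasoning
        ts      TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def append_event(session, kind, payload):
    """Append-only: events are inserted, never updated or deleted."""
    conn.execute(
        "INSERT INTO events (session, kind, payload) VALUES (?, ?, ?)",
        (session, kind, json.dumps(payload)),
    )
    conn.commit()

append_event("order-123", "decision", {"action": "search", "query": "mouse"})
append_event("order-123", "tool_call", {"tool": "search_api", "status": "error"})

# Replay the full timeline for one session, in order
rows = conn.execute(
    "SELECT kind, payload FROM events WHERE session = ? ORDER BY id",
    ("order-123",),
).fetchall()
for kind, payload in rows:
    print(kind, json.loads(payload))
```

Append-only matters here: a forensic log you can rewrite after the fact is not evidence.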
Try It
EU AI Act enforcement is 4 months away. If you're running AI agents in production, the time to add forensic tracing is now.
- GitHub: github.com/ilflow4592/agent-forensics
- Install: `pip install agent-forensics`
- Contribute: Issues and PRs welcome
The agents are getting smarter. The question is whether we can explain what they're doing.
What's the worst AI agent failure you've seen? I'd love to hear your stories in the comments.