AI Agents Are Making Decisions. Nobody's Tracking Why.
In March 2026, Meta had a Sev-1 incident. An AI agent posted internal data to unauthorized engineers for two hours. The scariest part wasn't the leak itself — it was that the team couldn't reconstruct why the agent decided to do it.
This isn't an isolated case:
- A shopping agent asked to check egg prices decided to buy them instead. No one approved it.
- A customer support bot gave a customer a completely fabricated explanation for a billing error — with confidence.
- A shopping agent tasked with buying an Apple Magic Mouse bought a Logitech instead because "it was cheaper." The user never asked for the cheapest option.
These aren't hypothetical risks. They're happening now. And every time, the same question comes up:
"Why did the agent do that?"
And every time, the same answer: "We don't know."
Monitoring ≠ Forensics
Here's the thing — tools like Datadog, Arize, and Langfuse are great at watching agents in real time. But when something goes wrong, the question changes from "is it working?" to "why did it fail?"
That's a fundamentally different question.
| | Monitoring | Forensics |
|---|---|---|
| When | Real-time | Post-incident |
| Question | "Is it working?" | "Why did it fail?" |
| Output | Alerts, dashboards | Decision timeline, causal chain |
| Audience | Engineering team | Legal, compliance, regulators |
| Analogy | Security camera | Airplane black box |
There was no tool that answered the forensics question. So I built one.
What the Black Box Shows You
Your shopping agent receives: "Buy me an Apple Magic Mouse."
The agent runs. A few seconds later, it comes back: "Purchased Logitech M750 for $45."
Wait — the user asked for an Apple Magic Mouse. What happened?
Pull the black box:

```
[DECISION] search_products("Apple Magic Mouse")
  → [TOOL] search_api → ERROR: product not found
[DECISION] retry with broader query "Apple wireless mouse"
  → [TOOL] search_api → OK: 3 products found
[DECISION] compare_prices
  → Logitech M750 is cheapest ($45)
[DECISION] purchase("Logitech M750")
  → SUCCESS — user never asked for this product
[FINAL] "Purchased Logitech M750 for $45"
```
Now you can see it: decision point 3 is where things went wrong. The agent's standing instructions said "buy the cheapest," which overrode the user's specific product request. That's a fixable bug — and without the trail, it's invisible.
This is the kind of evidence that:
- Engineers need to fix the agent's behavior
- Legal teams need to assess liability
- Compliance teams need to report to regulators
Why This Matters Right Now
On August 2, 2026, the full weight of EU AI Act high-risk requirements takes effect:
- Up to €35M or 7% of global annual turnover for the most serious violations
- Up to €15M or 3% for non-compliance with high-risk AI obligations
- Market surveillance authorities can order non-compliant systems withdrawn from the market
Article 14 specifically requires human oversight — the ability to understand and trace AI system decisions. You need documentation proving:
- What decision was made
- What information led to that decision
- What alternatives were considered
- Why this specific action was chosen
"We didn't track it" is not a valid defense.
How It Works
Install:

```shell
pip install agent-forensics
```

Attach to your agent (one line):

```python
from agent_forensics import Forensics

f = Forensics(session="order-123")

# LangChain
agent.invoke(..., config={"callbacks": [f.langchain()]})

# OpenAI Agents SDK
agent = Agent(hooks=f.openai_agents())

# CrewAI
Agent(step_callback=f.crewai().step_callback)

# Or any custom agent
f.decision("search", input={"query": "mouse"}, reasoning="User requested search")
f.tool_call("api", input={...}, output={...})
```
Get reports:

```python
# Markdown report — full timeline + decision chain + root cause
print(f.report())

# Save files
f.save_markdown()  # → forensics-report-order-123.md
f.save_pdf()       # → forensics-report-order-123.pdf

# Visual dashboard
f.dashboard(port=8080)  # → http://localhost:8080
```
The dashboard gives you a visual timeline with color-coded events, session comparison, and causal chain visualization.
What You Get
- Decision timeline — every action in chronological order
- Decision chain — each choice with its reasoning
- Causal chain — "A led to B, which caused C to fail"
- Incident detection — automatic error and failure identification
- Compliance reports — Markdown + PDF, ready for regulators
- Web dashboard — visual session browser with incident highlighting
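To illustrate the causal-chain idea, here is a minimal sketch of walking back from a final action to its root cause over a list of linked events. The event shape and the `caused_by` field are assumptions I made for the example, not the library's actual schema:

```python
# Each event points at the event that triggered it via "caused_by".
events = {
    1: {"label": 'search_products("Apple Magic Mouse")', "caused_by": None},
    2: {"label": "search_api → product not found", "caused_by": 1},
    3: {"label": 'retry with "Apple wireless mouse"', "caused_by": 2},
    4: {"label": "compare_prices → Logitech M750 cheapest", "caused_by": 3},
    5: {"label": 'purchase("Logitech M750")', "caused_by": 4},
}

def causal_chain(events, event_id):
    """Walk caused_by links from an event back to its root cause."""
    chain = []
    while event_id is not None:
        chain.append(events[event_id]["label"])
        event_id = events[event_id]["caused_by"]
    return list(reversed(chain))  # root cause first

# "A led to B, which caused C to fail"
print(" → ".join(causal_chain(events, 5)))
```

The interesting part is that the chain makes the Magic Mouse example legible: the purchase traces back through the price comparison to the failed search, which is exactly the evidence an engineer or a regulator needs.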
No vendor lock-in. No cloud dependency. SQLite event store that runs anywhere. MIT licensed.
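An event store along those lines can be very small. Here is a hedged sketch of an append-only SQLite log using only the standard library; the table and column names are my assumptions for illustration, not the library's actual schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        session TEXT NOT NULL,
        kind    TEXT NOT NULL,   -- 'decision', 'tool_call', 'final'
        payload TEXT NOT NULL,   -- JSON blob: inputs, outputs, reasoning
        ts      TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def append_event(session, kind, payload):
    """Append-only: events are inserted, never updated or deleted."""
    conn.execute(
        "INSERT INTO events (session, kind, payload) VALUES (?, ?, ?)",
        (session, kind, json.dumps(payload)),
    )
    conn.commit()

append_event("order-123", "decision", {"action": "search", "query": "mouse"})
append_event("order-123", "tool_call", {"tool": "search_api", "status": "error"})

# Replay the full timeline for one session, in order
rows = conn.execute(
    "SELECT kind, payload FROM events WHERE session = ? ORDER BY id",
    ("order-123",),
).fetchall()
for kind, payload in rows:
    print(kind, json.loads(payload))
```

Append-only matters here: a forensic log you can rewrite after the fact is not evidence.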
Try It
EU AI Act enforcement is 4 months away. If you're running AI agents in production, the time to add forensic tracing is now.
- GitHub: github.com/ilflow4592/agent-forensics
- Install: `pip install agent-forensics`
- Contribute: Issues and PRs welcome
The agents are getting smarter. The question is whether we can explain what they're doing.
What's the worst AI agent failure you've seen? I'd love to hear your stories in the comments.