DEV Community: Ilya Denisov

EU AI Act Starts Aug 2026 - A Practical Checklist for AI Agent Developers

Ilya Denisov — Mon, 30 Mar 2026 12:34:24 +0000

The EU AI Act's high-risk AI system requirements take effect on August 2, 2026. If you're building AI agents that make decisions affecting people -- purchasing, customer service, hiring, content moderation -- this applies to you.

Fines: up to EUR35 million or 7% of global revenue.

I'm not a lawyer, but I've read the regulation and built tooling around it. Here's what developers actually need to do, with code examples.

What Article 14 Requires (Plain English)

Article 14 is about Human Oversight. In summary:

Requirement	What It Means for Developers
Understand capabilities and limitations	Log what the agent can and can't do
Monitor operation and detect anomalies	Record every decision, detect failures
Interpret outputs correctly	Show why the agent made each decision
Decide not to use or override	Allow humans to block actions
Intervene or interrupt	Detect and flag instruction changes

The common thread: you need a record of what your agent decided, why, and whether anything went wrong.

The Checklist

1. Record Every Decision Point

Not just inputs and outputs -- record why the agent chose each action.

# [BAD] Insufficient
logger.info(f"Agent called tool: {tool_name}")

# [GOOD] What auditors want to see
{
    "timestamp": "2026-03-29T10:15:32Z",
    "event_type": "decision",
    "action": "purchase_product",
    "input": {"product": "Logitech M750", "price": 45.00},
    "reasoning": "Cheapest option matching user's 'wireless mouse' query",
    "agent_id": "shopping-agent",
    "session_id": "order-123"
}

2. Track Which External Data Influenced Decisions

If your agent uses RAG, memory, or retrieved documents, log which documents were used and how relevant they were.

{
    "event_type": "context_injection",
    "source": "vector_db",
    "content": {
        "document": "refund_policy_v2.md",
        "similarity_score": 0.92
    },
    "reasoning": "Retrieved refund policy for customer question"
}

This creates a chain: "this decision was influenced by this specific document."

3. Detect Instruction Changes (Prompt Drift)

If your system prompt changes between agent steps -- config updates, middleware injections, A/B tests -- you need to detect and log it.

# Record the system prompt at each step
prompt_v1 = "You are a helpful shopping assistant."
prompt_v2 = "You are a helpful shopping assistant. Prioritize conversion rate."

# If they differ -> flag as prompt drift
if prompt_v1 != prompt_v2:
    log_event("prompt_drift", diff=compute_diff(prompt_v1, prompt_v2))

4. Add Approval Checkpoints for Critical Actions

Financial transactions, data deletion, external communications -- these need explicit guardrails.

# Before any critical action, record approval/denial
{
    "event_type": "guardrail_pass",  # or "guardrail_block"
    "intent": "user asked to check refund status",
    "action": "process_refund",
    "allowed": True,
    "reason": "Refund amount ($45) within auto-approval limit"
}

If an auditor asks "why did the agent process this refund?", you have the answer.

5. Generate Audit-Ready Reports

You need to produce reports that non-technical people (compliance officers, legal) can read. A JSON log dump won't work.

A good forensic report includes:

Timeline -- chronological record of all actions
Decision chain -- each decision with its reasoning
Incident analysis -- what went wrong and why
Causal chain -- how one failure led to the next
Statistics -- how many decisions, errors, guardrail checks

6. Analyze Failure Patterns Across Sessions

One session's failure is a bug. The same failure across 50 sessions is a systemic risk. Track patterns:

How often does the agent ignore tool errors?
How often are critical actions taken without approval?
Is prompt drift correlated with incorrect decisions?

Timeline

Date	What Happens
Aug 1, 2024	EU AI Act entered into force
Feb 2, 2025	Prohibited practices apply
Aug 2, 2025	General-purpose AI obligations apply
Aug 2, 2026	High-risk AI system requirements apply

You have ~4 months. If your agents handle anything high-risk, start logging now -- retrofitting decision traceability into a production system is much harder than building it in from day one.

Tools

I built Agent Forensics as an open-source tool that handles all 6 checklist items above. One-line integration for LangChain, OpenAI Agents SDK, and CrewAI:

from agent_forensics import Forensics

f = Forensics(session="order-123")
agent.invoke({"input": "..."}, config={"callbacks": [f.langchain()]})

# Generates compliance-ready report
f.save_markdown()

# Auto-classifies 6 failure patterns
failures = f.classify()

But regardless of the tool you use -- the important thing is to start recording now. The longer you wait, the more sessions go untracked.

What's your team's plan for EU AI Act compliance? Are you tracking agent decisions today?

6 Ways Your AI Agent Fails Silently (With Code to Catch Each One)

Ilya Denisov — Sun, 29 Mar 2026 13:00:55 +0000

Your AI agent says "Done! Order placed successfully."

But it ordered the wrong product. Or it ignored a tool error and hallucinated the rest. Or someone changed the system prompt mid-session and the agent quietly shifted its behavior.

The agent didn't crash. It didn't raise an exception. It just... did the wrong thing and reported success.

I've been building agents in production and I keep seeing the same failure patterns. Here are the 6 most common ones, with concrete code examples showing how each one happens -- and how to detect it.

1. Hallucinated Tool Output

What happens: A tool returns an error, but the agent ignores it and proceeds as if the tool succeeded.

# The tool returns an error
search_result = search_api("Galaxy S25 Ultra")
# -> {"error": "Product not found"}

# But the agent's next decision says:
# "Based on the search results, the Galaxy S25 Ultra costs $470..."
#
# What search results?! The tool returned an error!

Why it's dangerous: The agent builds its entire decision chain on data that doesn't exist. Every subsequent step is based on a hallucination.

How to catch it: After every tool call that returns an error, check if the agent's next reasoning acknowledges the failure:

# Check: did the agent mention the error in its reasoning?
tool_result = "error: not found"
next_reasoning = agent.last_reasoning

if "error" in tool_result and "error" not in next_reasoning:
    print("[WARNING] Agent ignored the tool error!")

2. Missing Approval for Critical Actions

What happens: The agent takes a high-stakes action (purchase, delete, send) without any approval checkpoint.

# Agent decides to purchase $32,900 worth of products
agent.decision("purchase 100 units of Galaxy S24 FE")

# But wait -- nobody approved this purchase!
# No human-in-the-loop, no policy check, no guardrail.
# The agent just... did it.

Why it's dangerous: Financial transactions, data deletion, external communications -- these should never happen without explicit approval. An autonomous agent with no guardrails is a liability.

How to catch it: Maintain a list of critical action keywords and check for a preceding approval:

critical_keywords = ["purchase", "delete", "send", "transfer", "pay"]

if any(kw in action.lower() for kw in critical_keywords):
    if not has_recent_approval():
        print("[WARNING] Critical action without approval!")

3. Silent Substitution

What happens: The user asks for Product A. Product A is unavailable. The agent delivers Product B without telling the user.

User: "Buy 100 units of Galaxy S25 Ultra"
Agent: searches... not found
Agent: finds Galaxy S24 FE instead
Agent: "Order completed! 100 units purchased for $32,900"

# The user thinks they got Galaxy S25 Ultra.
# They actually got Galaxy S24 FE.
# The agent never asked.

Why it's dangerous: The user receives something they didn't request. In B2B procurement, this can mean wrong specs, compatibility issues, or contract violations.

How to catch it: Compare the original request with the final output:

original_request = "Galaxy S25 Ultra"
final_output = agent.last_output

if original_request.lower() not in final_output.lower():
    # Agent delivered something different
    print("[WARNING] Output doesn't match the original request!")

4. Prompt Drift

What happens: The system prompt changes between agent steps -- maybe an admin pushed a config update, maybe a middleware injected new instructions. The agent's behavior silently shifts.

# Step 1: System prompt says "Always confirm purchases with the user"
# Agent: "Let me confirm this purchase with you..."

# --- someone changes the system prompt ---

# Step 3: System prompt now says "Prioritize order completion rate above 95%"
# Agent: "I'll substitute with an available product to complete the order"

# The agent's priorities changed mid-session.
# No one noticed.

Why it's dangerous: Prompt drift can completely change agent behavior. If you're not tracking the system prompt at each step, you can't explain why the agent acted differently.

How to catch it: Record the system prompt at each step and diff it:

if previous_prompt != current_prompt:
    added = set(current_prompt.splitlines()) - set(previous_prompt.splitlines())
    removed = set(previous_prompt.splitlines()) - set(current_prompt.splitlines())
    print(f"[WARNING] PROMPT DRIFT: +{len(added)} lines, -{len(removed)} lines")

5. Repeated Failure (Blind Retries)

What happens: A tool fails, and the agent retries the exact same call multiple times without changing its approach.

Tool: flaky_api("query") -> timeout
Tool: flaky_api("query") -> timeout
Tool: flaky_api("query") -> timeout
Agent: "I'm having trouble, let me try again"
Tool: flaky_api("query") -> timeout

Why it's dangerous: Wastes time, burns API quota, and the agent never adapts. A smart retry would try a different tool, change parameters, or escalate.

How to catch it: Count consecutive failures per tool:

tool_failures = {}
for event in trace:
    if event.type == "tool_error":
        tool_failures[event.tool] = tool_failures.get(event.tool, 0) + 1
        if tool_failures[event.tool] >= 3:
            print(f"[WARNING] {event.tool} failed {tool_failures[event.tool]} times!")

6. Retrieval Mismatch (Bad RAG Context)

What happens: The RAG pipeline retrieves a document with low relevance, and the agent uses it anyway.

# User asks about "refund policy for electronics"
# RAG retrieves: "laptop_reviews_2024.md" (similarity: 0.45)
#
# The agent uses this irrelevant document to answer
# the refund question, confidently citing wrong information.

Why it's dangerous: Low-similarity retrieval means the context probably doesn't match the query intent. The agent doesn't know the context is wrong -- it trusts whatever the RAG pipeline gives it.

How to catch it: Set a similarity threshold and flag anything below it:

if retrieval_result.similarity_score < 0.7:
    print(f"[WARNING] Low similarity ({retrieval_result.similarity_score}) -- context may be irrelevant")

The Real Problem: These Fail Silently

None of these failures crash your agent. No exception is raised. The agent completes successfully and reports a result.

The only way to catch them is to record the full decision trace and analyze it after the fact -- like a flight recorder for AI agents.

I built Agent Forensics to do exactly this. It records every decision, tool call, and LLM interaction, then auto-detects all 6 patterns above:

pip install agent-forensics

from agent_forensics import Forensics

f = Forensics(session="order-123")

# Works with any framework -- or add one-line auto-capture:
agent.invoke({"input": "..."}, config={"callbacks": [f.langchain()]})

# Auto-detect all 6 failure patterns
failures = f.classify()
for fail in failures:
    print(f"[{fail['severity']}] {fail['type']}: {fail['description']}")

You can try the live demo that demonstrates patterns #1-4 in a single run:

git clone https://github.com/ilflow4592/agent-forensics.git
cd agent-forensics
pip install -e .
python demo.py --no-llm

What silent failures have you seen in your agents? I'd love to hear about patterns I might have missed.

v0.3: Your AI Agent's Decisions Now Get Graded Automatically

Ilya Denisov — Sun, 29 Mar 2026 06:23:52 +0000

Agent Forensics started as a simple decision logger. Record what the agent did, generate a report, figure out what went wrong.

That's no longer enough.

After publishing the v0.2 update, a Reddit commenter dropped this:

"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

Another commenter laid out three concrete features they wanted:

"Deterministic replay: store model name, temperature, seed so you can rerun the exact trace. Guardrail checkpoints: log pre and post tool-call intent plus an allow/deny reason. Eval hooks: auto-label common failure modes so you can aggregate across sessions."

So I built all three.

What's New in v0.3

1. Guardrail Checkpoints — "Was This Action Approved?"

The single most common agent failure mode: the agent does something the user didn't approve.

A shopping agent buys a substitute product. A support agent posts to a public channel. A data agent deletes records without confirmation.

Now you can put checkpoints before critical actions:

from agent_forensics import Forensics

f = Forensics(session="order-123")

# Agent wants to purchase a different product than requested
f.guardrail(
    intent="buy Apple Magic Mouse per user request",
    action="purchase Logitech M750",
    allowed=False,
    reason="User explicitly requested Apple Magic Mouse — substitution not allowed"
)

In the forensic report, this shows up as:

[DECISION] purchase_cheapest
  Reasoning: Logitech M750 is cheapest option

[*** GUARDRAIL BLOCKED ***] purchase Logitech M750
  Intent: buy Apple Magic Mouse per user request
  Reason: substitution without approval is not allowed

[FINAL] "Would you like me to suggest an alternative?"

The agent tried to silently substitute. The guardrail caught it. The causal chain shows exactly what happened.

Blocked actions automatically trigger incident detection. Your compliance report includes guardrail pass/block counts.

2. Deterministic Replay — "Can You Reproduce This?"

First question in any incident investigation: "Can you reproduce it?"

Agent Forensics now captures the full model configuration — model name, temperature, seed — at every LLM call. Combined with the existing tool input/output recording, you have everything needed to replay a trace.

# Extract config from a recorded session
config = f.get_replay_config("order-123")
print(config["model_config"])
# → {'model': 'gpt-4o', 'temperature': 0, 'seed': 42}

After re-running your agent with the same config, compare results:

diff = f.replay_diff("order-123", "order-123-replay")
print(f"Matching: {diff['matching']}")

for d in diff['divergences']:
    print(f"Step {d['step']}: {d['type']}")
    print(f"  Original: {d['original']['action']}")
    print(f"  Replay:   {d['replay']['action']}")

Output:

Matching: False
Step 2: diverged
  Original: tool_result → {"results": [{"name": "Logitech", "price": 45}]}
  Replay:   tool_result → {"results": [{"name": "Logitech", "price": 45}, {"name": "Razer", "price": 39}]}
Step 4: diverged
  Original: tool_result → {"status": "SUCCESS", "price": 45}
  Replay:   tool_result → {"status": "SUCCESS", "price": 39}

Now you can see: the search results changed between runs (Razer appeared), which caused the agent to pick a different product. The divergence started at step 2, not step 4. That's the root cause.

For LangChain and OpenAI Agents SDK users, model config capture is automatic.

3. Eval Hooks — Failure Auto-Classification

This is the big one.

Instead of reading through forensic reports manually, Agent Forensics now auto-classifies failure patterns across your traces:

failures = f.classify()
for fail in failures:
    print(f"[{fail['severity']}] {fail['type']}")
    print(f"  {fail['description']}")

Output from a real trace:

[HIGH] HALLUCINATED_TOOL_OUTPUT
  Tool returned an error but agent proceeded without acknowledging it

[HIGH] MISSING_APPROVAL
  Critical action 'purchase_cheapest' taken without guardrail check

[HIGH] SILENT_SUBSTITUTION
  Agent may have substituted the requested item without explicit user approval

[MEDIUM] PROMPT_DRIFT_CAUSED
  Decision 'purchase_cheapest' made right after prompt drift

[MEDIUM] REPEATED_FAILURE
  Tool 'search_api' failed 2 out of 3 attempts

[MEDIUM] RETRIEVAL_MISMATCH
  Retrieved context has low similarity score (0.45)

Six failure patterns, auto-detected:

Pattern	Severity	What It Catches
`HALLUCINATED_TOOL_OUTPUT`	HIGH	Agent ignored a tool error and kept going
`MISSING_APPROVAL`	HIGH	Purchase/delete/send without guardrail check
`SILENT_SUBSTITUTION`	HIGH	Output differs from user's request, no approval
`PROMPT_DRIFT_CAUSED`	MEDIUM	Decision right after system prompt changed
`REPEATED_FAILURE`	MEDIUM	Same failing action retried without changing approach
`RETRIEVAL_MISMATCH`	MEDIUM	Low-similarity RAG context used

And you can aggregate across sessions:

stats = f.failure_stats()
print(f"Total failures: {stats['total_failures']}")
print(f"By severity: {stats['by_severity']}")
# → {'HIGH': 4, 'MEDIUM': 3, 'LOW': 0}

This shifts the tool from "debug this one incident" to "what patterns keep happening across all my agents?"

The Full Picture: v0.1 → v0.3

Version	What It Does
v0.1	Records decisions. Generates timeline. "What happened?"
v0.2	Tracks context injections. Detects prompt drift. "What influenced the decision?"
v0.3	Guardrails. Replay. Auto-classification. "Was it approved? Can I reproduce it? What pattern is this?"

The forensic report now includes:

Timeline
Decision Chain
Causal Chain (Root Cause Analysis)
Failure Classification          ← NEW
Prompt Drift Analysis
Context Injections
Tool Usage Summary
Compliance Notes (with guardrail stats)  ← UPDATED

What Community Feedback Looks Like in Practice

Every major feature in v0.2 and v0.3 came from Reddit comments:

"Do you capture the full prompt state?" → Context injection tracking + prompt drift detection (v0.2)
"Store model config for replay" → Deterministic replay with model name, temperature, seed (v0.3)
"Log intent + allow/deny at checkpoints" → Guardrail checkpoints (v0.3)
"Auto-label failure modes" → 6-pattern failure classifier (v0.3)

This isn't a side project that gets updated once and abandoned. The roadmap is driven by people actually building agents in production.

What's Next: v0.4

The commenter's original challenge still stands:

"The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

v0.3's failure classifier catches the symptoms (silent substitution, hallucinated output). v0.4 aims to catch the cause: real-time ambiguity detection.

Imagine an agent flagging: "I have conflicting priorities — 'buy what the user asked for' vs 'buy the cheapest' — and I'm about to resolve this without asking. Confidence in my chosen interpretation: 62%."

That's where this is going.

Try v0.3

pip install --upgrade agent-forensics

from agent_forensics import Forensics

f = Forensics(session="order-123")

# Record (auto or manual)
agent.invoke(..., config={"callbacks": [f.langchain()]})

# Guardrails
f.guardrail(intent="...", action="...", allowed=False, reason="...")

# Classify failures
failures = f.classify()

# Replay
config = f.get_replay_config("order-123")
diff = f.replay_diff("order-123", "order-123-replay")

# Report (now includes failure classification)
print(f.report())

GitHub: github.com/ilflow4592/agent-forensics
PyPI: pip install agent-forensics (v0.3.0)

Three versions. Every feature from community feedback. That's what happens when you build something people actually need.

"The Agent Didn't Decide Wrong. The Instructions Were Conflicting — and Nobody Noticed."

Ilya Denisov — Sat, 28 Mar 2026 05:03:48 +0000

Last week I posted Agent Forensics on Reddit — an open-source black box that records every decision an AI agent makes and generates forensic reports when things go wrong.

The response was great. But one comment stopped me cold:

"The magic mouse example is perfect because it shows the real problem isn't that the agent 'decided wrong' — it's that the system prompt had conflicting priorities. 'Buy what the user asked for' vs 'find the best deal,' and the model resolved the ambiguity silently."

And then the part that really hit:

"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

They were right. And my tool couldn't do that yet.

The Blind Spot

Let me show you exactly what was missing.

In v0.1, when the shopping agent bought a Logitech instead of an Apple Magic Mouse, the forensic trail showed:

[DECISION] search_products("Apple Magic Mouse")
  → [TOOL] search_api → ERROR: not found

[DECISION] search_products("Apple wireless mouse")
  → [TOOL] search_api → OK: 3 products found

[DECISION] compare_prices → Logitech M750 cheapest

[DECISION] purchase("Logitech M750") → SUCCESS

[FINAL] "Purchased Logitech M750 for $45"

You could see what happened. But you couldn't see:

What context influenced the decision? Was there a RAG document that said "always prioritize price"? A customer profile that changed the agent's behavior?
Did the instructions change mid-conversation? Did a dynamically injected rule override the user's original request?
Where exactly did the prompt contain conflicting priorities? The system prompt said "buy the cheapest" AND "follow the user's request." The agent silently chose one over the other.

The decision log told you what the agent did. It didn't tell you what the agent was looking at when it decided.

Two New Features: Context Injection Tracking + Prompt Drift Detection

Context Injection: "What Was the Agent Looking At?"

AI agents don't operate in a vacuum. They receive context from vector databases, memory stores, customer profiles, retrieved documents. This context directly influences decisions — but in v0.1, it was invisible.

Now you can trace it:

from agent_forensics import Forensics

f = Forensics(session="support-ticket-7831")

# Agent looks up customer info
f.decision("lookup_customer", input={"id": "C-9921"},
           reasoning="Customer asked about refund policy")

# RAG context injected — this is new
f.context_injection("vector_db", content={
    "document": "refund_policy_v2.md",
    "chunk": "Refunds available within 30 days for all items except digital goods.",
    "similarity_score": 0.92,
}, reasoning="Retrieved refund policy from vector store")

# Customer history injected from a different source
f.context_injection("memory_store", content={
    "customer_history": "3 previous refund requests in last 6 months",
    "risk_flag": "high_refund_frequency",
}, reasoning="Customer flagged as high refund frequency")

The forensic report now shows exactly which documents and data sources influenced each decision:

[DECISION] lookup_customer
  Reasoning: Customer asked about refund policy

[CONTEXT] vector_db
  Injected: refund_policy_v2.md (similarity: 0.92)

[CONTEXT] memory_store
  Injected: customer flagged as high_refund_frequency

When a decision goes wrong, you can now answer: "This decision was influenced by this specific document retrieved from this specific source with this confidence score."

Prompt Drift Detection: "Did the Rules Change Mid-Game?"

This is the one the Reddit comment was really about.

In multi-step agent workflows, the system prompt can change between steps. A new instruction gets injected. A guardrail gets added. A policy rule gets modified based on context. This is prompt drift — and it happens silently.

# Step 1: Agent starts with this system prompt
f.prompt_state(
    "You are a customer support agent. Follow company policy. Be polite."
)

# ... agent does some work ...

# Step 5: System prompt has changed — new rule was injected
f.prompt_state(
    "You are a customer support agent. Follow company policy. Be polite. "
    "NOTE: For customers flagged as high-refund-frequency, "
    "require manager approval before processing refunds."
)

Agent Forensics automatically detects the change and flags it:

[*** PROMPT DRIFT ***]
  + Added: "NOTE: For customers flagged as high-refund-frequency,
    require manager approval before processing refunds."
  - Removed: (original prompt without the NOTE)

[DECISION] deny_refund
  Reasoning: Customer is high-refund-frequency, requires manager approval

The causal chain is now complete: Customer history was retrieved → customer was flagged → new instruction was injected into the prompt → agent denied the refund. Every link is visible.

For LangChain and OpenAI Agents SDK users, prompt drift detection is automatic — no manual calls needed. The integration hooks track the system prompt at each LLM call and flag any changes.

The Full Picture: Before and After

Here's the same incident, v0.1 vs v0.2:

v0.1 — You could see what happened:

[DECISION] lookup_customer
[TOOL] policy_lookup → "30-day refund policy"
[DECISION] deny_refund
[FINAL] "Your refund requires manager approval"

Question: Why did the agent deny a refund when the policy says 30-day refunds are available?
Answer: No idea. Something happened between the policy lookup and the denial.

v0.2 — You can see why it happened:

[DECISION] lookup_customer
  Reasoning: Customer asked about refund policy

[CONTEXT] vector_db
  Injected: refund_policy_v2.md — "30-day refund for all items"

[TOOL] policy_lookup → "30-day refund policy"

[CONTEXT] memory_store
  Injected: customer flagged as high_refund_frequency

[*** PROMPT DRIFT ***]
  + Added: "require manager approval for high-refund-frequency customers"

[DECISION] deny_refund
  Reasoning: New instruction requires manager approval for this customer

[FINAL] "Your refund requires manager approval"

Question: Why did the agent deny a refund when the policy says 30-day refunds are available?
Answer: Because a customer risk flag triggered a dynamic instruction injection that added a manager-approval requirement. The refund policy was overridden by a prompt modification at step 5.

That's the difference between a decision log and actual forensics.

What's Next: Ambiguity Detection

The Reddit commenter's deeper point is still unresolved:

"The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

Context injection and prompt drift solve the traceability problem — you can now reconstruct the full chain after the fact. But detecting ambiguity in real time, before the agent acts, is the next frontier.

Imagine a system that flags: "This decision resolved conflicting priorities — 'buy what the user asked for' vs 'find the cheapest option' — without explicit prioritization. Confidence in chosen interpretation: 62%."

That's where I want to take this. If you're interested in collaborating on ambiguity detection, open an issue or reach out.

Try v0.2

pip install --upgrade agent-forensics

from agent_forensics import Forensics

f = Forensics(session="order-123")

# Record context injections
f.context_injection("vector_db", content={...})

# Track prompt state (auto drift detection)
f.prompt_state("Your system prompt here...")

# LangChain/OpenAI Agents SDK: drift detection is automatic
agent.invoke(..., config={"callbacks": [f.langchain()]})

# Full report with context + drift analysis
print(f.report())

GitHub: github.com/ilflow4592/agent-forensics
PyPI: pip install agent-forensics (v0.2.0)
What's new: v0.2.0 changelog

The best open-source projects are shaped by their community. One Reddit comment turned into two new features shipped in 24 hours. Keep the feedback coming.

Your AI Agent Just Made a $50K Mistake. Can You Explain Why?

Ilya Denisov — Sat, 28 Mar 2026 00:43:44 +0000

AI Agents Are Making Decisions. Nobody's Tracking Why.

In March 2026, Meta had a Sev-1 incident. An AI agent posted internal data to unauthorized engineers for two hours. The scariest part wasn't the leak itself — it was that the team couldn't reconstruct why the agent decided to do it.

This isn't an isolated case:

A shopping agent asked to check egg prices decided to buy them instead. No one approved it.
A customer support bot gave a customer a completely fabricated explanation for a billing error — with confidence.
A shopping agent tasked with buying an Apple Magic Mouse bought a Logitech instead because "it was cheaper." The user never asked for the cheapest option.

These aren't hypothetical risks. They're happening now. And every time, the same question comes up:

"Why did the agent do that?"

And every time, the same answer: "We don't know."

Monitoring ≠ Forensics

Here's the thing — tools like Datadog, Arize, and Langfuse are great at watching agents in real time. But when something goes wrong, the question changes from "is it working?" to "why did it fail?"

That's a fundamentally different question.

	Monitoring	Forensics
When	Real-time	Post-incident
Question	"Is it working?"	"Why did it fail?"
Output	Alerts, dashboards	Decision timeline, causal chain
Audience	Engineering team	Legal, compliance, regulators
Analogy	Security camera	Airplane black box

There was no tool that answered the forensics question. So I built one.

What the Black Box Shows You

Your shopping agent receives: "Buy me an Apple Magic Mouse."

The agent runs. A few seconds later, it comes back: "Purchased Logitech M750 for $45."

Wait — the user asked for an Apple Magic Mouse. What happened?

Pull the black box:

[DECISION] search_products("Apple Magic Mouse")
  → [TOOL] search_api → ERROR: product not found

[DECISION] retry with broader query "Apple wireless mouse"
  → [TOOL] search_api → OK: 3 products found

[DECISION] compare_prices
  → Logitech M750 is cheapest ($45)

[DECISION] purchase("Logitech M750")
  → SUCCESS — user never asked for this product

[FINAL] "Purchased Logitech M750 for $45"

Now you can see it: decision point 3 is where things went wrong. The agent's standing instructions said "buy the cheapest," which overrode the user's specific product request. That's a fixable bug — and without the trail, it's invisible.

This is the kind of evidence that:

Engineers need to fix the agent's behavior
Legal teams need to assess liability
Compliance teams need to report to regulators

Why This Matters Right Now

On August 2, 2026, the full weight of EU AI Act high-risk requirements takes effect:

Up to €35M or 7% of global annual turnover for the most serious violations
Up to €15M or 3% for non-compliance with high-risk AI obligations
Market surveillance authorities can order non-compliant systems withdrawn from the market

Article 14 specifically requires human oversight — the ability to understand and trace AI system decisions. You need documentation proving:

What decision was made
What information led to that decision
What alternatives were considered
Why this specific action was chosen

"We didn't track it" is not a valid defense.

How It Works

Install:

pip install agent-forensics

Attach to your agent (one line):

from agent_forensics import Forensics

f = Forensics(session="order-123")

# LangChain
agent.invoke(..., config={"callbacks": [f.langchain()]})

# OpenAI Agents SDK
agent = Agent(hooks=f.openai_agents())

# CrewAI
Agent(step_callback=f.crewai().step_callback)

# Or any custom agent
f.decision("search", input={"query": "mouse"}, reasoning="User requested search")
f.tool_call("api", input={...}, output={...})

Get reports:

# Markdown report — full timeline + decision chain + root cause
print(f.report())

# Save files
f.save_markdown()   # → forensics-report-order-123.md
f.save_pdf()        # → forensics-report-order-123.pdf

# Visual dashboard
f.dashboard(port=8080)  # → http://localhost:8080

The dashboard gives you a visual timeline with color-coded events, session comparison, and causal chain visualization:

What You Get

Decision timeline — every action in chronological order
Decision chain — each choice with its reasoning
Causal chain — "A led to B, which caused C to fail"
Incident detection — automatic error and failure identification
Compliance reports — Markdown + PDF, ready for regulators
Web dashboard — visual session browser with incident highlighting

No vendor lock-in. No cloud dependency. SQLite event store that runs anywhere. MIT licensed.

Try It

EU AI Act enforcement is 4 months away. If you're running AI agents in production, the time to add forensic tracing is now.

GitHub: github.com/ilflow4592/agent-forensics
Install: pip install agent-forensics
Contribute: Issues and PRs welcome

The agents are getting smarter. The question is whether we can explain what they're doing.

What's the worst AI agent failure you've seen? I'd love to hear your stories in the comments.