
Ilya Denisov

"The Agent Didn't Decide Wrong. The Instructions Were Conflicting — and Nobody Noticed."

Last week I posted Agent Forensics on Reddit — an open-source black box that records every decision an AI agent makes and generates forensic reports when things go wrong.

The response was great. But one comment stopped me cold:

"The magic mouse example is perfect because it shows the real problem isn't that the agent 'decided wrong' — it's that the system prompt had conflicting priorities. 'Buy what the user asked for' vs 'find the best deal,' and the model resolved the ambiguity silently."

And then the part that really hit:

"A decision log helps you find these after the fact, but the harder problem is preventing them. The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

They were right. And my tool couldn't do that yet.


The Blind Spot

Let me show you exactly what was missing.

In v0.1, when the shopping agent bought a Logitech instead of an Apple Magic Mouse, the forensic trail showed:

[DECISION] search_products("Apple Magic Mouse")
  → [TOOL] search_api → ERROR: not found

[DECISION] search_products("Apple wireless mouse")
  → [TOOL] search_api → OK: 3 products found

[DECISION] compare_prices → Logitech M750 cheapest

[DECISION] purchase("Logitech M750") → SUCCESS

[FINAL] "Purchased Logitech M750 for $45"

You could see what happened. But you couldn't see:

  1. What context influenced the decision? Was there a RAG document that said "always prioritize price"? A customer profile that changed the agent's behavior?
  2. Did the instructions change mid-conversation? Did a dynamically injected rule override the user's original request?
  3. Where exactly did the prompt contain conflicting priorities? The system prompt said "buy the cheapest" AND "follow the user's request." The agent silently chose one over the other.

The decision log told you what the agent did. It didn't tell you what the agent was looking at when it decided.


Two New Features: Context Injection Tracking + Prompt Drift Detection

Context Injection: "What Was the Agent Looking At?"

AI agents don't operate in a vacuum. They receive context from vector databases, memory stores, customer profiles, retrieved documents. This context directly influences decisions — but in v0.1, it was invisible.

Now you can trace it:

from agent_forensics import Forensics

f = Forensics(session="support-ticket-7831")

# Agent looks up customer info
f.decision("lookup_customer", input={"id": "C-9921"},
           reasoning="Customer asked about refund policy")

# RAG context injected — this is new
f.context_injection("vector_db", content={
    "document": "refund_policy_v2.md",
    "chunk": "Refunds available within 30 days for all items except digital goods.",
    "similarity_score": 0.92,
}, reasoning="Retrieved refund policy from vector store")

# Customer history injected from a different source
f.context_injection("memory_store", content={
    "customer_history": "3 previous refund requests in last 6 months",
    "risk_flag": "high_refund_frequency",
}, reasoning="Customer flagged as high refund frequency")

The forensic report now shows exactly which documents and data sources influenced each decision:

[DECISION] lookup_customer
  Reasoning: Customer asked about refund policy

[CONTEXT] vector_db
  Injected: refund_policy_v2.md (similarity: 0.92)

[CONTEXT] memory_store
  Injected: customer flagged as high_refund_frequency

When a decision goes wrong, you can now answer: "This decision was influenced by this specific document retrieved from this specific source with this confidence score."
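In data terms, a context injection is just another event on the session timeline, which makes "what influenced this decision" a simple filter over the trail. A hypothetical sketch of that lookup (the `Event` shape and `influences` helper are my invention for illustration, not the library's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str     # "decision" or "context"
    source: str   # e.g. "vector_db", "memory_store", "agent"
    summary: str  # short description of what happened

def influences(trail: list[Event], decision_summary: str) -> list[Event]:
    """Return every context injection recorded before the named decision.

    Walks the trail in order, collecting context events until the
    matching decision is reached; if the decision never appears,
    all collected contexts are returned.
    """
    found = []
    for event in trail:
        if event.kind == "decision" and decision_summary in event.summary:
            return found
        if event.kind == "context":
            found.append(event)
    return found

trail = [
    Event("context", "vector_db", "refund_policy_v2.md (similarity 0.92)"),
    Event("context", "memory_store", "high_refund_frequency flag"),
    Event("decision", "agent", "deny_refund"),
]
for ctx in influences(trail, "deny_refund"):
    print(ctx.source, "->", ctx.summary)
```

The real report does this correlation for you; the point is that once injections are first-class events, tracing influence is a query, not an archaeology dig.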

Prompt Drift Detection: "Did the Rules Change Mid-Game?"

This is the one the Reddit comment was really about.

In multi-step agent workflows, the system prompt can change between steps. A new instruction gets injected. A guardrail gets added. A policy rule gets modified based on context. This is prompt drift — and it happens silently.

# Step 1: Agent starts with this system prompt
f.prompt_state(
    "You are a customer support agent. Follow company policy. Be polite."
)

# ... agent does some work ...

# Step 5: System prompt has changed — new rule was injected
f.prompt_state(
    "You are a customer support agent. Follow company policy. Be polite. "
    "NOTE: For customers flagged as high-refund-frequency, "
    "require manager approval before processing refunds."
)

Agent Forensics automatically detects the change and flags it:

[*** PROMPT DRIFT ***]
  + Added: "NOTE: For customers flagged as high-refund-frequency,
    require manager approval before processing refunds."
  - Removed: (original prompt without the NOTE)

[DECISION] deny_refund
  Reasoning: Customer is high-refund-frequency, requires manager approval

The causal chain is now complete: Customer history was retrieved → customer was flagged → new instruction was injected into the prompt → agent denied the refund. Every link is visible.
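Conceptually, drift detection boils down to diffing consecutive snapshots of the system prompt. A minimal sketch of that idea using Python's standard difflib (an illustration of the mechanism, not the actual agent-forensics internals):

```python
import difflib

def detect_drift(previous: str, current: str) -> list[str]:
    """Diff two consecutive system-prompt snapshots and report the
    added/removed fragments, in the spirit of the drift report above."""
    changes = []
    matcher = difflib.SequenceMatcher(a=previous, b=current)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            changes.append(f"- Removed: {previous[a0:a1]!r}")
        if op in ("replace", "insert"):
            changes.append(f"+ Added: {current[b0:b1]!r}")
    return changes

base = "You are a customer support agent. Follow company policy. Be polite."
drifted = base + (
    " NOTE: For customers flagged as high-refund-frequency,"
    " require manager approval before processing refunds."
)
for change in detect_drift(base, drifted):
    print(change)  # one "+ Added: ..." line containing the injected NOTE
```

An identical pair produces no changes, so snapshots can be compared cheaply on every LLM call and the report only gains a drift entry when the prompt actually moved.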

For LangChain and OpenAI Agents SDK users, prompt drift detection is automatic — no manual calls needed. The integration hooks track the system prompt at each LLM call and flag any changes.


The Full Picture: Before and After

Here's the same incident, v0.1 vs v0.2:

v0.1 — You could see what happened:

[DECISION] lookup_customer
[TOOL] policy_lookup → "30-day refund policy"
[DECISION] deny_refund
[FINAL] "Your refund requires manager approval"

Question: Why did the agent deny a refund when the policy says 30-day refunds are available?
Answer: No idea. Something happened between the policy lookup and the denial.

v0.2 — You can see why it happened:

[DECISION] lookup_customer
  Reasoning: Customer asked about refund policy

[CONTEXT] vector_db
  Injected: refund_policy_v2.md — "30-day refund for all items"

[TOOL] policy_lookup → "30-day refund policy"

[CONTEXT] memory_store
  Injected: customer flagged as high_refund_frequency

[*** PROMPT DRIFT ***]
  + Added: "require manager approval for high-refund-frequency customers"

[DECISION] deny_refund
  Reasoning: New instruction requires manager approval for this customer

[FINAL] "Your refund requires manager approval"

Question: Why did the agent deny a refund when the policy says 30-day refunds are available?
Answer: Because a customer risk flag triggered a dynamic instruction injection that added a manager-approval requirement. The refund policy was overridden by a prompt modification at step 5.

That's the difference between a decision log and actual forensics.


What's Next: Ambiguity Detection

The Reddit commenter's deeper point is still unresolved:

"The most useful insight isn't 'what went wrong' — it's 'where did the model encounter ambiguity and pick one interpretation without flagging it.'"

Context injection and prompt drift solve the traceability problem — you can now reconstruct the full chain after the fact. But detecting ambiguity in real time, before the agent acts, is the next frontier.

Imagine a system that flags: "This decision resolved conflicting priorities — 'buy what the user asked for' vs 'find the cheapest option' — without explicit prioritization. Confidence in chosen interpretation: 62%."
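One plausible shape for that check: score each candidate interpretation, then flag the decision whenever the top two scores land close together. A hypothetical sketch (`flag_ambiguity` and its `margin` threshold are my invention; nothing like this ships in v0.2):

```python
def flag_ambiguity(interpretations: dict[str, float],
                   margin: float = 0.15) -> dict:
    """Flag a decision as ambiguous when the two best-scoring
    interpretations land within `margin` of each other.

    `interpretations` maps each candidate reading of the instructions
    to a confidence score in [0, 1].
    """
    ranked = sorted(interpretations.items(), key=lambda kv: kv[1], reverse=True)
    (chosen, top), (_, runner_up) = ranked[0], ranked[1]
    return {
        "chosen": chosen,
        "confidence": top,
        "ambiguous": (top - runner_up) < margin,
    }

print(flag_ambiguity({
    "buy what the user asked for": 0.62,
    "find the cheapest option": 0.55,
}))  # flagged as ambiguous: the scores are only 0.07 apart
```

Where the scores come from is the hard part, of course. Asking the model to enumerate and rate its own interpretations is one option; that's exactly the open design question.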

That's where I want to take this. If you're interested in collaborating on ambiguity detection, open an issue or reach out.


Try v0.2

pip install --upgrade agent-forensics

from agent_forensics import Forensics

f = Forensics(session="order-123")

# Record context injections
f.context_injection("vector_db", content={...})

# Track prompt state (auto drift detection)
f.prompt_state("Your system prompt here...")

# LangChain/OpenAI Agents SDK: drift detection is automatic
agent.invoke(..., config={"callbacks": [f.langchain()]})

# Full report with context + drift analysis
print(f.report())

The best open-source projects are shaped by their community. One Reddit comment turned into two new features shipped in 24 hours. Keep the feedback coming.
