Michal Szalinski

Posted on Jun 8 • Originally published at archonhq.ai

Why Your AI Agent Works in Dev and Breaks in Prod

#crucible #ai #devops #agents

Your agent nailed every test case. You shipped it. Within 48 hours, users report hallucinated outputs, silently dropped tool calls, and responses that bear zero resemblance to what worked on your machine. You reload the same prompt locally. It works perfectly. Welcome to the most predictable failure mode in AI engineering: the dev-to-prod gap.

This is Crucible C01. We dissect the five failure modes that kill agents in production and give you the tools to catch them before your users do.

The Idea (60 Seconds)

Developers test agents in idealized conditions: deterministic inputs, warm context windows, generous API latency budgets, and sequential tool calls. Production exposes the opposite environment: cold starts strip context, rate limits compress timing, and parallel calls introduce race conditions. The agent that performed flawlessly at temperature 0 on a 2k-token context window collapses at temperature 0.7 on an 8k-token window.

The five failure modes are temperature drift, context window overflow, silent API errors, prompt drift, and race conditions in parallel tool calls. Each one has a detection pattern and a fix. This article delivers both, plus a CLI tool to automate the detection.

Why This Matters

AI agent failures differ from traditional software failures in one critical way: they are stochastic. A web API either returns 200 or 500. An AI agent returns something that looks plausible 90% of the time and is catastrophically wrong 10% of the time. That 10% is invisible in manual testing and devastating in production.

The economics compound fast because every failed agent interaction wastes tokens, and wasted tokens cost money. At scale, a subtly broken agent burns budget faster than a working one because it retries, loops, and rephrases instead of succeeding. A single temperature drift bug can double your API spend.

Reliability is the differentiator. The market is flooding with AI wrappers. The ones that survive will be the ones that work consistently, under load, with real user inputs. Crucible exists to make your agent one of the survivors.

Walkthrough

Mode 1: Temperature Drift

Detection pattern: Run the same prompt several times at your dev temperature and at your prod temperature. Hash the outputs. Hashes that agree in dev but diverge in prod signal drift.
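A minimal sketch of that hash comparison, assuming you can wrap your model call in a function that takes a prompt and a temperature. The names `output_hash`, `detect_drift`, and `fake_generate` are hypothetical, and `fake_generate` is a stand-in for a real provider call, not part of agentprobe.

```python
import hashlib
from typing import Callable


def output_hash(text: str) -> str:
    """Hash an output after collapsing whitespace, so formatting noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_drift(generate: Callable[[str, float], str], prompt: str,
                 dev_temp: float, prod_temp: float, runs: int = 5) -> dict:
    """Run the prompt at both temperatures and compare the sets of output hashes."""
    dev_hashes = {output_hash(generate(prompt, dev_temp)) for _ in range(runs)}
    prod_hashes = {output_hash(generate(prompt, prod_temp)) for _ in range(runs)}
    return {
        "dev_unique": len(dev_hashes),
        "prod_unique": len(prod_hashes),
        "drift": dev_hashes != prod_hashes,
    }


# Hypothetical stand-in for a real model call (assumption, for illustration only).
def fake_generate(prompt: str, temperature: float) -> str:
    return "stable answer" if temperature == 0 else f"answer at {temperature}"


report = detect_drift(fake_generate, "Analyze the Q3 report", 0.0, 0.7)
```

Hashing a normalized string keeps trivial whitespace differences from registering as drift. Whether that normalization is too loose or too strict depends on how your downstream code consumes the output.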

The fix: Pin temperature to 0 in both environments. If you need sampling variance for creativity, isolate it to a single generation step and wrap the rest of the pipeline in deterministic calls. Document the temperature in your agent config file. Treat it like a database connection string: an infrastructure parameter, always explicit, zero room for runtime guesses.

Mode 2: Context Window Overflow

Detection pattern: Instrument your agent to log cumulative token count per conversation. When it crosses 75% of your model’s context limit, flag the conversation. Watch for truncated outputs, repeated phrases, or instructions that the model appears to have forgotten.
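One way to sketch the 75% flag, assuming you already have a token count for each turn (from your provider's usage field or your own tokenizer). `ContextMonitor` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextMonitor:
    """Tracks cumulative tokens for one conversation and flags it near the limit."""
    context_limit: int          # model context limit, in tokens
    warn_ratio: float = 0.75    # flag threshold taken from the detection pattern
    total_tokens: int = 0
    flagged: bool = False

    def record_turn(self, tokens_added: int) -> bool:
        """Add one turn's tokens and return True once the conversation is flagged."""
        self.total_tokens += tokens_added
        if self.total_tokens >= self.warn_ratio * self.context_limit:
            self.flagged = True
        return self.flagged


monitor = ContextMonitor(context_limit=8000)
first = monitor.record_turn(3000)   # 3000 of 8000 tokens used
second = monitor.record_turn(3000)  # 6000 of 8000 tokens used, at the 75% line
```

The flag is sticky on purpose: once a conversation has come close to the limit, you probably want it reviewed even if later compaction brings the count back down.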

The fix: Implement a context compaction strategy. Summarize older turns and replace them with a single compressed summary block. Set hard token budgets per turn and per conversation. When the budget is exhausted, either summarize or start a fresh context window with a recovery prompt that preserves the task state.
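The compaction step might look roughly like this. It is a sketch under stated assumptions: `compact_history` is a hypothetical helper, and the `count_tokens` and `summarize` callables below are toy stand-ins for a real tokenizer and a real summarization call.

```python
from typing import Callable, List


def compact_history(turns: List[str], budget: int,
                    count_tokens: Callable[[str], int],
                    summarize: Callable[[List[str]], str],
                    keep_recent: int = 2) -> List[str]:
    """Replace older turns with one summary block when the history exceeds the budget."""
    total = sum(count_tokens(turn) for turn in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [summarize(older)] + recent


# Toy stand-ins (assumptions): whitespace token counting and a placeholder summary.
def count_words(text: str) -> int:
    return len(text.split())


def placeholder_summary(older: List[str]) -> str:
    return f"[summary of {len(older)} earlier turns]"


history = ["a b c d", "e f g h", "i j", "k l"]
compacted = compact_history(history, budget=6,
                            count_tokens=count_words,
                            summarize=placeholder_summary)
```

Keeping the most recent turns verbatim is a design choice: the latest instructions are the ones the model most needs word for word, while older turns usually survive summarization.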

Mode 3: Silent API Errors

Detection pattern: Log every API call’s HTTP status code and response body. Count calls that return non-200 statuses. If your agent has retry logic, log whether the retry succeeded. A pattern of failed retries with continued execution signals swallowed errors.

The fix: Treat API errors as hard failures by default. Wrap every API call in a circuit breaker that halts the agent on persistent errors. Log the error, notify the orchestrator, and return a structured failure to the caller. Silent continuation on error state is the single most dangerous production behavior in any agent system.
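A stripped-down circuit breaker along those lines. This is a sketch, not a production implementation: `CircuitBreaker`, the result dictionary shape, and the `max_failures` default are all assumptions, and it omits the timed reset that real breakers usually have.

```python
import logging

logger = logging.getLogger("agent.circuit_breaker")


class CircuitBreaker:
    """Counts consecutive failures and refuses further calls once the limit is hit."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn, *args, **kwargs) -> dict:
        """Run fn and return a structured result instead of swallowing errors."""
        if self.is_open:
            return {"ok": False, "halted": True, "error": "circuit open"}
        try:
            value = fn(*args, **kwargs)
        except Exception as exc:
            self.failures += 1
            logger.error("API call failed (%d/%d): %r",
                         self.failures, self.max_failures, exc)
            return {"ok": False, "halted": self.is_open, "error": repr(exc)}
        self.failures = 0  # a success resets the consecutive-failure count
        return {"ok": True, "value": value}


def flaky_api():
    """Hypothetical failing call, used only to exercise the breaker."""
    raise TimeoutError("upstream timeout")


breaker = CircuitBreaker(max_failures=2)
r1 = breaker.call(flaky_api)
r2 = breaker.call(flaky_api)
r3 = breaker.call(lambda: "never runs")  # circuit is open, fn is not invoked
```

The caller always gets an explicit `ok` flag, so continuing on an error state requires a deliberate decision rather than an accident.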

Mode 4: Prompt Drift

Detection pattern: Version your system prompts. On every agent run, hash the active prompt and compare it to the canonical hash. When outputs diverge between runs, diff the prompt versions first.
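The per-run hash check fits in a few lines. In this sketch, `prompt_hash` and `verify_prompt` are hypothetical names, and the canonical hash is assumed to be stored alongside the prompt in version control.

```python
import hashlib


def prompt_hash(prompt: str) -> str:
    """SHA-256 of the exact prompt bytes; any edit changes the hash."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def verify_prompt(active_prompt: str, canonical_hash: str) -> None:
    """Raise if the prompt in use differs from the reviewed, canonical version."""
    actual = prompt_hash(active_prompt)
    if actual != canonical_hash:
        raise RuntimeError(
            f"prompt drift: expected {canonical_hash[:12]}, got {actual[:12]}"
        )


# Example prompt text is illustrative only.
CANONICAL = prompt_hash("You are a careful financial analyst.")

verify_prompt("You are a careful financial analyst.", CANONICAL)  # passes silently

try:
    verify_prompt("You are a careful financial analyst. Be brief.", CANONICAL)
    drift_detected = False
except RuntimeError:
    drift_detected = True
```

Unlike the output hashing in Mode 1, no normalization is applied here. Even a whitespace edit to a system prompt is a change worth surfacing.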

The fix: Lock system prompts in version control. Deploy prompt changes through the same review pipeline as code changes. Run regression tests: execute a benchmark suite against the old prompt, then the new prompt, and diff the results. Any change that shifts more than 10% of benchmark outputs requires manual review.
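The 10% gate can be sketched as a comparison of two benchmark result sets keyed by test case. The function names and the exact-match comparison are assumptions; a real suite may want a semantic comparison instead of string equality.

```python
from typing import Dict


def changed_fraction(old_outputs: Dict[str, str], new_outputs: Dict[str, str]) -> float:
    """Fraction of benchmark cases whose output differs between two prompt versions."""
    if old_outputs.keys() != new_outputs.keys():
        raise ValueError("both runs must cover the same benchmark cases")
    if not old_outputs:
        return 0.0
    changed = sum(1 for case in old_outputs if old_outputs[case] != new_outputs[case])
    return changed / len(old_outputs)


def needs_manual_review(old_outputs: Dict[str, str], new_outputs: Dict[str, str],
                        threshold: float = 0.10) -> bool:
    """True when more than `threshold` of the benchmark outputs shifted."""
    return changed_fraction(old_outputs, new_outputs) > threshold


old_run = {f"case_{i}": "expected output" for i in range(10)}

one_change = dict(old_run, case_0="different output")
two_changes = dict(one_change, case_1="different output")
```

With ten cases, one changed output sits exactly at 10% and passes, while two changed outputs trip the review gate.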

Mode 5: Race Conditions in Parallel Tool Calls

Detection pattern: Enable request-order logging. When your agent dispatches parallel calls, log the dispatch order and the completion order. Any inversion signals a potential race condition.
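Given the two logged orders, finding inversions is a pairwise comparison. `find_inversions` is a hypothetical helper, and the call IDs are assumed to be unique strings written by your own logging.

```python
from typing import List, Tuple


def find_inversions(dispatch_order: List[str],
                    completion_order: List[str]) -> List[Tuple[str, str]]:
    """Return (dispatched_first, finished_first) pairs where a later call overtook an earlier one."""
    position = {call_id: index for index, call_id in enumerate(dispatch_order)}
    inversions = []
    for i, first_done in enumerate(completion_order):
        for second_done in completion_order[i + 1:]:
            # first_done finished earlier but was dispatched later: an inversion.
            if position[first_done] > position[second_done]:
                inversions.append((second_done, first_done))
    return inversions


dispatched = ["fetch_report", "fetch_prices", "fetch_news"]
completed = ["fetch_prices", "fetch_report", "fetch_news"]
inversions = find_inversions(dispatched, completed)
```

An inversion is only a potential race. It matters when the agent consumes results in arrival order, so treat the list as a set of places to inspect rather than confirmed bugs.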

The fix: Avoid parallel tool calls unless you can guarantee idempotent, order-independent results. When parallelism is necessary, implement a reconciliation step that sorts responses by a sequence token before the agent processes them. Better yet, use a deterministic execution model: serialize all tool calls, accept the latency cost, and gain correctness.
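The reconciliation step could be sketched with `asyncio`, tagging each call with a sequence number at dispatch and sorting on it before the agent sees anything. `call_tool` simulates a tool with a sleep, and the tool names and delays are made up for illustration.

```python
import asyncio
from typing import List, Tuple


async def call_tool(seq: int, name: str, delay: float) -> dict:
    """Simulated tool call; the delay stands in for real network latency."""
    await asyncio.sleep(delay)
    return {"seq": seq, "tool": name, "result": f"{name} done"}


async def run_parallel_reconciled(calls: List[Tuple[str, float]]) -> List[dict]:
    """Dispatch calls in parallel, collect them as they finish, then restore dispatch order."""
    tasks = [
        asyncio.create_task(call_tool(seq, name, delay))
        for seq, (name, delay) in enumerate(calls)
    ]
    arrived = []
    for finished in asyncio.as_completed(tasks):
        arrived.append(await finished)  # arrival order may differ from dispatch order
    return sorted(arrived, key=lambda response: response["seq"])


results = asyncio.run(
    run_parallel_reconciled([("search", 0.02), ("lookup", 0.0)])
)
```

In plain `asyncio`, `asyncio.gather` already returns results in submission order. The explicit sequence token earns its keep when responses arrive through queues, callbacks, or separate workers where no such guarantee exists.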

The Prompt Toolkit

1. Agent Failure Analyst Prompt

<role>
You are an Agent Failure Analyst for the Crucible diagnostic framework.
</role>

<input>
  <agent_architecture>
    {{AGENT_ARCHITECTURE_DESCRIPTION}}
  </agent_architecture>
  <failure_scenario>
    {{FAILURE_SCENARIO_DESCRIPTION}}
  </failure_scenario>
</input>

<task>
Analyze the agent architecture against the failure scenario. Identify which of the six failure modes are present or likely:

1. temperature_drift: Dev and prod temperature settings diverge.
2. context_overflow: Token count exceeds model context limit.
3. token_limit: Response truncation due to max_tokens ceiling.
4. prompt_drift: System prompt edits propagate uncontrolled cascading changes.
5. api_latency: Timeouts or rate limits cause silent failures.
6. race_condition: Parallel tool calls return out of order.

For each identified failure mode, provide:
- evidence: Specific architectural features or scenario details that indicate this mode.
- severity: critical, high, medium, low.
- reproduction_steps: Exact sequence to trigger the failure.
- fix_strategy: Concrete architectural change to eliminate the failure mode.
</task>

<output_format>
<analysis>
  <failure_mode name="..." present="true|false">
    <evidence>...</evidence>
    <severity>...</severity>
    <reproduction_steps>
      <step order="1">...</step>
      <step order="2">...</step>
    </reproduction_steps>
    <fix_strategy>...</fix_strategy>
  </failure_mode>
  <summary>...</summary>
  <priority_fixes>
    <fix order="1">...</fix>
    <fix order="2">...</fix>
  </priority_fixes>
</analysis>
</output_format>
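If the model follows the output format above, its response can be consumed programmatically. The sketch below assumes the response is well-formed XML with `<analysis>` as the root element. `parse_analysis` and the sample response are hypothetical, and real model output may need validation or repair before it parses.

```python
import xml.etree.ElementTree as ET
from typing import List

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def parse_analysis(xml_text: str) -> List[dict]:
    """Extract the failure modes marked present, sorted from most to least severe."""
    root = ET.fromstring(xml_text)
    modes = []
    for node in root.findall("failure_mode"):
        if node.get("present") != "true":
            continue
        modes.append({
            "name": node.get("name"),
            "severity": (node.findtext("severity") or "").strip(),
            "fix_strategy": (node.findtext("fix_strategy") or "").strip(),
        })
    # Unknown severity labels sort last.
    return sorted(modes, key=lambda m: SEVERITY_ORDER.get(m["severity"], len(SEVERITY_ORDER)))


# Made-up sample response, for illustration only.
sample = """<analysis>
  <failure_mode name="prompt_drift" present="true">
    <severity>medium</severity>
    <fix_strategy>Lock prompts in version control</fix_strategy>
  </failure_mode>
  <failure_mode name="race_condition" present="false">
    <severity>low</severity>
    <fix_strategy>n/a</fix_strategy>
  </failure_mode>
  <failure_mode name="temperature_drift" present="true">
    <severity>critical</severity>
    <fix_strategy>Pin temperature to 0</fix_strategy>
  </failure_mode>
</analysis>"""

present_modes = parse_analysis(sample)
```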

2. agentprobe CLI

The agentprobe command-line tool scans your agent configuration for common failure modes, traces live runs with full instrumentation, diffs two runs to locate divergence points, and replays failed traces to test determinism.

Install and run:

cp agentprobe.py /usr/local/bin/agentprobe
chmod +x /usr/local/bin/agentprobe
agentprobe scan --config agent_config.json
agentprobe trace --config agent_config.json --prompt "Analyze the Q3 report"
agentprobe diff --run-a trace_001.json --run-b trace_002.json
agentprobe replay --trace trace_001.json

Download: agentprobe.py

Caveats

The five failure modes cover the most common production breakdowns, yet they remain an incomplete set. Model-specific quirks, provider-specific rate limit architectures, and custom orchestration logic introduce failure modes unique to your stack. Treat these five as your baseline scan, then extend the detection patterns to match your architecture.

The agentprobe tool instruments API calls and logs token counts, yet it relies on the provider reporting accurate token usage. Some providers approximate. Cross-check token counts against your own tokenizer when precision matters.

Determinism is a spectrum, not all-or-nothing. Temperature 0 reduces variance dramatically, yet even at temperature 0, some models exhibit minor non-determinism due to floating-point accumulation differences across hardware. Replay results that match 99% of tokens are as good as deterministic for practical purposes.
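One simple way to put a number on "matches 99% of tokens" is a position-by-position comparison of the original and replayed token sequences. This is a sketch with hypothetical names, and it is not how agentprobe's replay command is known to score runs.

```python
from typing import List


def token_match_ratio(original: List[str], replay: List[str]) -> float:
    """Fraction of positions where both runs produced the same token."""
    if not original and not replay:
        return 1.0
    longest = max(len(original), len(replay))
    matches = sum(1 for a, b in zip(original, replay) if a == b)
    return matches / longest  # extra or missing tokens count as mismatches


def effectively_deterministic(original: List[str], replay: List[str],
                              threshold: float = 0.99) -> bool:
    """Apply the 99% rule of thumb from the caveat above."""
    return token_match_ratio(original, replay) >= threshold


original_tokens = ["tok"] * 100
replay_tokens = ["tok"] * 99 + ["other"]
ratio = token_match_ratio(original_tokens, replay_tokens)
```

A positional comparison is strict: one inserted token early in the replay shifts everything after it and tanks the ratio. If that is too harsh for your outputs, an alignment-based measure such as `difflib.SequenceMatcher` is a reasonable alternative.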

Philosophy

The Crucible stance: test in conditions that match production, or accept production failures as inevitable. Every shortcut in your testing pipeline compounds into a failure in your production pipeline. Agents are stochastic systems. Stochastic systems demand systematic testing, systematic observation, and systematic repair.

The dev-to-prod gap is avoidable. It requires treating your agent’s non-determinism as a first-class engineering concern, designing for the worst case from day one, and instrumenting everything. The tools in this article automate the detection. The fixes are architectural. The discipline is yours.

Crucible C01 is the first article in the Crucible Series by ArchonHQ. Each article dissects a specific AI agent failure mode and delivers the prompts and tools to eliminate it. Subscribe for full access to the series.



Top comments (2)

xulingfeng • Jun 8

"Your agent nailed every test case. You shipped it. Within 48 hours, users report hallucinated outputs."

This is the line every QA engineer reads and flinches at. The hard part isn't catching the 500s — it's that the agent passes the test suite and then still breaks in production with a 200. Traditional pass/fail assertions don't cover stochastic outputs, which means your test suite gives you false confidence right up to the moment it matters.

The temperature pin to 0 is a good start, but I'd argue the real fix is: test the agent's output contract, not just its execution path. A test that checks "did it return JSON" is worthless if the JSON is hallucinated. We've found richer assertions (semantic similarity on critical fields, boundary checks on confidence scores) catch the drift before it reaches users.

Curious — do you see Crucible expanding into output-contract-level assertions, or is the focus staying on execution environment?

Lazypl82 • Jun 8

The dev-to-prod gap holds for one moment, and the next one — the first 10-20 minutes after deploy — usually surfaces something none of the five modes caught in isolation. Temperature was pinned, the context budget was instrumented, the prompts were locked. But the agent is now running against real traffic shape, and the silent failure mode is whether a developer can still tell within minute 8 that something is off, before the cost compounds. The detection patterns you list are great for the build phase. The interesting gap is who's watching the window after the build passes.