SapotaCorp

Posted on • Originally published at sapotacorp.vn on

Four forensics when a production AI agent fails

A founder messaged us at 11pm on a Friday: "Our agent is broken. Customers are complaining. My on-call engineer has no idea where to start. Can you help?"

The agent was a customer support tool that had launched the previous Monday. By Friday evening, the company's support inbox had filled with users reporting that the AI was giving wrong answers, taking forever to respond, or just timing out. The engineering team was treating it as one big problem. It was actually four problems stacked on top of each other.

This is the failure pattern most production agent teams hit at some point. The symptoms compound, the team panics, and they start trying random fixes. Here is the forensics order Sapota walks through, and the four most common failure modes that account for the majority of post-launch incidents.

Forensics order: traces first

Before debugging anything else, look at the traces. If your agent is in production without traces, that is the first problem to solve, even mid-incident. Pull a request that is failing, look at the trace, and see where the time is being spent and what is failing.
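If there is no tracing in place at all, even a crude recorder is better than nothing mid-incident. The sketch below is a stand-in, not a replacement for a real tracing library: the `span` helper, the in-memory `SPANS` list, and the step name `tool.kb_search` are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager

SPANS = []  # in-memory sink; a real setup would export to a tracing backend


@contextmanager
def span(name):
    """Record how long a step took and whether it raised."""
    start = time.perf_counter()
    record = {"name": name, "error": None}
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise  # never swallow the failure, only record it
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


# Usage: wrap each tool call and LLM step of a request.
with span("tool.kb_search"):
    time.sleep(0.01)  # placeholder for the real call

print(SPANS[-1]["name"], round(SPANS[-1]["duration_ms"]), "ms")
```

Wrapping every tool call and LLM step this way gives enough data to answer "where is the time going and what is failing" for a single request.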

The pattern we look for in the trace:

  • Where does the request actually fail? A specific tool call? A specific LLM step? A timeout?
  • What does the failure look like? HTTP error? LLM hallucination? Output schema mismatch? Output that passes validation but is wrong?
  • What changed recently? Compare a failing request to a working request from a week ago. What is different?
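The three questions above can be asked of trace data directly. A minimal sketch, assuming each trace is a flat list of spans with a name, a duration, and an optional error; the `Span` type and the step names are illustrative, not the format of any particular tracing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str                    # hypothetical step name, e.g. "tool.kb_search"
    duration_ms: float
    error: Optional[str] = None


def first_failure(trace: list[Span]) -> Optional[Span]:
    """Return the first span that recorded an error, if any."""
    return next((s for s in trace if s.error is not None), None)


def slowdowns(failing: list[Span], baseline: list[Span], factor: float = 2.0) -> list[str]:
    """Steps whose total time grew by at least `factor` versus the baseline trace."""
    def totals(trace):
        out = {}
        for s in trace:
            out[s.name] = out.get(s.name, 0.0) + s.duration_ms
        return out

    base, cur = totals(baseline), totals(failing)
    # Steps that did not exist in the baseline are skipped here, not flagged.
    return [name for name, ms in cur.items()
            if name in base and ms >= factor * base[name]]


baseline = [Span("llm.route", 400), Span("tool.kb_search", 50), Span("llm.answer", 900)]
failing = [Span("llm.route", 420), Span("tool.kb_search", 800),
           Span("llm.answer", 950, error="timeout")]

print(first_failure(failing).name)   # llm.answer
print(slowdowns(failing, baseline))  # ['tool.kb_search']
```

In this made-up example the timeout surfaces in the final LLM step, but the step that actually changed is retrieval. That gap between where a request fails and what changed is the reason to compare against a known-good trace.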

In the founder's case, the traces showed three different failure patterns appearing in the same week. The team had been treating them as one problem because the customer-facing symptom was the same: "the AI is broken."

Failure mode 1: External dependency degradation

The most common production agent failure is an external dependency getting slower or less reliable. The agent itself is fine; the world around it changed.

Common culprits:

  • LLM provider rate limits. OpenAI or Anthropic starts throttling because your traffic increased past your tier limit. Each request now takes several attempts before it succeeds, so latency multiplies with the number of retries.
  • Retrieval system slowdown. Your vector database is under more load than at launch and the p95 query latency went from 50ms to 800ms.
  • External API drift. A tool you call (CRM API, billing system, search service) had a quiet update that changed response shape or timing.
  • Knowledge base growth. Your KB has tripled in size since launch, and retrieval recall has dropped because you never tuned for the larger corpus.

The diagnostic: check your tool latency and error metrics for the past week and compare them to launch. If any tool's p95 latency is at least 2x what it was at launch, or its error rate is up by more than one percentage point, that tool is the candidate.
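That check is easy to automate. A minimal sketch, assuming you can export per-tool latency samples and an error rate for both the launch week and the current week; the dictionary layout is an assumption, not the output of any specific metrics system.

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def is_candidate(launch, current):
    """True if p95 latency at least doubled or the error rate rose by more than 1 point."""
    latency_regressed = p95(current["latencies_ms"]) >= 2 * p95(launch["latencies_ms"])
    errors_regressed = current["error_rate"] - launch["error_rate"] > 0.01
    return latency_regressed or errors_regressed


launch = {"latencies_ms": [50.0] * 100, "error_rate": 0.002}
this_week = {"latencies_ms": [50.0] * 90 + [800.0] * 10, "error_rate": 0.002}

print(p95(this_week["latencies_ms"]))  # 800.0
print(is_candidate(launch, this_week))  # True
```

Note that the median in the example is unchanged at 50ms. Only the tail moved, which is why the check uses p95 and not the average.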

The fix depends on the specific dependency. Rate limits: upgrade your tier or implement exponential backoff. Slow retrieval: tune the index or scale the database. API drift: update the integration. KB growth: re-tune chunking and retrieval parameters.
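For the rate-limit case, a minimal sketch of exponential backoff with jitter. `RateLimited` is a hypothetical stand-in for whatever exception your provider's client raises on HTTP 429, and the delay values are illustrative defaults, not recommendations from any provider.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's rate-limit (HTTP 429) error."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time up to the capped exponential delay,
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Backoff keeps the agent working under throttling, but it does not remove the latency cost of retries. If retries are the norm and not the exception, the tier upgrade is the real fix.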

In the founder's case, the LLM provider had pushed a quiet model update on Wednesday. The new model interpreted the routing prompt slightly differently, causing the agent to loop more often before settling on an answer. Average iterations went from 2.3 to 4.1. Cost and latency both jumped. The fix was a tighter routing prompt with three more few-shot examples.

Failure mode 2: Validation gates that aren't being triggered

The second failure is internal rather than external: a validation gate is supposed to be catching bad outputs, but it is not firing because the gate logic has a bug or the threshold is wrong.

Common patterns:

  • Faithfulness threshold too low. Set at 0.5, the gate passes responses that are mostly hallucinated. Should be 0.85+ for production.
  • Schema validation that allows empty values. The output schema requires an "answer" field but accepts null or an empty string. Empty answers ship to users as "I don't know" without the agent realizing it failed.
  • Toxicity filter not loaded. The filter library was supposed to be imported, but a refactor moved the import and now it silently no-ops.
  • PII redaction running on the wrong field. Redacts user input but ships PII through in the response.

The diagnostic: look at a sample of bad responses customers reported. Trace what should have caught them. If a validation gate exists for that failure mode, check whether it actually fired.
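One way to make "did the gate fire?" answerable is to have every gate record its decision. A minimal sketch, assuming a faithfulness score in [0, 1] is computed elsewhere and passed in; `run_gates`, the gate names, and the logger name are hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("agent.gates")

FAITHFULNESS_THRESHOLD = 0.85  # the production value suggested in the list above
FALLBACK = "I cannot find that in our knowledge base."


@dataclass
class GateResult:
    answer: str
    fired: list[str]  # names of the gates that rejected this response


def run_gates(answer: Optional[str], faithfulness: float) -> GateResult:
    fired = []
    # Reject missing, empty, and whitespace-only answers, not just a missing field.
    if not isinstance(answer, str) or not answer.strip():
        fired.append("empty_answer")
    if faithfulness < FAITHFULNESS_THRESHOLD:
        fired.append("low_faithfulness")
    for gate in fired:
        logger.warning("gate fired: %s", gate)  # leaves evidence for the next incident
    return GateResult(answer=FALLBACK if fired else answer, fired=fired)


print(run_gates("Refunds take 5 business days.", 0.92).fired)  # []
print(run_gates("", 0.95).fired)                               # ['empty_answer']
```

Because the result carries the list of gates that fired, a bad response reported by a customer can be checked against the log: either a gate fired and was ignored downstream, or no gate fired and the threshold or logic is the problem.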

In the founder's system, the faithfulness threshold was set at 0.7, which was too permissive. We tightened it to 0.85, the rejection rate went from 2% to 9%, and the customer complaints about wrong answers dropped immediately. The rejected responses were replaced with an honest "I cannot find that in our knowledge base" message, which users preferred to wrong answers.

Failure mode 3: Cost runaway from edge cases

Production query distribution is different from test distribution. Specific query patterns can be much more expensive than the average, and a few of those can dominate the bill.

The pattern: a small fraction of users (often 1-5%) generates a large fraction of cost (often 30-60%). This happens through legitimate complex queries, through abuse, or because their input triggers a degenerate code path in the agent.

The diagnostic: pull cost-per-user statistics for the last week. Sort descending. Look at the top 10 users. Are they sending normal queries? Or is one user looping their integration with bad inputs? Or is a specific query class (long documents, malformed input, multi-turn deep into rare topics) eating budget?
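The ranking itself takes only a few lines. A minimal sketch, assuming you can export one `(user_id, cost_usd)` record per request from your billing or tracing data; the record shape and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def top_spenders(records, n=10):
    """records: iterable of (user_id, cost_usd). Returns [(user_id, cost, share_of_total)]."""
    totals = defaultdict(float)
    for user_id, cost in records:
        totals[user_id] += cost
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [(user, cost, cost / grand_total if grand_total else 0.0)
            for user, cost in ranked]


records = [("u1", 20.0), ("u2", 10.0), ("u3", 50.0), ("u1", 20.0)]
for user, cost, share in top_spenders(records, n=2):
    print(f"{user}: ${cost:.2f} ({share:.0%} of total)")
```

The share column matters more than the absolute cost: a user at 40% of total spend deserves a look at their actual queries regardless of the dollar amount.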

The fixes vary:

  • Per-user rate limit and cost cap. Hard limit per user per day. Cuts off abuse without affecting legitimate use.
  • Input length cap. Most LLM cost scales with input tokens. Cap user input at a reasonable max (say, 10k characters) and politely ask them to summarize for longer queries.
  • Query type routing. If a specific query type is expensive, route it to a simpler/cheaper handler when possible. "Generate a comprehensive report" is expensive; route it to async batch processing instead of synchronous chat.
  • Iteration cap per request. Prevent the agent from looping indefinitely on a single request. We default to 5-10 iterations max.
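Three of these limits fit in a small guard around the agent loop. A minimal sketch with in-memory state, assuming a single process; `BudgetGuard`, `run_agent`, and the `step` callable are hypothetical names, and the $5 cap is the figure from the founder's incident, not a general recommendation.

```python
from collections import defaultdict
from datetime import date

MAX_INPUT_CHARS = 10_000    # the example input cap from the list above
DAILY_COST_CAP_USD = 5.00   # per-user cap used in the founder's incident
MAX_ITERATIONS = 8          # assumption: a value inside the 5-10 default range


class RequestRejected(Exception):
    pass


class BudgetGuard:
    def __init__(self):
        self._spend = defaultdict(float)  # (user_id, day) -> USD spent that day

    def check(self, user_id, text, today=None):
        """Raise before any tokens are spent if the request breaks a limit."""
        today = today or date.today()
        if len(text) > MAX_INPUT_CHARS:
            raise RequestRejected("Input too long; please summarize your question.")
        if self._spend[(user_id, today)] >= DAILY_COST_CAP_USD:
            raise RequestRejected("Daily usage limit reached.")

    def record(self, user_id, cost_usd, today=None):
        today = today or date.today()
        self._spend[(user_id, today)] += cost_usd


def run_agent(step, max_iterations=MAX_ITERATIONS):
    """Call step() until it returns an answer or the iteration cap is hit."""
    for _ in range(max_iterations):
        answer = step()
        if answer is not None:
            return answer
    return None  # caller should send a fallback message, not keep looping
```

A multi-instance deployment would need shared storage for the spend counters; the in-memory dictionary here only illustrates the logic.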

In the founder's system, two users were sending repeated multi-paragraph product comparison requests, generating about 40% of the daily cost between them. We added a per-user daily cost cap and a length limit on inputs. Cost dropped 35% within 48 hours. Neither user complained because both were testing internal features and the cap was generous enough for normal use.

Failure mode 4: Silent quality drift

The hardest failure to detect: nothing is broken, no errors, latency is fine, cost is normal. But the responses are getting worse. Customers complain, the team cannot reproduce, and the dashboards all look green.

Causes:

  • Prompt drift. Engineering edits to the prompt template that look small but change behavior. A removed example, a reworded instruction, a clarification that the LLM interprets differently.
  • Model provider updates. As above, the underlying model can change without you knowing. Quality on your specific use case can shift in either direction.
  • Corpus drift. Your KB has accumulated content that pollutes retrieval. Old documents that should have been deprecated still rank highly. New documents conflict with old documents.
  • Eval set staleness. Your eval set was written six months ago against an older product version. It does not reflect what users actually ask now.

The diagnostic: run your eval pipeline against the current production agent. Compare against the score from launch. If the score has dropped, you have quality drift. If the score is the same but customers are complaining, your eval set has gone stale.
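The decision rule in that paragraph can be written down explicitly. A minimal sketch; the `tolerance` for run-to-run noise is an assumption you would calibrate against your own eval pipeline.

```python
def diagnose(launch_score, current_score, complaints_rising, tolerance=0.02):
    """Apply the drift rule: compare the current eval score to the launch score."""
    if launch_score - current_score > tolerance:
        return "quality drift"
    if complaints_rising:
        # Score held steady but users are unhappy: the eval no longer measures
        # what users actually ask.
        return "stale eval set"
    return "no drift detected"


print(diagnose(0.84, 0.71, complaints_rising=True))  # quality drift
print(diagnose(0.84, 0.84, complaints_rising=True))  # stale eval set
```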

The fix: refresh the eval set. Sample 50-100 actual production queries, write expected answers for each, run the eval, and tune from there. Most teams refresh eval sets quarterly. Teams in fast-moving domains do it monthly.
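Drawing the sample can be scripted so the refresh is repeatable. A minimal sketch, assuming production queries are available as a list of strings; the function name and the fixed seed are illustrative choices.

```python
import random


def sample_for_eval(queries, k=50, seed=0):
    """Pick up to k distinct production queries to label for the refreshed eval set."""
    # Drop blanks and exact duplicates; sort so the result is reproducible.
    unique = sorted({q.strip() for q in queries if q.strip()})
    rng = random.Random(seed)  # fixed seed: the same log yields the same sample
    return rng.sample(unique, min(k, len(unique)))


logged = [f"question {i}" for i in range(500)] + ["question 1", "  ", ""]
picked = sample_for_eval(logged, k=50)
print(len(picked), len(set(picked)))  # 50 50
```

A uniform random sample reflects what users ask most often. If the complaints cluster in rare query types, it may be worth sampling those categories separately so they are not drowned out.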

What we shipped for the founder's incident

The four-hour Friday-night triage:

  1. External dependency: identified the LLM provider's model update, tightened the routing prompt, agent stopped looping. Latency and cost recovered.
  2. Validation gate: tightened the faithfulness threshold from 0.7 to 0.85, rejected responses now return honest "I don't know" instead of hallucinations.
  3. Cost runaway: added per-user daily cost cap of $5, input length limit of 10k characters. Cost dropped 35%.
  4. Quality drift: ran eval, score had dropped from 0.84 to 0.71 since launch. Refreshed eval set with 50 recent production queries, identified three categories of questions the agent was failing, added documentation coverage for those categories. Score rose to 0.86 after a week.

Customer complaints stopped within 72 hours. The team's mood went from "we built a broken thing" to "we built a thing that needs operational rigor we did not anticipate." That second framing is the one that produces a better product.

The recommendation: rehearse the forensics

The founder's team had no playbook for "the agent is broken in production." They were debugging in panic mode, which slowed every step. After the incident, we wrote a one-page runbook with the four failure modes, the diagnostic for each, and the most common fixes.

Six weeks later, when a similar issue happened (a tool API outage), the on-call engineer worked through the runbook, identified the cause in 20 minutes, applied the documented fix, and was done in an hour. No panic, no escalation, no Friday-night call to a consultant.

This is what production agent operations looks like at maturity. Not "nothing ever goes wrong" but "when things go wrong, the team has a known process to find the cause."

If your agent had a launch that went sideways

If your team launched an AI agent and the first few weeks have been more painful than expected, the right intervention is usually a forensic audit, not more development. Most launch issues are not new bugs in the agent code. They are operational gaps that surface only at production scale.

Sapota offers a one-week post-launch audit that walks through traces, validation, dependencies, and quality drift, identifies which of the four failure modes is responsible for which symptoms, and ships fixes plus a runbook for future incidents. We have done this for half a dozen B2B SaaS clients in the first three months after their AI launches.

Reach out via the AI engineering page with a description of what your agent does and what kind of failures you are seeing. The first conversation is usually the diagnostic.
