Saurav Bhattacharya

Posted on Jun 11

The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model

#ai #agents #observability #testing

Your agent demo took an afternoon. The reason it isn't in production nine months later has nothing to do with the model.

I've watched this play out at four companies now. Someone wires up a tool-calling loop, points it at a slick use case, and records a screen capture where the agent books a meeting, queries a database, and writes a summary—all in one clean pass. Leadership is thrilled. A roadmap appears. And then the thing quietly never ships, or it ships and gets rolled back within a month.

The demo-to-production gap is not a model-quality gap. GPT-class models are more than good enough for most agentic work today. The gap is an engineering discipline gap, and pretending otherwise is why so many "AI initiatives" stall. Here's what actually separates a demo agent from a production agent.

A demo runs once. Production runs ten thousand times.

The single most misleading property of a demo is that you only have to see it work once. You run it until you get the clean take, and that take becomes the truth in everyone's head.

Production agents are graded on the tail. If your agent succeeds 92% of the time, that sounds great until you do the math: 8% failure across 10,000 daily runs is 800 broken interactions, every single day. And agent failures aren't independent dice rolls—they cluster. A schema change, a rate limit, a slightly reworded prompt from a user, and your failure rate spikes to 40% for an hour before anyone notices.
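To make that arithmetic concrete, here is a small sketch. The run volume and failure rates are the ones quoted above; the assumption that traffic is spread evenly across 24 hours is mine, purely for illustration.

```typescript
// Back-of-the-envelope failure math for the numbers in the paragraph above.
// Assumption (not from the post): traffic is spread evenly over 24 hours.

function expectedFailures(runs: number, failureRate: number): number {
  return runs * failureRate;
}

const dailyRuns = 10_000;
const baselineRate = 0.08; // agent succeeds 92% of the time
const spikeRate = 0.4;     // failure rate during a one-hour incident

const runsPerHour = dailyRuns / 24;

const dailyFailures = expectedFailures(dailyRuns, baselineRate);
const normalHourFailures = expectedFailures(runsPerHour, baselineRate);
const spikeHourFailures = expectedFailures(runsPerHour, spikeRate);

console.log(`baseline failures per day: ${Math.round(dailyFailures)}`);
console.log(`failures in a normal hour: ${Math.round(normalHourFailures)}`);
console.log(`failures in a spike hour: ${Math.round(spikeHourFailures)}`);
```

Under that even-traffic assumption, one bad hour produces about 167 failures against roughly 33 in a normal hour, which is why clustered failures hurt more than the daily average suggests.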

The demo mindset optimizes the mean case. Production lives or dies on the p95 and p99.

A demo has a happy path. Production has an adversary called "reality."

In the demo, the API returns 200. The user phrases the request the way you rehearsed. The retrieved document actually contains the answer. Nothing is null.

Production hands your agent:

Tool calls that time out, return partial data, or 500 intermittently
Users who paste 8,000 tokens of irrelevant context
Retrieval that returns three documents, none of which answer the question
Its own previous turn, which was subtly wrong, now poisoning the context

A demo agent treats the model output as the product. A production agent treats the model output as an untrusted input that must be validated before it touches anything real. That single mental shift changes how you write every line.

// Demo agent: trust the model, execute the tool call.
// (Named demo* so both variants can sit in one scope without redeclaring.)
const demoDecision = await llm.complete(prompt);
const demoResult = await tools[demoDecision.tool](demoDecision.args);

// Production agent: the model proposes, your code disposes.
const decision = await llm.complete(prompt);

const check = validateToolCall(decision, {
  allowedTools: ["search", "read_record"], // no writes from this path
  argSchema: toolSchemas[decision.tool],    // structural validation
  budget: { maxToolCalls: 6, used: turnState.toolCalls },
});

if (!check.ok) {
  // Don't crash, don't blindly retry. Record and degrade.
  logViolation(check.reason, decision, turnState.traceId);
  return safeFallback(check.reason);
}

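const result = await withGuards(
  () => tools[decision.tool](check.sanitizedArgs),
  {
    timeoutMs: 4000,
    retries: 1,
    // Key on trace + call index so separate tool calls in one trace never share a key.
    idempotencyKey: `${turnState.traceId}:${turnState.toolCalls}`,
  },
);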
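withGuards is not defined anywhere in this post. One way such a wrapper might look is sketched below, assuming it only needs to enforce a per-attempt timeout and a bounded number of retries. The option names mirror the call above; the timeoutAfter helper, the error message, and passing the idempotency key through to the wrapped call are illustrative assumptions.

```typescript
// Hypothetical sketch of a withGuards-style wrapper; not the author's implementation.
type GuardOptions = {
  timeoutMs: number;      // max time allowed for a single attempt
  retries: number;        // extra attempts after the first failure
  idempotencyKey: string; // handed to the call so a retry can be deduplicated downstream
};

// Helper (assumed): a promise that rejects after `ms`, plus a way to cancel its timer.
function timeoutAfter(ms: number): { promise: Promise<never>; cancel: () => void } {
  let timer: ReturnType<typeof setTimeout>;
  const promise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return { promise, cancel: () => clearTimeout(timer) };
}

async function withGuards<T>(
  call: (idempotencyKey: string) => Promise<T>,
  opts: GuardOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    const timeout = timeoutAfter(opts.timeoutMs);
    try {
      // Race the tool call against the timer so a hung call cannot stall the turn.
      return await Promise.race([call(opts.idempotencyKey), timeout.promise]);
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next attempt
    } finally {
      timeout.cancel(); // always clear the timer, whichever side won the race
    }
  }
  throw lastError;
}
```

Note that a timed-out call is abandoned, not cancelled. It may still complete on the server side, which is exactly why a retry needs an idempotency key.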

That validateToolCall boundary is the part the demo skips entirely. In my experience, it's also about 70% of the work.
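The post doesn't show validateToolCall itself, so here is a minimal sketch of what that boundary could check, using the option names from the call above: the tool is on an allowlist, the call budget isn't exhausted, and the arguments match an expected shape. The ToolCall and ArgSchema types and the very simple type-name schema are assumptions for illustration; a real system would more likely lean on a schema library.

```typescript
// Hypothetical sketch of the validation boundary; types and schema format are assumed.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ArgSchema = Record<string, "string" | "number" | "boolean">;

type ValidationOptions = {
  allowedTools: string[];
  argSchema: ArgSchema | undefined; // undefined when the tool has no known schema
  budget: { maxToolCalls: number; used: number };
};

type ValidationResult =
  | { ok: true; sanitizedArgs: Record<string, unknown> }
  | { ok: false; reason: string };

function validateToolCall(call: ToolCall, opts: ValidationOptions): ValidationResult {
  if (!opts.allowedTools.includes(call.tool)) {
    return { ok: false, reason: `tool not allowed: ${call.tool}` };
  }
  if (opts.budget.used >= opts.budget.maxToolCalls) {
    return { ok: false, reason: "tool call budget exhausted" };
  }
  if (opts.argSchema === undefined) {
    return { ok: false, reason: `no schema for tool: ${call.tool}` };
  }
  // Copy only the declared fields, so unexpected extras never reach the tool.
  const sanitizedArgs: Record<string, unknown> = {};
  for (const [name, expectedType] of Object.entries(opts.argSchema)) {
    const value = call.args[name];
    if (typeof value !== expectedType) {
      return { ok: false, reason: `argument "${name}" must be a ${expectedType}` };
    }
    sanitizedArgs[name] = value;
  }
  return { ok: true, sanitizedArgs };
}
```

Returning a tagged result instead of throwing keeps the caller's `if (!check.ok)` branch explicit, and every rejection carries a reason that can go straight into a log.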

A demo has no memory of what it did. Production needs a flight recorder.

When a demo fails, you shrug and run it again. When a production agent fails at 3 a.m. and a customer files a ticket, "run it again" is not an incident response. You need to answer: what exactly did the agent see, decide, and do, in order?

That means every turn needs a trace: the input, the retrieved context, the raw model output, the validated decision, the tool result, and the final action. Without it, debugging an agent is archaeology. With it, debugging is a query.

type AgentTrace = {
  traceId: string;
  step: number;
  input: string;
  retrieved: { id: string; score: number }[];
  modelOutput: string;        // raw, before parsing
  decision: ParsedDecision;   // after validation
  toolResult?: unknown;
  latencyMs: number;
  violation?: string;         // why a guard fired, if it did
};

function recordStep(trace: AgentTrace) {
  // Structured, queryable, sampled at 100% for failures.
  emit("agent.step", trace);
}

Notice that modelOutput is stored raw. One of the most common production bugs I see is a parsing or validation step silently mangling a perfectly good model response, and you can only catch that if you kept the original.
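As an illustration of "debugging is a query," the sketch below narrows recorded steps down to one trace, returns them in order, and picks out the first step where a guard fired. It assumes the traces are available as an in-memory array and uses a trimmed-down copy of the AgentTrace shape; in practice the same filter would run against whatever store emit writes to. The sample records are made up.

```typescript
// Trimmed-down copy of the trace shape above; other fields omitted for brevity.
type TraceStep = {
  traceId: string;
  step: number;
  modelOutput: string; // raw, before parsing
  violation?: string;  // why a guard fired, if it did
};

// Return every recorded step for one trace, in the order the agent took them.
function stepsForTrace(all: TraceStep[], traceId: string): TraceStep[] {
  return all
    .filter((t) => t.traceId === traceId) // filter copies, so the sort won't touch `all`
    .sort((a, b) => a.step - b.step);
}

// Find the first step in a trace where a guard fired, if any.
function firstViolation(all: TraceStep[], traceId: string): TraceStep | undefined {
  return stepsForTrace(all, traceId).find((t) => t.violation !== undefined);
}

// Illustrative data only; not real trace output.
const recorded: TraceStep[] = [
  { traceId: "t-42", step: 2, modelOutput: '{"tool":"delete_record"}', violation: "tool not allowed" },
  { traceId: "t-7", step: 1, modelOutput: '{"tool":"search"}' },
  { traceId: "t-42", step: 1, modelOutput: '{"tool":"search"}' },
];

const culprit = firstViolation(recorded, "t-42");
console.log(culprit?.step, culprit?.modelOutput);
```

Because the raw modelOutput travels with the step, you can see what the model actually proposed at the moment the guard fired, instead of reconstructing it after the fact.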

A demo measures vibes. Production measures regressions.

Here's the question that kills most agent projects: "We changed the prompt—is it better or worse now?"

In a demo culture, the answer is "it felt better in the three cases I tried." That is not an answer. Every prompt tweak, model upgrade, or tool change is a deploy, and every deploy can regress behavior you fixed two months ago. If you can't run a fixed suite of representative cases and get a number, you are flying blind and shipping on intuition.

This is the discipline gap in one sentence: demo agents are tested by their authors; production agents are tested by a suite that doesn't care about anyone's feelings. You need a corpus of real inputs (including the weird ones from your traces), graded with a mix of deterministic checks for the things that must be exactly right and judged checks for the things that are fuzzy. Run it on every change. Track the score over time. Treat a drop as a build failure, not a discussion topic.
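A minimal sketch of that loop, assuming each case carries a deterministic check and, optionally, a judge that returns a score between 0 and 1. The EvalCase shape, the 0.7 judge threshold, and the synchronous runAgent callback are placeholders I've made up for illustration (a real agent call would be async); this is not the API of any particular eval tool.

```typescript
// Hypothetical regression harness; names and thresholds are illustrative.
type EvalCase = {
  name: string;
  input: string;
  mustPass: (output: string) => boolean; // deterministic: must be exactly right
  judge?: (output: string) => number;    // fuzzy: score from 0 to 1
};

type SuiteResult = { passed: number; total: number; score: number; failures: string[] };

function runSuite(
  cases: EvalCase[],
  runAgent: (input: string) => string,
  judgeThreshold = 0.7,
): SuiteResult {
  const failures: string[] = [];
  for (const c of cases) {
    const output = runAgent(c.input);
    const deterministicOk = c.mustPass(output);
    // Cases without a judge are graded on the deterministic check alone.
    const judgedOk = c.judge === undefined || c.judge(output) >= judgeThreshold;
    if (!(deterministicOk && judgedOk)) failures.push(c.name);
  }
  const total = cases.length;
  const passed = total - failures.length;
  return { passed, total, score: total === 0 ? 0 : passed / total, failures };
}

// Treat a drop below the recorded baseline as a build failure.
function assertNoRegression(result: SuiteResult, baselineScore: number): void {
  if (result.score < baselineScore) {
    throw new Error(
      `regression: score ${result.score.toFixed(2)} < baseline ${baselineScore.toFixed(2)}; ` +
        `failing cases: ${result.failures.join(", ")}`,
    );
  }
}
```

In CI, you would run runSuite on every prompt, model, or tool change and call assertNoRegression against the last accepted score, so a drop stops the build instead of starting a debate.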

The uncomfortable summary

The reason your agent demo isn't in production is that a demo is a prototype of the happy path, and production is everything else. The model was never the bottleneck. The validation boundary, the trace, the budget enforcement, the regression suite—that unglamorous 70%—is the actual product. The model is just the part that's fun to demo.

If you internalize one thing: stop treating model output as your result and start treating it as an untrusted proposal your own code is responsible for.

When you're ready to build the regression suite, agent-eval gives you the tiered deterministic + judge harness so prompt changes stop being vibes-based, and AgentLens gives you the structured traces so a 3 a.m. failure is a query instead of an excavation. Wire those in before the demo gets a roadmap, not after it gets rolled back.

Top comments (2)

Alex Shev • Jun 11

This matches what I see too. The gap between demo and production is usually not model intelligence; it is the missing operating system around the agent.

A production agent needs clear task boundaries, recovery paths, audit logs, permissions, evaluation, and a human checkpoint for ambiguous outcomes. Without those, the model can solve the happy path and still be too risky for real work.

Adam Lewis • Jun 11

I've watched the same pattern at a smaller scale. The demo is one clean pass; production is a thousand slightly wrong ones. We started writing the acceptance checks before extending the agent's autonomy, so every new tool or step arrives with something that can fail loudly. The stalled projects I've seen mostly did it the other way round: capability first, with verification as a clean-up job that never got scheduled.