Your agent demo works. That's the trap.

#ai #programming #machinelearning #softwareengineering

I build AI agents for other companies for a living. The pattern I see most often isn't "the model can't do it." It's "the demo worked, we shipped it, and now it fails one out of every three times and nobody can say why."

That gap between demo and production is mostly arithmetic, and once you internalize the math it changes how you build.

The math nobody puts on the slide

Say each step in your agent is 95% reliable. Sounds great. Now chain ten steps together, which is a modest agent by 2026 standards, and assume each step fails independently and any single failure sinks the run:

0.95 ^ 10 ≈ 0.60

Sixty percent end-to-end. Stretch it to twenty steps and you're at 36%. And 95% per step is generous. For agents doing real work over messy inputs, the per-step error rates I see land closer to 10–20%. Run the numbers on an 85% step over eight steps:

0.85 ^ 8 ≈ 0.27

About three in four runs fail somewhere. That's not a bad model. That's compounding probability doing exactly what it does.
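If you want to plug in your own numbers, here's a minimal sketch of the same arithmetic. It assumes steps fail independently and that any failed step fails the run, which is a simplification. The function names are mine, and `required_per_step` just inverts the formula to show how reliable each step has to be to hit a target.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end success rate."""
    return target ** (1 / steps)


if __name__ == "__main__":
    # The three scenarios from the text above.
    for per_step, steps in [(0.95, 10), (0.95, 20), (0.85, 8)]:
        rate = end_to_end_success(per_step, steps)
        print(f"{per_step:.0%} per step x {steps} steps -> {rate:.0%} end to end")

    # How good does each step need to be for a 90% reliable ten-step chain?
    print(f"needed per step: {required_per_step(0.90, 10):.1%}")
```

The inverse is the sobering part: a ten-step chain needs roughly 99% per step to land at 90% overall.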

A demo hides this completely. A demo is one happy path: clean input, short chain, no rate limits, no ambiguous data, you running it five times until it looks good for the recording. Production is a hundred users feeding it garbage you never imagined, on chains that are longer than you think because every "call this tool and parse the result" is really three or four steps under the hood.

Failures are invisible at the step level

Here's the part that actually burns teams. Compounding failure doesn't show up as a crash. Every individual step looks reasonable in isolation.

Step 3 slightly misreads a field. The output is still well-formed JSON. It gets fed to step 4, which reasons confidently from corrupted context, and steps 5 through 8 build on top of that. The final answer is wrong, plausible-looking, and there's no stack trace pointing at step 3. You only find it by tracing the whole causal chain by hand, usually after a customer screenshots something embarrassing.

This is why "the model hallucinated" is the wrong diagnosis most of the time. The model did what it always does — propagate whatever it was handed. The system had no checkpoint, no validation gate, no way to catch the drift at step 3 before it poisoned everything downstream.

The other quiet killer is context. People hear "200K token window" and assume they have 200K tokens of working memory. In practice agents start losing the plot well before that as older instructions get buried under tool output and intermediate junk. Context quality, not context size, is the real limit. A tighter 8K of relevant context beats 80K of noise every time.

What actually moves the needle

None of the fixes are exotic. They're the boring distributed-systems discipline we already know, applied to a non-deterministic worker. The mental shift that matters: stop treating the agent as a prompt, start treating it as a system.

Checkpoint state outside the agent. State lives in a store, not in the conversation. If the process dies at step 6, you resume at step 6 instead of restarting the whole chain and paying for it twice. This one change turns "the run failed, start over" into "the run failed, here's exactly where, retry from there."
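A rough sketch of what that can look like, assuming the state is JSON-serializable and each step is a plain function from state to state. `CheckpointStore` and `run_chain` are hypothetical names, and the JSON file stands in for whatever durable store you actually use.

```python
import json
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]


class CheckpointStore:
    """Hypothetical file-backed store; a database row would play the same role."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"next_step": 0, "state": {}}

    def save(self, next_step: int, state: dict) -> None:
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps({"next_step": next_step, "state": state}))
        tmp.replace(self.path)


def run_chain(steps: list[Step], store: CheckpointStore) -> dict:
    """Run steps in order, resuming from the last saved checkpoint."""
    checkpoint = store.load()
    state = checkpoint["state"]
    for index in range(checkpoint["next_step"], len(steps)):
        state = steps[index](state)   # may raise; the checkpoint stays at `index`
        store.save(index + 1, state)  # persist only after the step succeeds
    return state
```

If a step raises, calling `run_chain` again with the same store picks up at the failed step and skips everything that already completed.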

Validate at the boundaries. Every tool's input and output gets checked against a contract: a schema, a sanity check, an assertion that the number is in range. The goal is to catch the corrupt step-3 output at step 3, where it's a clean recoverable error, instead of at step 8 where it's a mystery. Pydantic-style validation on tool I/O is the cheapest reliability you can buy.
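To make the idea concrete, here's a minimal sketch using only the standard library so it runs anywhere. In a real build a Pydantic model would likely replace the hand-written checks. The `InvoiceLookup` tool output, its field names, and the allowed ranges are all made up for illustration.

```python
import json
from dataclasses import dataclass


class ContractError(ValueError):
    """Raised when a tool's output violates its contract."""


@dataclass(frozen=True)
class InvoiceLookup:
    # Hypothetical tool output; fields and limits are illustrative only.
    invoice_id: str
    amount_cents: int
    currency: str

    def __post_init__(self) -> None:
        if not isinstance(self.invoice_id, str) or not self.invoice_id:
            raise ContractError("invoice_id must be a non-empty string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.amount_cents, bool) or not isinstance(self.amount_cents, int):
            raise ContractError("amount_cents must be an integer")
        if not 0 <= self.amount_cents <= 10_000_000:
            raise ContractError("amount_cents out of range")
        if self.currency not in {"USD", "EUR", "INR"}:
            raise ContractError(f"unexpected currency: {self.currency!r}")


def parse_tool_output(raw: str) -> InvoiceLookup:
    """Validate at the boundary so a bad payload fails here, not five steps later."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ContractError("expected a JSON object")
    try:
        return InvoiceLookup(**payload)
    except TypeError as exc:  # missing or unexpected fields
        raise ContractError(str(exc)) from exc
```

The case that matters is the well-formed but wrong payload: `{"invoice_id": "INV-1", "amount_cents": "12.50", "currency": "USD"}` is valid JSON, and it still gets rejected at the boundary instead of flowing into the next step.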

Make side effects idempotent. Retries are non-negotiable with a non-deterministic worker, which means a step can run twice. If a step charges a card or sends an email, an idempotency key is the difference between a retry and an incident. Worth saying out loud: retrying an LLM step is not a cache lookup — the same prompt can return a different answer — so idempotency has to live in the side effect, not the model call.
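Here's a small sketch of that split, under some loud assumptions: `EmailOutbox` is a hypothetical wrapper, the in-memory set stands in for a durable store, and a real system would record the key and perform the send atomically (or pass the key to a provider that deduplicates for you).

```python
import hashlib


class EmailOutbox:
    """Hypothetical side-effect wrapper that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._seen: set[str] = set()             # stand-in for a durable store
        self.sent: list[tuple[str, str]] = []    # stand-in for the real send

    def send(self, key: str, to: str, body: str) -> bool:
        """Send once per key. Returns False when the key was already used."""
        if key in self._seen:
            return False  # a retry of the same step: skip the side effect
        self._seen.add(key)
        self.sent.append((to, body))
        return True


def idempotency_key(run_id: str, step_name: str) -> str:
    # Derived from the run and the step, not from the model output,
    # because the output can differ between attempts.
    return hashlib.sha256(f"{run_id}:{step_name}".encode()).hexdigest()
```

The key design choice is what goes into the key. If you hash the generated email body, a retry that produces slightly different wording looks like a new request and the customer gets two emails.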

Put evals in CI. Treat agent behavior like code that can regress, because it does. A prompt tweak that helps one case quietly breaks five others, and without a test set you ship it blind. A modest suite of real cases that runs on every change catches the silent regressions that manual spot-checking never will.
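A bare-bones version of that suite might look like the sketch below. Everything here is an assumption for illustration: the agent is any callable from prompt to text, `EvalCase` and `gate` are hypothetical names, and the 90% threshold is arbitrary. A pass-rate gate rather than all-or-nothing is one way to live with non-deterministic output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of failing cases; an exception counts as a failure."""
    failures = []
    for case in cases:
        try:
            passed = case.check(agent(case.prompt))
        except Exception:
            passed = False
        if not passed:
            failures.append(case.name)
    return failures


def gate(failures: list[str], total: int, min_pass_rate: float = 0.9) -> bool:
    """True when the pass rate meets the threshold; CI fails the build otherwise."""
    if total == 0:
        return False  # an empty suite should not count as passing
    return (total - len(failures)) / total >= min_pass_rate
```

In CI, you'd build the case list from real transcripts, call `run_evals` with the current agent, and exit non-zero when `gate` returns False, printing the failing case names so the regression is easy to find.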

The uncomfortable truth is that going from a slick demo to something you'd put in front of paying users is mostly unglamorous engineering — error handling, state management, observability — not better prompts. At Shanti Infosoft most of our actual work on an agent build is exactly that scaffolding, not the model wrangling people expect. We went deeper on this on our blog: why 40% of AI-agent projects will be dead by 2027.

If you're staring at an agent that demos beautifully and flakes in prod, don't reach for a bigger model first. Open a trace, find the step where the chain quietly goes sideways, and ask why nothing caught it there. Nine times out of ten the answer isn't intelligence. It's that you built a happy path and called it a system.

DEV Community
