DEV Community

Patrick

The Prototype-to-Production Gap: Why Your AI Agent Works in Testing But Fails in the Wild

Most AI agents work fine in testing. You prompt them, they respond well, you ship. Then production happens.

Requests come in at weird times. Context files get stale. Edge cases appear that never showed up in demos. And the agent—without anyone watching—makes its best guess and keeps going.

This is the prototype-to-production gap. And it's not a model problem. It's a config problem.

What Changes Between Testing and Production

When you test an agent manually:

  • You're watching. You can intervene.
  • The context is fresh. You just loaded it.
  • You're feeding it clean, expected inputs.
  • If something goes wrong, you restart and try again.

In production, none of those are true.

The agent runs unsupervised. Context may be hours old. Inputs come from real users with real edge cases. And when something goes wrong, it fails silently—unless you've specifically designed it not to.

The Five Gaps That Kill Production Agents

1. No escalation rule

In testing, you catch the uncertain moment yourself. In production, no one's watching. Without an explicit rule—"if uncertain, write to outbox.json and stop"—the agent guesses. Sometimes right. Sometimes very wrong.

If uncertain or if task scope is unclear:
  - Stop immediately
  - Write context, blockers, and last known state to outbox.json
  - Do NOT guess or proceed
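A minimal sketch of that rule in Python. The `outbox.json` name comes from the rule above; the `confidence` and `scope` fields are hypothetical stand-ins for whatever uncertainty signal your agent actually has.

```python
import json

def escalate(reason: str, state: dict, path: str = "outbox.json") -> None:
    """Record why we stopped so a human (or the next session) can pick it up."""
    with open(path, "w") as f:
        json.dump(
            {"reason": reason, "last_known_state": state, "status": "blocked"},
            f,
            indent=2,
        )

def handle_task(task: dict) -> str:
    # "confidence" and "scope" are placeholder fields — wire in your agent's own signals.
    if task.get("confidence", 0.0) < 0.8 or not task.get("scope"):
        escalate("uncertain or unclear task scope", {"task": task})
        return "escalated"  # stop — do NOT guess or proceed
    return "proceed"
```

The point is the shape, not the threshold: the uncertain path writes state and stops, and the happy path is the only one that continues.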

2. Stale context

You loaded fresh context before testing. In production, context files might be hours old. If your agent doesn't check the timestamp of its context on boot, it's flying blind.

Add to your boot sequence:

On startup:
  1. Read current-task.json — check timestamp, reject if >4h old
  2. Read context-snapshot.json — validate it matches current date
  3. Check outbox.json — are there unresolved items from prior sessions?
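The boot sequence above could look like this in Python. The file names match the pattern in this post; the `timestamp` and `date` field names, and the outbox being a list of items with a `status` field, are assumptions you'd adapt to your own state format.

```python
import json
import os
import time
from datetime import date

MAX_CONTEXT_AGE_S = 4 * 3600  # reject context older than 4 hours

def boot_check() -> dict:
    """Validate state files on startup; raise (and escalate) rather than run blind."""
    with open("current-task.json") as f:
        task = json.load(f)
    if time.time() - task["timestamp"] > MAX_CONTEXT_AGE_S:
        raise RuntimeError("current-task.json is stale (>4h) — refusing to run")

    with open("context-snapshot.json") as f:
        snapshot = json.load(f)
    if snapshot["date"] != date.today().isoformat():
        raise RuntimeError("context-snapshot.json is from a different day")

    unresolved = []
    if os.path.exists("outbox.json"):
        with open("outbox.json") as f:
            # Assumes the outbox is a list of items, each with a "status" field.
            unresolved = [i for i in json.load(f) if i.get("status") != "resolved"]
    return {"task": task, "snapshot": snapshot, "unresolved": unresolved}
```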

3. No restart recovery

Testing: you start fresh each time. Production: the agent crashes, restarts, and starts over. Without a restart recovery pattern, it repeats work it already did—or worse, skips work it didn't finish.

Three-file restart recovery:

  • current-task.json — what was I doing?
  • context-snapshot.json — what did I know?
  • outbox.json — what was waiting for action?
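A sketch of the recovery side of that pattern, assuming `current-task.json` records a `status` and a `step` counter (both hypothetical field names):

```python
import json
import os

STATE_FILES = ["current-task.json", "context-snapshot.json", "outbox.json"]

def recover_or_start_fresh() -> dict:
    """On restart, reload whatever state files exist and decide: resume or start over."""
    state = {}
    for name in STATE_FILES:
        if os.path.exists(name):
            with open(name) as f:
                state[name] = json.load(f)

    task = state.get("current-task.json", {})
    if task.get("status") == "in_progress":
        # A task was mid-flight when we died: resume from the recorded step
        # instead of repeating finished work.
        return {"mode": "resume", "from_step": task.get("step", 0), "state": state}
    return {"mode": "fresh", "from_step": 0, "state": state}
```

The invariant that makes this work: the agent must update `current-task.json` after each completed step, so "what was I doing?" always has a current answer.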

4. Unbounded loops

In testing, you run it once. In production, agents run in loops. Without a session budget (max_steps, max_runtime, max_tokens), a loop that hits an unexpected state can run indefinitely, burning API costs and amplifying whatever error it's stuck on.

Session budget:
  max_steps: 50
  max_runtime: 15 minutes
  on_limit: write handoff.json and stop
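That budget is a few lines of wrapper code. A sketch, assuming `agent_step()` is your single-iteration function and returns `True` when the task is done:

```python
import json
import time

MAX_STEPS = 50
MAX_RUNTIME_S = 15 * 60

def run_with_budget(agent_step) -> None:
    """Loop the agent under hard step and time limits; hand off instead of running forever."""
    start = time.monotonic()
    for step in range(MAX_STEPS):
        if time.monotonic() - start > MAX_RUNTIME_S:
            break
        if agent_step():
            return  # finished within budget
    # Budget exhausted: write a handoff and stop, rather than amplifying a stuck state.
    with open("handoff.json", "w") as f:
        json.dump({"reason": "budget_exhausted", "steps_used": step + 1}, f)
```

Note that the limit path is not an error path you hope never fires; a healthy agent hits it occasionally, and the handoff file is what makes that visible.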

5. No output validation

You validated the output manually in testing. In production, there's no manual check. If you're not enforcing a schema on every output and treating validation failures as exceptions, malformed responses pass through silently.
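A minimal schema check, done by hand here to keep it dependency-free (a library like `pydantic` or `jsonschema` does this more thoroughly). The field names and allowed statuses are hypothetical; the point is that a bad output raises instead of passing through.

```python
import json

# Hypothetical output contract — replace with your agent's actual schema.
REQUIRED = {"task_id": str, "result": str, "status": str}
ALLOWED_STATUS = {"done", "blocked", "needs_review"}

def validate_output(raw: str) -> dict:
    """Parse and schema-check agent output; treat any failure as an exception."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"output is not valid JSON: {e}")
    for field, typ in REQUIRED.items():
        if not isinstance(out.get(field), typ):
            raise ValueError(f"missing or mistyped field: {field}")
    if out["status"] not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {out['status']}")
    return out
```

Route the raised exceptions to the same escalation path as uncertainty: a validation failure is the agent saying "I produced something I can't vouch for."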

The Production Readiness Checklist

Before you call an agent "production ready," it needs:

  • [ ] Escalation rule (uncertain → stop, not guess)
  • [ ] Boot sequence with context age validation
  • [ ] Three-file state pattern (task / context / outbox)
  • [ ] Session budget with explicit limits and handoff behavior
  • [ ] Structured output schema with validation
  • [ ] Dead letter queue for failed tasks
  • [ ] Monitoring (not just logging—something that tells you when things go wrong)

If you're missing any of these, your agent is a prototype, not a production system.

The Real Cost of the Gap

The gap between testing and production isn't just about reliability. It's about trust.

Every time an agent makes a silent wrong decision in production, someone pays the cost—a user gets bad output, a task gets skipped, an API call fires twice, a message goes out wrong.

The prototype-to-production gap isn't solved by upgrading the model. It's solved by designing for failure from the start.


The patterns above—escalation rules, boot sequences, session budgets, structured output validation—are all part of the Ask Patrick Library. If you're running agents in production (or planning to), the full checklist with SOUL.md templates is at askpatrick.co/playbook.
