Gursharan Singh

Posted on May 18

AI Agents in Practice — Part 1: The Demo Worked. Production Didn't.

#ai #agents #architecture #webdev

Part 1 of 8 — AI Agents in Practice

TechNova is a fictional company used as a running example throughout this series.

In this series, an agent means an LLM-powered system that can decide what to do next, call tools, observe the result, and continue across multiple turns.

Not just a chatbot. A chatbot replies to one turn at a time. An agent can act across turns and carry state between them.

Not a fixed workflow. A workflow runs the steps the developer wrote. An agent can choose the next step at runtime, within boundaries.

(Classic AI uses the word "agent" for a wide range of systems — search agents, planning agents, and reinforcement-learning agents. In this series, "agent" means a software system where a Large Language Model is the core decision engine.)

Agents are useful because they can act. Agents are risky for the same reason.

This article is about why agent demos break in production.

A TechNova engineer ships a customer support agent on Tuesday. The demo to leadership goes well. The agent handles cancellations, looks up orders, processes refunds. Everyone's impressed.

By Friday, a customer named Priya messages support: "Hi, I'd like to cancel order #4471 and get a refund."

The agent responds: "Done! I've cancelled order #4471 and issued a refund of $89.50. You'll see it in 3–5 business days."

Priya's order shipped yesterday. It's already on a truck. The agent didn't check.

The refund is gone. The product is still coming. TechNova just paid Priya $89.50 to keep her merchandise.

Priya wasn't the first. By the time customer service noticed, the agent had handled twenty-three similar cases. The cost wasn't just the refunds — it was the two days untangling the damage, the policy review that followed, and the next AI rollout the team didn't get to do.

Nothing in production changed. The model didn't degrade. The code didn't break. The agent did exactly what it did in the demo — confidently, fluently, wrong.

This article is about why.

We'll define "agent" more precisely in Part 2. For now, hold on to the practical sense of it: the model is not just answering, it is acting.

The Demo That Worked (Until It Didn't)

The cancellation/refund agent is the easiest possible production agent. Three tools: get_order_status, cancel_order, issue_refund. A system prompt explaining what they do. A model that decides which to call.

In the demo, the engineer typed: "Cancel order #1003 and refund the customer."

The agent called get_order_status → "pending." Then cancel_order(#1003) → success. Then issue_refund(#1003) → success. Total time: 4 seconds. Total turns: 3.

Leadership applauded. The agent works.

What leadership didn't see:

The demo used a hand-picked order that was definitely cancellable
Nobody asked what happens if the order is already shipped
Nobody asked what happens if the refund tool fails halfway through
Nobody asked what happens if the customer says "actually never mind" mid-conversation
Nobody asked whether the agent should ever check before doing something irreversible

The demo is not the system. The demo is the happy path with the rough edges sanded off.

(Production is mostly rough edges.)

Three Things The Demo Hid

When the team went back and looked at the twenty-three cases, every failure mapped to one of three gaps. None of them is exotic. All three are present in the simplest possible agent.

Hidden problem #1: The agent has no idea what state the system is in.

In the demo, the order was cancellable. In production, orders move through states: pending → confirmed → picked → packed → shipped → delivered. Each state changes what's allowed.

The agent's cancel_order tool will happily try to cancel a shipped order. The API will return success — or partial success, or a misleading error message — depending on what the backend decided to do that month. The agent doesn't know which.

The agent isn't reading the order's actual state and deciding what's permitted. It's reading the user's request and deciding what tools sound relevant.
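One way to picture "reading the order's actual state" is a small permission check in code. This is a minimal sketch, not TechNova's actual implementation: the state names come from the lifecycle above, the "cancellable only before shipping" rule is taken from the cancel_order description later in this article, and the names OrderStatus, CANCELLABLE, and can_cancel are hypothetical.

```python
from enum import Enum


class OrderStatus(Enum):
    """Order lifecycle states, in the order listed in the text."""
    PENDING = "pending"
    CONFIRMED = "confirmed"
    PICKED = "picked"
    PACKED = "packed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"


# Assumption: an order may be cancelled only before it ships.
CANCELLABLE = frozenset({
    OrderStatus.PENDING,
    OrderStatus.CONFIRMED,
    OrderStatus.PICKED,
    OrderStatus.PACKED,
})


def can_cancel(status: str) -> bool:
    """Return True only if the backend-reported status permits cancellation.

    Unknown status strings are treated as not cancellable.
    """
    try:
        return OrderStatus(status) in CANCELLABLE
    except ValueError:
        return False


print(can_cancel("pending"))  # True
print(can_cancel("shipped"))  # False
```

The point of the sketch is where the decision lives: the answer depends on the order's state, not on how the customer phrased the request.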

Hidden problem #2: The agent doesn't know when to stop.

If cancel_order returns success, did the cancellation actually happen? If issue_refund returns success, was the money actually moved? If both succeeded, is the case closed?

In the demo, the engineer stopped the agent by closing the chat. In production, there's no engineer. The agent decides when it's done. "Done" can mean four things: the task was completed correctly; the task was completed incorrectly; the task was partially completed and the agent is now trying to fix it with more tool calls; or the task was abandoned because the model decided to apologize and ask if there's anything else it can help with.

All four look identical from the outside. All four end with a confident "Done!" message to the customer.
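A rough illustration of the difference between "the tool said success" and "the change actually happened": re-read the order after acting and classify the outcome. This is a sketch under stated assumptions. The Outcome enum, the cancel_and_verify helper, and the "cancelled" status string are all hypothetical; the source does not say what a cancelled order reports.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    VERIFIED = "verified"      # backend state matches what the tool claimed
    UNVERIFIED = "unverified"  # tool reported success, but state disagrees
    FAILED = "failed"          # tool reported failure


def cancel_and_verify(
    order_id: str,
    cancel_order: Callable[[str], bool],
    get_order_status: Callable[[str], str],
) -> Outcome:
    """Call the cancel tool, then re-read the order instead of trusting the reply."""
    if not cancel_order(order_id):
        return Outcome.FAILED
    # Assumption: a cancelled order reports the status "cancelled".
    if get_order_status(order_id) == "cancelled":
        return Outcome.VERIFIED
    return Outcome.UNVERIFIED


# Stub backend that says "success" but leaves a shipped order untouched.
orders = {"4471": "shipped"}
result = cancel_and_verify("4471", lambda oid: True, lambda oid: orders[oid])
print(result)  # Outcome.UNVERIFIED
```

With an explicit outcome, the message to the customer can depend on what was verified rather than on what the tool claimed.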

Hidden problem #3: The agent has no path for "I shouldn't do this."

The agent has tools for cancelling and refunding. It has no tool for "this is a case I shouldn't handle." It has no concept of escalation. If a request looks even vaguely like a cancellation, the agent's available actions are: cancel, refund, or both.

There is no "ask a human" button. There is no "this is outside my scope" path. The agent's possible outcomes are the tools it was given — and the tools it was given assume the agent is making the right call.

Priya's order shipped. The right call was to stop. The agent had no stop available.
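To make the missing option concrete, here is a sketch of what a "stop" could look like if it were just another tool. Everything here is hypothetical: the escalate_to_human tool, the HandoffQueue class, and the stubbed status reply are illustrations, not part of the agent described in this article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffQueue:
    """In-memory stand-in for a human support queue (hypothetical)."""
    tickets: List[dict] = field(default_factory=list)

    def escalate_to_human(self, order_id: str, reason: str) -> str:
        """Record the case for a person and tell the agent to stop acting on it."""
        self.tickets.append({"order_id": order_id, "reason": reason})
        return f"Escalated order {order_id}; take no further action."


queue = HandoffQueue()

# The tool set the agent sees now includes a way to stop.
TOOLS = {
    "get_order_status": lambda order_id: "shipped",  # stubbed backend reply
    "escalate_to_human": queue.escalate_to_human,
}

observation = TOOLS["escalate_to_human"]("4471", "order already shipped")
print(observation)         # Escalated order 4471; take no further action.
print(len(queue.tickets))  # 1
```

Adding the tool does not guarantee the model will choose it. It only means that stopping is now among the possible outcomes.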

The Agent That Stuffs Everything Into the Prompt

A common reaction to the three hidden problems is: "Just tell the agent."

Add a rule to the system prompt: don't cancel shipped orders. Add another: check status first. Add another: escalate refunds over $100. Add another: don't refund if the order is in a return-eligible state. Add another: ...

Here's what that system prompt starts looking like a week in:

You are TechNova's customer support agent. You help customers with order
questions, cancellations, refunds, and shipping issues. Be helpful,
professional, and concise.

You have access to the following tools:

- get_order_status(order_id): returns the current status of an order.
  Statuses include pending, confirmed, picked, packed, shipped, delivered.
- cancel_order(order_id): cancels an order. Use only if not yet shipped.
- issue_refund(order_id, amount): refunds the customer. Use after cancel,
  or for delivered orders with an approved return.

To use a tool, respond in this exact format:
Thought: <your reasoning>
Action: <tool_name>
Action Input: <arguments as JSON>

After you receive the Observation, continue with another Thought/Action
cycle or give a final answer to the customer.

STRICT RULES — follow these on every turn:
1. Always check order status before any cancellation or refund action.
2. Do not cancel a shipped order. Offer a return when the package arrives.
3. For refunds under $50, you may skip the status check to keep latency low.
4. If the customer mentions a delivery issue, do not refund without
   confirming with the carrier first.
5. Always include the carrier name when discussing shipping status.
   Do not just say "the courier."
6. Do not apologize repeatedly or ask "is there anything else?" at the end
   of every turn.
7. Stop after the final answer is given.

A realistic customer support agent system prompt, roughly a week into production.

This is what manual ReAct looks like in practice. ReAct stands for Reason + Act: the model "thinks out loud" and chooses an action; your code parses that text, runs the named tool, and feeds the result back to the model as an observation.

The STRICT RULES section is the part that keeps growing as the developer discovers new edge cases.
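The "your code parses that text" step can be sketched as follows. This is a minimal illustration that assumes the model follows the Thought / Action / Action Input format from the prompt above; the regex and the parse_action helper are hypothetical, and a real parser would need to handle more ways the model can drift from the format.

```python
import json
import re
from typing import Optional, Tuple

# Matches "Action: <tool>" followed by "Action Input: <JSON object>".
ACTION_RE = re.compile(
    r"Action:\s*(?P<tool>\w+)\s*\n\s*Action Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_action(model_output: str) -> Optional[Tuple[str, dict]]:
    """Extract (tool_name, arguments) from a Thought/Action/Action Input reply.

    Returns None when the reply has no well-formed action, for example a
    final answer or a reply that drifted from the format.
    """
    match = ACTION_RE.search(model_output)
    if match is None:
        return None
    try:
        args = json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None
    return match.group("tool"), args


reply = (
    "Thought: I should check the order first.\n"
    "Action: get_order_status\n"
    'Action Input: {"order_id": "4471"}'
)
print(parse_action(reply))  # ('get_order_status', {'order_id': '4471'})
print(parse_action("Done! I've cancelled your order."))  # None
```

Note that the parser cannot tell a legitimate final answer from a reply that simply ignored the format: both come back as None.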

Things this prompt tries to do in natural language:

Define what the agent's role is
Explain what tools exist and what they do
Explain what format the agent should respond in
Explain how to parse the agent's response
Forbid specific behaviors
Explain what to do when things go wrong
Explain when to stop

Every one of those rules is a real production concern. Every one of them is encoded as English, in the prompt, in a single block of text the model is asked to follow precisely on every turn.

This works in demos. The demos use short conversations and well-behaved inputs.

It breaks in production because:

The model sometimes follows the rules and sometimes ignores them
Adding a new rule can make the model stop following an old rule
The rules contradict each other in edge cases the developer didn't anticipate
The rules are documentation for the model, not enforcement
The model can treat text in tool outputs as further instructions, and the rules don't catch that
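To illustrate the difference between documentation and enforcement, here is a sketch of rules 1 and 2 from the prompt checked in code instead of stated in prose. It is one possible shape, not the design this series proposes; the GuardedTools class and its refusal messages are hypothetical.

```python
from typing import Callable, Dict


class GuardedTools:
    """Wraps the raw tools so rules 1 and 2 are checked in code, not just in prose."""

    def __init__(self, get_status: Callable[[str], str], cancel: Callable[[str], str]):
        self._get_status = get_status
        self._cancel = cancel
        self._seen: Dict[str, str] = {}  # order_id -> last status read this session

    def get_order_status(self, order_id: str) -> str:
        """Read the status from the backend and remember that it was read."""
        status = self._get_status(order_id)
        self._seen[order_id] = status
        return status

    def cancel_order(self, order_id: str) -> str:
        """Cancel only if the status was read first and the order has not shipped."""
        status = self._seen.get(order_id)
        if status is None:
            return "refused: status not checked"
        if status in ("shipped", "delivered"):
            return f"refused: order is {status}"
        return self._cancel(order_id)


tools = GuardedTools(get_status=lambda oid: "shipped", cancel=lambda oid: "cancelled")
print(tools.cancel_order("4471"))  # refused: status not checked
tools.get_order_status("4471")
print(tools.cancel_order("4471"))  # refused: order is shipped
```

Whatever the model decides to call, the wrapper gives the same answer for a shipped order. A rule in the prompt can be ignored; a check in the tool cannot.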

The prompt is doing the job of: a schema, a state machine, a permission system, a parser, a stopping condition, and a procedure manual. All in English. All in one block. All re-read on every turn.

This series is going to argue that each of these jobs has a better home. But not yet. For now, just sit with the picture.

The Shape of the Production Gap

The gap between a demo agent and a production agent is not the model. The model is the same.

The gap is everything around the model:

State — the demo has a clean, controlled situation. Production has whatever state the world is in when the customer messages.
Tools — the demo uses tools that work. Production tools fail, change behavior, return ambiguous results, get deprecated, time out.
Stopping — the demo stops when the engineer stops it. Production has to stop itself.
Boundaries — the demo trusts the agent. Production needs to know when to ask, when to escalate, when to refuse.
Cost — the demo runs once. Production runs millions of times. Tokens, latency, retries, idle waits, and confidently-wrong actions all compound.
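The stopping item in the list above can be sketched as a loop with a hard turn budget. This is an illustration only: run_agent and next_step are hypothetical names, and next_step stands in for the model call plus tool execution.

```python
from typing import Callable, List, Optional


def run_agent(
    next_step: Callable[[List[str]], Optional[str]],
    max_turns: int = 5,
) -> str:
    """Run a decide-act loop that cannot exceed max_turns.

    next_step receives the history so far and returns an observation,
    or None when it considers the task done.
    """
    history: List[str] = []
    for _ in range(max_turns):
        observation = next_step(history)
        if observation is None:
            return "finished"
        history.append(observation)
    # The budget ran out before the agent declared itself done.
    return "stopped: turn limit reached"


# A stand-in agent that keeps retrying and never finishes.
print(run_agent(lambda history: "retrying refund", max_turns=3))
# stopped: turn limit reached
```

A turn limit says nothing about whether the task was done correctly. It only guarantees that the loop ends, and it makes the "ran out of budget" case visible instead of letting it look like success.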

TechNova's first instinct was to upgrade the model. They tested a more capable one against the same scenarios. The smarter model still cancelled shipped orders. It still calculated the wrong refund amounts. It still didn't escalate. A better model navigating the same broken environment follows the same broken paths.

Demo agent	Production agent
Clean state	Whatever state the world is in
Tools that work	Tools that fail, change, time out
Engineer stops it	Has to stop itself
Trusted	Bounded
Runs once	Runs millions of times

Same model, different surroundings.

A production agent isn't a demo with better prompts. A production agent is a system designed around the model, with the model as one component among several.

The most dangerous agent isn't the one that fails visibly. It's the one that completes the wrong task confidently. Priya's agent didn't crash. It didn't error. It didn't escalate. It said "Done!" — and it was wrong.

That confident-and-wrong failure mode is what this series is about.

This series assumes you're building an agent and need it to work in production. Patterns over products. Bounded autonomy over hype. The next part starts with the most important unanswered question: what is an agent, in engineering terms, and how is it different from the chatbot or workflow you've already built?

Three takeaways

A demo is not a system. The demo hides state, hides failure modes, hides the question of when to stop. Production is mostly the parts the demo hides.
The most dangerous failure mode is the confident-and-wrong one. Priya's agent didn't crash. It didn't error. It said "Done!" — and it was wrong. An agent that crashes is easy to fix. An agent that confidently completes the wrong task is the one that costs you real money before anyone notices.
The model is not the gap. The gap is everything around the model — state, tools, stopping, boundaries, cost. Better prompts don't close the gap. Better systems around the model do.

In this part, we looked at why agent demos often break in production — not because the model failed, but because the system around the model didn't have the right pieces in the right places. Priya's refund happened because the agent had no state to read, no boundary to refuse, and no path to escalate.

In Part 2, we'll define what an agent is in engineering terms — a control loop with tools, state, and boundaries — and start naming the components a production agent composes.

Next: What Makes Something an Agent (Part 2 of 8)