Wassim Chegham

Posted on May 18

Shipping AI Agents Like A Pro

#ai #production #agents #devops

Your agent works on your laptop. It plans a beautiful 4-day hiking trip with a fancy dinner, stays under budget, and nails the itinerary. You hit enter, lean back, and feel like a wizard.

Now ship it to 10,000 users. ...Still confident?

This is the final post in the series, and it's the one that ties everything together. We've spent four posts building up the pieces — failure modes, Agentic RAG, MCP, design patterns — and now we're going to talk about actually shipping this thing. Because the gap between a demo and production isn't features or model size. It's engineering discipline.

Most agents ship without idempotency, validation, budgets, or tracing. They work in the happy path and crumble everywhere else. Cool demos need hardening. Let's harden.

The Reference Architecture: Putting It All Together

Before we get to the checklist, let's zoom out and look at the full picture. Here's a reference architecture for a production-grade multi-agent system — the kind of thing our travel-planning agent would actually run on.

Let's break this down.

The Central Orchestrator is the brain. When a user request comes in, it kicks off the process — invokes the Router to classify the request, dispatches work to specialist agents, and manages the baton-passing between them until a final result is ready. You can implement this with frameworks like LlamaIndex, LangChain, Semantic Kernel, or plain custom code. The framework doesn't matter as much as the discipline.
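To make the flow concrete, here's a minimal sketch of the orchestrator handing work to specialists. Everything in it is an illustrative assumption: the `route` function is a keyword matcher standing in for a real LLM or classifier router, and the specialist functions are hypothetical placeholders, not part of any framework.

```python
from typing import Callable


# Hypothetical specialists: each takes the user request and returns a result.
def flight_specialist(request: str) -> str:
    return f"flight options for: {request}"


def dining_specialist(request: str) -> str:
    return f"restaurant options for: {request}"


SPECIALISTS: dict[str, Callable[[str], str]] = {
    "flights": flight_specialist,
    "dining": dining_specialist,
}


def route(request: str) -> list[str]:
    """Keyword router standing in for an LLM or classifier based router."""
    text = request.lower()
    labels = []
    if "flight" in text or "fly" in text:
        labels.append("flights")
    if "dinner" in text or "restaurant" in text:
        labels.append("dining")
    return labels


def orchestrate(request: str) -> dict[str, str]:
    """Classify the request, dispatch to each matching specialist, collect results."""
    results = {}
    for label in route(request):
        results[label] = SPECIALISTS[label](request)
    return results


plan = orchestrate("Find a flight to Patagonia and book a fancy dinner")
```

The point of the shape is that the router and the specialists only meet through a label and a function signature, so either side can be swapped without touching the other.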

MCP Servers sit on the right side. Each tool (flight search, weather API, hotel booking, database, restaurant lookup) runs as its own service. They can be written in different languages, maintained by different teams, and deployed independently. They all speak MCP, the Model Context Protocol, so the orchestrator talks to them in a consistent way. We covered this in depth in Post 3.

The Observability Layer is integrated throughout. Every agent step, every tool call, every decision gets logged, traced, and measured. Traces show the end-to-end path of each request. When something breaks at step 7 of 12, you don't guess — you look at the trace.

All four design patterns from Post 4 — router, specialists, PES, supervisor — slot into this architecture without coupling. They're conceptual building blocks, not framework-specific features.

Reference: The Azure AI Travel Agents sample implements many of these ideas. It's a great starting point — just remember it's a demo, not production-ready as-is.

The Production Checklist

Here's the checklist. Every item maps back to a failure mode we discussed in Post 1. Nothing here is theoretical — these are the things that catch fire when you skip them.

1. Idempotent Tools and Retries

Make your tools safe to call more than once: a repeated call with the same input must not repeat the side effect. This is non-negotiable.

In our travel agent, imagine the flight booking tool gets called, the API responds, but the network drops the response. The agent doesn't know if it worked. It retries. If your tool isn't idempotent, congratulations — you just booked two flights to Patagonia.

Idempotent tools fix this. The tool recognizes the duplicate request (via a request ID, a deduplication key, or a check-before-write pattern) and returns the existing result instead of creating a new one.
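Here's a rough sketch of the request-ID approach. The `FlightBookingTool` class, its `book` method, and the in-memory dictionary are all hypothetical stand-ins; a real tool would keep the deduplication record in durable, shared storage so it survives restarts and works across instances.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FlightBookingTool:
    """Hypothetical booking tool that deduplicates on a caller-supplied request ID."""

    # Maps request_id -> confirmation already issued for that request.
    _completed: dict[str, str] = field(default_factory=dict)

    def book(self, request_id: str, flight: str) -> str:
        # Duplicate request: return the stored result instead of booking again.
        if request_id in self._completed:
            return self._completed[request_id]
        # Stand-in for the real booking call, which has the side effect.
        confirmation = f"CONF-{uuid.uuid4().hex[:8]}"
        self._completed[request_id] = confirmation
        return confirmation


tool = FlightBookingTool()
request_id = str(uuid.uuid4())  # generated once per logical booking, reused on retry
first = tool.book(request_id, "FLIGHT-TO-PATAGONIA")
retry = tool.book(request_id, "FLIGHT-TO-PATAGONIA")  # simulated retry after a lost response
```

The important detail is that the caller generates the request ID once per logical booking and sends the same ID on every retry. If the ID were regenerated per attempt, the tool would have nothing to deduplicate on.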

Pair this with retries and exponential backoff. When a tool call fails or times out, don't just give up — retry with increasing delays. This alone dramatically improves reliability across multi-step workflows by handling the transient errors that are inevitable in any distributed system.
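A minimal sketch of retries with exponential backoff and jitter might look like this. The `TransientError` class, the `call_with_retries` helper, and the default delays are assumptions made for illustration; which errors count as retryable depends on your tools.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, dropped connections)."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn, retrying on TransientError with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the computed delay.
            sleep(random.uniform(0, delay))


attempts = {"count": 0}


def flaky_weather_lookup():
    # Simulated tool that times out twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("timeout")
    return {"forecast": "sunny"}


# The sleep function is injected so the example runs instantly.
result = call_with_retries(flaky_weather_lookup, sleep=lambda seconds: None)
```

Only errors you know to be transient should be retried. Retrying a validation failure or a permission error just burns time and budget.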

2. Schema Validation and Budgets

Use clear schemas for the data moving between steps and tools. Before calling a tool, validate that the required information is present and correctly formatted.

For our travel agent, before booking a flight, check:

Do we have confirmed dates?
Do we have a destination?
Does it fit the budget?

If any of these are missing, stop and get that information first.

This is the validation loop from Post 2 in action. Don't let the agent barrel forward with incomplete data.
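As a sketch, the pre-booking check could be a plain function that returns a list of problems. The `FlightBookingRequest` fields and the `missing_or_invalid` helper are hypothetical names chosen for this example; in practice you might express the same rules with JSON Schema, Pydantic, or Zod.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FlightBookingRequest:
    destination: Optional[str]
    depart: Optional[date]
    return_: Optional[date]
    price: Optional[float]
    budget: float


def missing_or_invalid(req: FlightBookingRequest) -> list[str]:
    """Return a list of problems; an empty list means the booking tool may be called."""
    problems = []
    if not req.destination:
        problems.append("destination is missing")
    if req.depart is None or req.return_ is None:
        problems.append("travel dates are not confirmed")
    elif req.return_ < req.depart:
        problems.append("return date is before departure")
    if req.price is None:
        problems.append("price is unknown")
    elif req.price > req.budget:
        problems.append("price exceeds budget")
    return problems


incomplete = FlightBookingRequest("Patagonia", date(2025, 3, 1), None, 900.0, 1200.0)
complete = FlightBookingRequest("Patagonia", date(2025, 3, 1), date(2025, 3, 5), 900.0, 1200.0)
```

Returning the full list of problems, rather than failing on the first one, gives the agent everything it needs to ask the user one good follow-up question.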

Then set budgets — hard limits on:

Maximum number of steps
Maximum tokens consumed
Maximum execution time
Maximum number of tool calls

Budgets prevent runaway agents. They stop the infinite loops. They stop the token burn. They're the guardrails that keep your agent from driving off a cliff while cheerfully telling you about restaurant options.
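One way to enforce these limits is a small budget object that the agent loop charges on every step. This is a sketch under assumptions: the `Budget` class, the `BudgetExceeded` exception, and the default limits are invented for illustration, not taken from any framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when the agent crosses any hard limit."""


@dataclass
class Budget:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_seconds: float = 60.0
    max_tool_calls: int = 30
    steps: int = 0
    tokens: int = 0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, steps: int = 0, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record usage, then raise if any hard limit has been crossed."""
        self.steps += steps
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token limit exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call limit exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("time limit exceeded")


budget = Budget(max_steps=5)
stopped_because = None
try:
    while True:  # stand-in for an agent loop that never decides it is done
        budget.charge(steps=1, tokens=100)
except BudgetExceeded as exc:
    # The loop is stopped when the sixth step is charged.
    stopped_because = str(exc)
```

Because the check lives outside the model, a confused agent can't talk its way past it. When the budget trips, return whatever partial result you have along with the reason.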

3. Full Workflow Tracing

Instrument your agent and tool calls for end-to-end tracing. One user request generates many internal steps. You need to see all of them.

Here's what a single request trace looks like for our travel agent:

When something fails at step 7, you dive into the trace and see exactly where and why. Was it the tool that timed out? Did the specialist get bad data? Did validation catch something it shouldn't have?

Use OpenTelemetry for tracing and metrics. Instrument per-node and per-tool. Make traces a first-class part of your system, not an afterthought.
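To show the shape of per-step instrumentation without pulling in a dependency, here's a standard-library stand-in. It is not the OpenTelemetry API: the `span` helper and the in-memory `SPANS` list are hypothetical, and in a real system OpenTelemetry spans and an exporter would take their place.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # in-memory sink; a real system would export spans to a tracing backend


@contextmanager
def span(trace_id, name, **attributes):
    """Record one step of a request, including its duration and outcome."""
    record = {"trace_id": trace_id, "name": name, "attributes": attributes, "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = f"error: {exc}"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


trace_id = uuid.uuid4().hex  # one correlation ID shared by every step of the request
try:
    with span(trace_id, "plan_trip"):
        with span(trace_id, "tool.weather", tool="weather"):
            pass  # stand-in for a successful tool call
        with span(trace_id, "tool.flight_search", tool="flight_search"):
            raise TimeoutError("flight search timed out")
except TimeoutError:
    pass

failed = [s["name"] for s in SPANS if s["status"] != "ok"]
```

Every record carries the same trace ID, so you can pull up one request and see which step failed, how long each step took, and what came before it.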

4. Production-Ready Systems Mindset

This isn't a checklist item you can tick off — it's the mindset that makes all the other items happen. Treat your agent as a secure, testable, monitorable component with clear interfaces. Not a magic prompt.

Each checklist item maps directly to a failure mode:

Checklist Item	Failure Mode It Addresses
Schema validation	State drift
Timeouts & retries	API timeouts, partial failures
Idempotent tools	Double-execution, inconsistent state
Full tracing	Debugging blind spots, compounding errors
Budget limits	Runaway agents, token burn, infinite loops

Systematically checking these items ensures your agent is reliable, not just smart.

The Quick-Reference Checklist

Here's the scannable version. Print it. Pin it. Tape it to your monitor.

#	Item	Why It Matters	How to Implement
1	Idempotent tools	Prevents double-execution on retries	Request IDs, deduplication keys, check-before-write
2	Retries with backoff	Handles transient failures gracefully	Exponential backoff, jitter, max-retry limits
3	Schema validation	Catches bad data before it propagates	JSON Schema, Pydantic, Zod — validate at every boundary
4	Budget limits	Stops runaway agents	Cap steps, tokens, time, and tool calls
5	End-to-end tracing	Makes debugging possible	OpenTelemetry, correlation IDs across all steps
6	Per-tool telemetry	Pinpoints slow or failing tools	Latency histograms, error rates, call counts
7	Graceful degradation	Keeps partial results useful	Fallback strategies, partial-response handling
8	Supervisor loop	Prevents infinite loops and drift	Explicit termination conditions, step counters
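Graceful degradation (item 7) is the one row we haven't walked through, so here's a small sketch of it. The `gather_with_fallbacks` helper and the simulated tools are hypothetical; the idea is simply that one failing tool shouldn't throw away what the others returned.

```python
def gather_with_fallbacks(tasks):
    """Run each (name, primary, fallback) task; keep partial results when some fail."""
    results, failures = {}, {}
    for name, primary, fallback in tasks:
        try:
            results[name] = primary()
        except Exception as exc:
            failures[name] = str(exc)  # remember why the primary call failed
            if fallback is not None:
                results[name] = fallback()
    return results, failures


def weather_lookup():
    raise ConnectionError("weather API unreachable")  # simulated outage


def restaurant_lookup():
    raise TimeoutError("restaurant service timed out")  # simulated timeout


tasks = [
    ("flights", lambda: ["flight A", "flight B"], None),
    ("weather", weather_lookup, lambda: "forecast unavailable, using seasonal averages"),
    ("dining", restaurant_lookup, None),
]
results, failures = gather_with_fallbacks(tasks)
```

Here the user still gets flights and a degraded weather answer, and the `failures` map tells the agent to be upfront that dining suggestions are missing instead of quietly making them up.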

Top 3 Takeaways

If you remember nothing else from this entire series, remember these three things:

Schema validation between steps. Don't let the agent run with missing or malformed data. Validate at every boundary.
Timeouts and retries for tool calls — and make your tools idempotent so retries are safe.
Trace everything. When your agent fails at step 7 of 12, you need to see the full path, not just the final error.

These three practices turn an AI agent from a fragile demo into a solid system.

Series Wrap-Up

We've covered a lot of ground across five posts. Here's the arc:

Why AI Agents Fail in Production — The compounding error problem, the reliability tax, and why 95% per-step accuracy gives you 60% end-to-end success over 10 steps.
Agentic RAG: Smarter Retrieval for Smarter Agents — Moving from "always retrieve everything" to conditional, intentional retrieval with validation loops.
MCP: Treating Tools as Contracts — Separating tool implementation from agent reasoning, enabling multi-team development and independent scaling.
4 Design Patterns That Make AI Agents Actually Reliable — Router, Specialist Agents, Plan-Execute-Summarize, and the Supervisor Loop.
The Production Checklist ← You are here — Idempotency, validation, budgets, tracing, and the systems mindset that ties it all together.

The core message across all five posts is simple:

A good agent is a system, not a prompt.

The mental model that should stick with you: tool calls are distributed system calls. They fail, they time out, and they return partial results — just like any backend service. Once you internalize that, everything else follows naturally. You add retries because calls fail. You add validation because data gets corrupted. You add tracing because you can't debug what you can't see. You add budgets because distributed systems can run away.

That's it. That's the whole series.

Thanks for sticking with me through all five posts. I hope these ideas save you some production incidents — or at least help you fix them faster when they inevitably happen.

Now go build agents that don't catch fire.

What's on your production checklist that I missed? Share your thoughts in the comments below!
