Your AI agent works perfectly in testing. Then you deploy it to production and everything falls apart.
Sound familiar? You're not alone. A recent survey found that 73% of AI projects never make it past the prototype stage, and the #1 reason cited is reliability — agents that work in demos but fail under real-world conditions.
After deploying dozens of AI agents across production environments, here's a battle-tested framework for making agents that actually work when it matters.
The Production Readiness Checklist
Before any agent goes live, it needs to pass every item on this list:
- [ ] Graceful degradation — when the LLM fails, the system doesn't
- [ ] Timeout enforcement — every API call has a hard timeout
- [ ] Idempotency — retrying the same request produces the same result
- [ ] Observability — every agent action is logged with context
- [ ] Rate limiting — the agent can't DoS itself or external APIs
- [ ] Human escalation — there's always a path to a real person
- [ ] Cost controls — per-customer and per-task spend limits
Miss any one of these and you'll learn about it at 3 AM.
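Two items on that list, rate limiting and idempotency, don't get a pattern of their own later in this post, so here is a minimal sketch of both. `TokenBucket` and `IdempotentExecutor` are hypothetical names chosen for illustration, and both keep their state in memory, which only works for a single process.

```python
import asyncio
import time


class TokenBucket:
    """Allow bursts of up to `capacity` calls, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock  # injectable so tests do not need to sleep
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False  # caller should queue, back off, or shed the request


class IdempotentExecutor:
    """Replay the stored result when the same idempotency key is retried.

    Covers sequential retries only: two concurrent calls with the same key
    would both run. A shared store with atomic writes is needed for that.
    """

    def __init__(self):
        self._results = {}

    async def run(self, key, fn, *args, **kwargs):
        if key in self._results:
            return self._results[key]
        result = await fn(*args, **kwargs)
        self._results[key] = result  # failures are not cached, so they can be retried
        return result
```

In practice the idempotency key would come from the request itself (an order ID, a message ID), so a retry after a timeout cannot book or charge twice.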
Pattern 1: The Circuit Breaker
Your agent calls an external API. That API goes down. What happens?
Without a circuit breaker: your agent keeps retrying a dependency that is already down, adding latency and cost on every attempt. Each request eventually times out, but only after it has eaten into your error budget.
With a circuit breaker:
from datetime import datetime, timedelta


class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout  # seconds to stay open before a trial call
        self.last_failure_time = None
        self.state = "closed"  # closed, open, half-open

    async def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.recovery_timeout):
                self.state = "half-open"  # Let one trial call through
            else:
                raise CircuitOpenError("Circuit is open: failing fast")
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            # A failed trial call re-opens the circuit immediately
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        # Any success closes the circuit and clears the failure streak,
        # so only consecutive failures count toward the threshold
        self.state = "closed"
        self.failure_count = 0
        return result
Production rule: Every external dependency gets a circuit breaker. No exceptions.
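To watch the state transitions without waiting out a real recovery window, here is a compact synchronous variant with an injectable clock. `MiniBreaker` is a hypothetical name and this is a sketch for experimenting, not a drop-in replacement for the class above.

```python
import time


class CircuitOpenError(Exception):
    """The call was rejected because the circuit is open."""


class MiniBreaker:
    def __init__(self, failure_threshold=2, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can move time forward
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self._clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast")
            # Recovery window has elapsed: fall through and allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Passing `clock=lambda: fake_now[0]` in a test lets you trip the breaker, advance `fake_now[0]` past the recovery timeout, and confirm that a successful trial call closes it again.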
Pattern 2: The Multi-Model Fallback
GPT-4 goes down. Or returns garbage. Or is suddenly 10x slower. Your agent needs to keep working.
# Model fallback chain with automatic switching.
# call_llm, log_model_usage, get_cached_response, escalate_to_human and the
# RateLimitError / APIError types are assumed to be defined elsewhere in your app.
MODEL_CHAIN = [
    {"model": "gpt-4o", "max_latency_ms": 5000},
    {"model": "claude-sonnet-4-20250514", "max_latency_ms": 4000},
    {"model": "gpt-4o-mini", "max_latency_ms": 2000},
]


async def call_with_fallback(prompt, **kwargs):
    for model_config in MODEL_CHAIN:
        try:
            result = await call_llm(
                model_config["model"],
                prompt,
                timeout_ms=model_config["max_latency_ms"],
                **kwargs,
            )
            log_model_usage(model_config["model"], success=True)
            return result
        except (TimeoutError, RateLimitError, APIError) as e:
            log_model_usage(model_config["model"], success=False, error=str(e))
            continue
    # All models failed: serve a cached response if there is one, otherwise escalate
    cached = await get_cached_response(prompt)
    if cached:
        return cached
    # Assumes escalate_to_human is async like the other helpers
    return await escalate_to_human(prompt)
Production rule: Never depend on a single model provider. Always have a fallback chain.
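Here is a self-contained sketch of the same idea that you can run without any API keys. The three "models" are stand-in coroutines, `first_success` is a hypothetical name, and the per-provider timeout is enforced with `asyncio.wait_for`.

```python
import asyncio


async def first_success(prompt, providers):
    """providers: ordered list of (name, async callable, timeout in seconds)."""
    errors = []
    for name, fn, timeout_s in providers:
        try:
            reply = await asyncio.wait_for(fn(prompt), timeout=timeout_s)
            return name, reply
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Record why this provider was skipped, then try the next one
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-ins for real model clients
async def slow_model(prompt):
    await asyncio.sleep(1.0)
    return "too late"


async def broken_model(prompt):
    raise ConnectionError("503 from provider")


async def small_model(prompt):
    return f"echo: {prompt}"


async def main():
    chain = [
        ("primary", slow_model, 0.05),
        ("secondary", broken_model, 1.0),
        ("small", small_model, 1.0),
    ]
    return await first_success("hi", chain)
```

Running `asyncio.run(main())` skips the slow and broken providers and returns the reply from the last one, along with its name so you can log which model actually served the request.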
Pattern 3: Structured Output Validation
LLMs generate text. Your agent needs structured data. The gap between these two things is where most production failures live.
from pydantic import BaseModel, ValidationError
from typing import Optional


class AppointmentBooking(BaseModel):
    customer_name: str
    date: str
    time: str
    service_type: str
    phone_number: Optional[str] = None


async def book_appointment(agent_response: str) -> AppointmentBooking:
    """Parse agent response into a validated appointment booking."""
    try:
        parsed = await parse_structured_output(agent_response, AppointmentBooking)
    except ValidationError as e:
        # Re-ask the agent once, giving it both the bad output and the error
        corrected = await call_llm(
            f"The previous response had errors: {e}\n"
            f"Previous response: {agent_response}\n"
            "Please correct and return a valid appointment booking.",
            format="json",
        )
        # A second ValidationError propagates so the caller can escalate
        parsed = await parse_structured_output(corrected, AppointmentBooking)
    # The business-rule check applies to the corrected response too
    if not is_business_hour(parsed.date, parsed.time):
        raise ValueError(f"Booking outside business hours: {parsed.date} {parsed.time}")
    return parsed
Production rule: Every LLM output goes through schema validation. Re-ask once on failure, then escalate.
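The "re-ask once, then escalate" rule can be sketched on its own, using only the standard library so it runs anywhere. `parse_booking`, `parse_with_one_reask` and `EscalationRequired` are hypothetical names, and the hand-rolled field check stands in for a real schema validator such as Pydantic.

```python
import asyncio
import json

REQUIRED_FIELDS = ("customer_name", "date", "time", "service_type")


class EscalationRequired(Exception):
    """The output was still invalid after one correction attempt."""


def parse_booking(raw):
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    bad = [f for f in REQUIRED_FIELDS if not isinstance(data.get(f), str) or not data[f]]
    if bad:
        raise ValueError(f"missing or invalid fields: {bad}")
    return data


async def parse_with_one_reask(raw, reask):
    """reask(raw, error_message) is an async callable that returns a corrected string."""
    try:
        return parse_booking(raw)
    except ValueError as first_error:
        corrected = await reask(raw, str(first_error))
    try:
        return parse_booking(corrected)
    except ValueError as second_error:
        # One retry only: a second failure goes to a human, not back to the model
        raise EscalationRequired(str(second_error)) from second_error
```

Capping the loop at one retry keeps a confused model from burning tokens on the same mistake, and the raised exception carries the last validation error for whoever picks up the escalation.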
Pattern 4: Cost Guardrails
An AI agent without cost controls is a credit card with no limit. I've seen agents accidentally loop and burn through $500 in a single night.
from datetime import date


class CostExceededError(Exception):
    """A single task's estimated cost exceeds the per-task limit."""


class DailyBudgetExceededError(Exception):
    """A customer's spend for today would exceed the daily limit."""


class CostGuardrail:
    def __init__(self, max_cost_per_task=0.50, max_cost_per_customer_daily=5.00):
        self.max_cost_per_task = max_cost_per_task
        self.max_cost_per_customer_daily = max_cost_per_customer_daily
        # Keyed by (customer_id, date) so the daily total starts over each day
        self.customer_spend = {}

    async def check_before_call(self, customer_id, estimated_cost):
        daily_spend = self.customer_spend.get((customer_id, date.today()), 0)
        if estimated_cost > self.max_cost_per_task:
            raise CostExceededError(
                f"Task cost ${estimated_cost:.2f} exceeds limit ${self.max_cost_per_task:.2f}"
            )
        if daily_spend + estimated_cost > self.max_cost_per_customer_daily:
            raise DailyBudgetExceededError(
                f"Customer daily spend ${daily_spend:.2f} + ${estimated_cost:.2f} "
                f"exceeds limit ${self.max_cost_per_customer_daily:.2f}"
            )

    async def record_spend(self, customer_id, actual_cost):
        key = (customer_id, date.today())
        self.customer_spend[key] = self.customer_spend.get(key, 0) + actual_cost
Production rule: Set per-task and per-customer daily limits. Alert when spend reaches 80% of a limit. No exceptions.
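The guardrail needs two inputs that are easy to gloss over: an estimated cost before the call, and a way to tell "getting close" from "over". Here is a rough sketch of both. The function names are hypothetical, and the per-1K-token prices are placeholders you would replace with your provider's current rates.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate the cost of one call from token counts and per-1K-token prices."""
    return (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )


def budget_status(spent_usd, limit_usd, alert_ratio=0.8):
    """Classify spend against a limit: 'ok', 'alert' (at or past 80%), or 'blocked'."""
    if spent_usd >= limit_usd:
        return "blocked"
    if spent_usd >= alert_ratio * limit_usd:
        return "alert"
    return "ok"


# Placeholder prices, not real provider rates
estimated = estimate_cost_usd(
    prompt_tokens=2000,
    completion_tokens=500,
    price_in_per_1k=0.01,
    price_out_per_1k=0.03,
)
status = budget_status(spent_usd=4.50, limit_usd=5.00)
```

Floats are fine for a sketch like this. For billing-grade accounting, `decimal.Decimal` avoids rounding drift when many small amounts are summed.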
Pattern 5: Observability Stack
You can't fix what you can't see. Production AI agents need three layers of observability:
1. Agent-Level Logging
import structlog

logger = structlog.get_logger()


async def agent_action(action_type, context, result, duration_ms, cost_usd):
    await logger.ainfo(
        "agent_action",
        action_type=action_type,
        customer_id=context.get("customer_id"),
        intent=context.get("intent"),
        success=result.get("success"),
        duration_ms=duration_ms,
        cost_usd=cost_usd,
        model=result.get("model"),
        tokens=result.get("tokens_used"),
    )
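The logging call above takes `duration_ms` as an argument, so something has to measure it. One way to do that is a small wrapper around each action. This sketch uses the standard library's `logging` and `json` instead of structlog so it runs without extra dependencies, and `timed` is a hypothetical helper name.

```python
import asyncio
import json
import logging
import time

log = logging.getLogger("agent")
# Configure logging in your app (for example logging.basicConfig(level=logging.INFO))
# to actually see these records.


async def timed(action_type, fn, *args, **kwargs):
    """Await fn(*args, **kwargs) and log one JSON record with outcome and duration."""
    start = time.perf_counter()
    success = False
    try:
        result = await fn(*args, **kwargs)
        success = True
        return result
    finally:
        # Runs on success and on failure, so failed actions are timed too
        duration_ms = (time.perf_counter() - start) * 1000
        log.info(json.dumps({
            "event": "agent_action",
            "action_type": action_type,
            "success": success,
            "duration_ms": round(duration_ms, 1),
        }))
```

Because the log line is written in `finally`, an action that raises still produces a record with `success` set to false, and the exception continues to propagate to the caller.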
2. Business Metrics Dashboard
Track these metrics daily:
- Task completion rate — % of interactions that end successfully
- Human escalation rate — % of interactions routed to humans
- Average cost per task — total LLM spend / completed tasks
- Latency P50/P95/P99 — how fast your agent responds
- Customer satisfaction — CSAT or NPS after agent interactions
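Most of these can be computed straight from the per-interaction log records. Here is a minimal sketch, assuming each record is a dict with `completed`, `escalated`, `latency_ms` and `cost_usd` keys (hypothetical field names). Customer satisfaction is left out because it comes from survey data, not from agent logs.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def daily_metrics(interactions):
    """Summarise one day of interaction records into the dashboard metrics."""
    total = len(interactions)
    if total == 0:
        raise ValueError("no interactions to summarise")
    completed = sum(1 for i in interactions if i["completed"])
    escalated = sum(1 for i in interactions if i["escalated"])
    total_cost = sum(i["cost_usd"] for i in interactions)
    latencies = [i["latency_ms"] for i in interactions]
    return {
        "task_completion_rate": completed / total,
        "escalation_rate": escalated / total,
        # Total spend divided by completed tasks, matching the definition above
        "avg_cost_per_task": total_cost / completed if completed else None,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }
```

Note that the cost metric divides by completed tasks only, so failed and escalated interactions make each successful task look more expensive, which is usually the signal you want.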
3. Alert Rules
ALERT if task_completion_rate < 85% for 15 minutes
ALERT if escalation_rate > 30% for 30 minutes
ALERT if cost_per_task > $0.75 for any hour
ALERT if P99 latency > 10 seconds for 15 minutes
ALERT if error_rate > 5% for any 5-minute window
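Each rule has the shape "condition holds for N minutes", which is different from "condition was true once". If your monitoring tool doesn't evaluate that for you, the check can be sketched like this. `sustained_breach` is a hypothetical helper, and samples are assumed to be `(timestamp_seconds, value)` pairs sorted oldest first.

```python
def sustained_breach(samples, is_breach, window_s, now):
    """True if the most recent unbroken run of breaching samples spans window_s or more."""
    streak_start = None
    # Walk back from the newest sample until the condition stops holding
    for ts, value in reversed(samples):
        if not is_breach(value):
            break
        streak_start = ts
    return streak_start is not None and now - streak_start >= window_s


# task_completion_rate sampled every 5 minutes (timestamps in seconds)
completion_rate = [(0, 0.90), (300, 0.80), (600, 0.82), (900, 0.81), (1200, 0.79)]

fire = sustained_breach(
    completion_rate,
    is_breach=lambda rate: rate < 0.85,
    window_s=15 * 60,
    now=1200,
)
```

One gap in this sketch: it does not notice when samples stop arriving. A real alerting setup should also page on missing data, since a dead metrics pipeline otherwise looks like a quiet night.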
Pattern 6: The Human Escalation Pathway
AI agents fail. The question isn't if — it's how gracefully.
A proper escalation system has three tiers:
- Auto-retry with correction — agent re-attempts with error context (handles 70% of failures)
- Alternative path — switch to a simpler approach or different model (handles 20% of failures)
- Human handoff — route to a human with full context (handles the remaining 10%)
async def handle_with_escalation(user_message, context):
    # Tier 1: Try with the best model
    try:
        result = await agent.handle(user_message, context)
        if result.confidence > 0.85:
            return result
    except AgentError:
        pass

    # Tier 2: Try simpler approach
    try:
        result = await simple_agent.handle(user_message, context)
        if result.confidence > 0.7:
            return result
    except AgentError:
        pass

    # Tier 3: Human handoff
    await notify_human(
        message="Agent escalation needed",
        context=context,
        conversation_history=context.get("history"),
        reason="All automated paths failed",
    )
    return EscalationResult(escalated=True)
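"Full context" is more useful to the human if it says why each automated tier gave up, not just that they all did. Here is a generic sketch of the same tiered flow that records the reason per tier. `run_tiers` is a hypothetical helper, and each handler is assumed to be an async callable returning a `(reply, confidence)` pair.

```python
import asyncio


async def run_tiers(tiers, message):
    """tiers: ordered list of (name, async handler, minimum confidence)."""
    attempts = []
    for name, handler, min_confidence in tiers:
        try:
            reply, confidence = await handler(message)
        except Exception as exc:  # real code should catch a narrower agent error type
            attempts.append({"tier": name, "outcome": f"error: {type(exc).__name__}"})
            continue
        if confidence >= min_confidence:
            return {"escalated": False, "tier": name, "reply": reply}
        attempts.append({"tier": name, "outcome": f"low confidence: {confidence:.2f}"})
    # Nothing cleared its threshold: hand off with the full trail of attempts
    return {"escalated": True, "message": message, "attempts": attempts}


# Stand-in handlers for illustration
async def primary(message):
    return "maybe Tuesday?", 0.50


async def simple(message):
    raise RuntimeError("template not found")


handoff = asyncio.run(run_tiers(
    [("primary", primary, 0.85), ("simple", simple, 0.70)],
    "Can I move my appointment?",
))
```

The `attempts` list is what goes into the handoff notification, so the person picking it up can see that the first tier answered with low confidence and the second one crashed.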
The Deployment Checklist
Before pushing any agent to production, verify:
- [ ] Circuit breakers on all external dependencies
- [ ] Multi-model fallback chain configured
- [ ] Output validation with Pydantic schemas
- [ ] Cost guardrails (per-task and per-customer)
- [ ] Structured logging with business context
- [ ] Alert rules configured in monitoring
- [ ] Human escalation pathway tested end-to-end
- [ ] Load tested at 2x expected peak traffic
- [ ] Rollback plan documented and tested
- [ ] On-call runbook written and accessible
Real Results
After implementing these patterns across production agents:
- Task completion rate: 94% (up from 72%)
- Human escalation rate: 6% (down from 28%)
- Average cost per task: $0.08 (down from $0.32)
- P99 latency: 3.2 seconds (down from 12 seconds)
- 3 AM pages: 0 (down from 2-3 per week)
The difference between a demo agent and a production agent isn't intelligence — it's engineering discipline. These patterns turn a fragile prototype into something that runs reliably 24/7.
Want to skip the infrastructure work? The RCS Developer Starter Kit includes production-ready agent templates with circuit breakers, cost guardrails, and monitoring built in — so you can focus on your agent logic, not the plumbing.
Follow @rcsxplatform for more on production AI agent deployment.