Three months ago, I deployed an AI agent to production. Today it handles 50,000+ messages a month with no user-facing downtime. But here's the thing: none of the tutorials prepared me for what actually happened.
Everyone shows you the shiny "hello world" chatbot. Nobody shows you what happens when real users spam your API at 3 AM, or when your LLM decides to hallucinate customer data.
This is that story.
The Promise vs. The Reality
What tutorials show you:
# The "perfect" AI agent
agent = AIAgent(model="gpt-4")
response = agent.chat("Hello!")
print(response) # Magic! ✨
What production looks like:
graph TB
A[User Message] --> B{Rate Limiter}
B -->|Allowed| C[Queue System]
B -->|Blocked| D[429 Response]
C --> E{Health Check}
E -->|Healthy| F[AI Agent]
E -->|Degraded| G[Fallback Handler]
F --> H{Response Validator}
H -->|Valid| I[User]
H -->|Hallucination| J[Retry Logic]
G --> I
J --> F
Notice the difference? Production AI agents need six layers of protection that tutorials never mention.
The Five Hard Truths About Production AI Agents
1. Rate Limiting Isn't Optional - It's Survival
The tutorial way:
# YOLO approach
while True:
    message = get_message()
    response = ai_agent.process(message)
The production way:
from collections import defaultdict
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_limit=100):
        self.limits = defaultdict(lambda: {"count": 0, "reset": datetime.now()})
        self.base_limit = base_limit

    def check_limit(self, user_id: str, risk_score: float) -> bool:
        """Adaptive rate limiting based on user behavior"""
        limit_data = self.limits[user_id]

        # Reset window
        if datetime.now() > limit_data["reset"]:
            limit_data["count"] = 0
            limit_data["reset"] = datetime.now() + timedelta(hours=1)

        # Adjust limit based on risk
        adjusted_limit = int(self.base_limit * (1 - risk_score))

        if limit_data["count"] >= adjusted_limit:
            return False

        limit_data["count"] += 1
        return True
Why it matters: In month one, I blocked 2,847 abuse attempts. Without rate limiting, that's $500+ in wasted API calls.
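For context, here's a minimal sketch of how `check_limit` could sit in front of the agent. The `compute_risk_score` heuristic and the response payloads are illustrative placeholders, not code from my stack:

```python
limiter = AdaptiveRateLimiter(base_limit=100)

def compute_risk_score(user_id: str) -> float:
    # Placeholder heuristic: treat anonymous users as riskier
    return 0.5 if user_id.startswith("anon-") else 0.1

def handle_request(user_id: str, message: str) -> dict:
    risk_score = compute_risk_score(user_id)
    if not limiter.check_limit(user_id, risk_score):
        # The 429 branch from the diagram above
        return {"status": 429, "body": "Too many requests. Slow down."}
    # From here the message goes into the queue / agent pipeline
    return {"status": 200, "body": f"queued: {message}"}
```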
2. LLMs Hallucinate - Always Validate Output
This one hurt. A user asked for their account balance. The AI agent confidently responded: "Your balance is $127,549.32"
Actual balance? $47.15
The fix:
import re
from typing import Optional

class ResponseValidator:
    def __init__(self):
        # Patterns that should NEVER appear in responses
        self.forbidden_patterns = [
            r'\$[\d,]+\.\d{2}',                            # Dollar amounts
            r'\b\d{3}-\d{2}-\d{4}\b',                      # SSN
            r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',  # Emails
        ]

    def validate(self, response: str, user_context: dict) -> Optional[str]:
        """Validate AI response against business rules"""
        # Check for hallucinated data
        for pattern in self.forbidden_patterns:
            if re.search(pattern, response, re.IGNORECASE):
                return None  # Reject response

        # Verify facts against database
        if "balance" in response.lower():
            claimed_balance = self.extract_balance(response)
            actual_balance = user_context.get("balance")
            if claimed_balance is not None and actual_balance is not None \
                    and abs(claimed_balance - actual_balance) > 0.01:
                return None  # Hallucination detected

        return response
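`extract_balance` isn't shown above. A minimal sketch of that helper, as a method on `ResponseValidator`, assuming US-style dollar formatting:

```python
def extract_balance(self, response: str) -> Optional[float]:
    """Pull the first dollar amount out of the response, if any."""
    match = re.search(r'\$([\d,]+\.\d{2})', response)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))
```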
Result: Zero incidents of hallucinated financial data in production.
3. Context Window Management Is an Art
Here's what nobody tells you: managing conversation context at scale is harder than building the agent itself.
from collections import deque
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str
    tokens: int
    importance: float  # 0-1 score

class SmartContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()

    def add_message(self, message: Message):
        self.messages.append(message)
        self._trim_context()

    def _trim_context(self):
        """Keep most important messages within token limit"""
        total_tokens = sum(m.tokens for m in self.messages)
        if total_tokens <= self.max_tokens:
            return

        # Sort by importance, keep system prompts
        sorted_msgs = sorted(
            [m for m in self.messages if m.role != "system"],
            key=lambda x: x.importance
        )

        # Remove least important until we fit
        while total_tokens > self.max_tokens and sorted_msgs:
            removed = sorted_msgs.pop(0)
            self.messages.remove(removed)
            total_tokens -= removed.tokens
This saved me ~$1,200/month in API costs by intelligently pruning conversation history.
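For completeness, a usage sketch. The token counts and importance scores here are made up; a rough characters-divided-by-four estimate is good enough for trimming decisions even without a real tokenizer:

```python
def rough_token_count(text: str) -> int:
    # Crude approximation: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

context = SmartContextManager(max_tokens=4000)
context.add_message(Message("system", "You are a support agent.", tokens=8, importance=1.0))

question = "What's your refund policy?"
context.add_message(Message("user", question, rough_token_count(question), importance=0.8))
```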
4. Monitoring Needs to Be Obsessive
Metrics that actually matter:
pie title "What Breaks AI Agents in Production"
"Rate Limit Abuse" : 35
"LLM Timeouts" : 25
"Hallucinations" : 20
"Network Issues" : 15
"Database Locks" : 5
My monitoring stack:
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import logging

@dataclass
class AgentMetrics:
    timestamp: datetime
    response_time_ms: float
    tokens_used: int
    cost_usd: float
    user_satisfaction: float
    error_type: Optional[str]

    def log(self):
        logging.info(
            "agent_response",
            extra={
                "duration_ms": self.response_time_ms,
                "tokens": self.tokens_used,
                "cost": self.cost_usd,
                "satisfaction": self.user_satisfaction,
                "error": self.error_type
            }
        )

class AgentMonitor:
    def __init__(self):
        self.metrics = []
        self.alerts = {
            "high_latency": 2000,     # ms
            "low_satisfaction": 0.6,  # 0-1
            "error_rate": 0.05        # 5%
        }

    async def track_request(self, request_fn):
        start = datetime.now()
        error = None
        result = None
        satisfaction = 0.0
        try:
            result = await request_fn()
            satisfaction = self.calculate_satisfaction(result)
        except Exception as e:
            error = str(e)
            raise
        finally:
            duration = (datetime.now() - start).total_seconds() * 1000
            metric = AgentMetrics(
                timestamp=datetime.now(),
                response_time_ms=duration,
                tokens_used=getattr(result, 'tokens', 0),
                cost_usd=self.calculate_cost(result) if result else 0.0,
                user_satisfaction=satisfaction if error is None else 0,
                error_type=error
            )
            metric.log()
            self.check_alerts(metric)
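`check_alerts` is referenced above but not shown. Here's one way it could look, using the thresholds from `__init__` and a hypothetical `send_alert` hook (Slack, PagerDuty, whatever you already page on):

```python
def check_alerts(self, metric: AgentMetrics):
    """Compare one request's metrics against the configured thresholds."""
    self.metrics.append(metric)

    if metric.response_time_ms > self.alerts["high_latency"]:
        self.send_alert(f"High latency: {metric.response_time_ms:.0f}ms")

    if metric.error_type is None and metric.user_satisfaction < self.alerts["low_satisfaction"]:
        self.send_alert(f"Low satisfaction score: {metric.user_satisfaction:.2f}")

    # Rolling error rate over the last 100 requests
    recent = self.metrics[-100:]
    error_rate = sum(1 for m in recent if m.error_type) / len(recent)
    if error_rate > self.alerts["error_rate"]:
        self.send_alert(f"Error rate at {error_rate:.1%}")
```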
5. Fallbacks Save Your Reputation
The moment of truth: Your AI provider goes down at 2 AM. What happens?
Bad approach:
# Hope and pray
response = openai.ChatCompletion.create(...)
Production approach:
from typing import List, Callable
import asyncio
import logging

class AIAgentWithFallbacks:
    def __init__(self):
        self.providers = [
            self.primary_ai,    # OpenAI GPT-4
            self.secondary_ai,  # Anthropic Claude
            self.rule_based,    # Template responses
            self.human_handoff  # Last resort
        ]

    async def get_response(self, message: str, max_retries: int = 3) -> str:
        """Try providers in order until success"""
        for provider in self.providers:
            for attempt in range(max_retries):
                try:
                    response = await provider(message)
                    if self.is_valid_response(response):
                        return response
                except Exception as e:
                    logging.warning(f"{provider.__name__} failed: {e}")
                    await asyncio.sleep(2 ** attempt)  # Exponential backoff
                    continue

        # All providers failed
        return "I apologize, but I'm having technical difficulties. A human agent will assist you shortly."
Stats from production:
- Primary provider uptime: 99.2%
- Fallback triggers: 124 times/month
- User complaints about downtime: 0
The Architecture That Actually Works
After three months of iteration, here's the stack:
graph LR
A[User] --> B[Load Balancer]
B --> C[API Gateway]
C --> D{Rate Limiter}
D --> E[Message Queue]
E --> F[Agent Pool]
F --> G[Primary AI]
F --> H[Fallback AI]
F --> I[Rules Engine]
G --> J[Validator]
H --> J
I --> J
J --> K[Response Cache]
K --> A
L[Monitor] -.-> F
L -.-> G
L -.-> H
M[Database] -.-> F
Key components:
- Load balancer - Distributes traffic
- Rate limiter - Protects against abuse
- Message queue - Handles spikes
- Agent pool - Scales horizontally
- Validator - Catches hallucinations
- Cache - Cuts costs by 40% (see the sketch after this list)
- Monitor - Real-time alerts
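The cache is the cheapest win on that list. A minimal sketch of the idea, keying on a hash of the normalized message with a short TTL; the exact-match strategy and the numbers are illustrative, not the production implementation:

```python
import hashlib
import time

class ResponseCache:
    """Exact-match cache keyed on a hash of the normalized prompt."""

    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (expires_at, response)

    def _key(self, message: str) -> str:
        return hashlib.sha256(message.strip().lower().encode()).hexdigest()

    def get(self, message: str):
        entry = self.entries.get(self._key(message))
        if entry and entry[0] > time.time():
            return entry[1]
        return None

    def set(self, message: str, response: str):
        self.entries[self._key(message)] = (time.time() + self.ttl, response)
```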
Real Numbers After 3 Months
| Metric | Value |
|---|---|
| Total messages | 52,847 |
| Avg response time | 847ms |
| Uptime | 99.97% |
| Cost per message | $0.034 |
| User satisfaction | 4.6/5.0 |
| Hallucinations caught | 38 |
| Abuse attempts blocked | 2,847 |
| Fallback activations | 124 |
What I'd Do Differently
If I started over today:
- ✅ Start with rate limiting - Day 1, not day 30
- ✅ Build monitoring first - You can't fix what you can't see
- ✅ Plan for hallucinations - They WILL happen
- ✅ Design fallbacks early - Don't wait for an outage
- ✅ Cache aggressively - 40% cost reduction, zero effort
What worked perfectly:
- SQLite for conversation history (yes, SQLite in production; see the schema sketch after this list)
- Bun for API server (3x faster than Node)
- Simple rule-based fallbacks (saved my reputation twice)
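People push back hardest on the SQLite choice, so here's roughly the shape of it. This is a simplified sketch (one table, one index), not the production schema:

```python
import sqlite3

conn = sqlite3.connect("conversations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id    TEXT NOT NULL,
        role       TEXT NOT NULL,        -- 'system' | 'user' | 'assistant'
        content    TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_messages_user ON messages(user_id, created_at)"
)

def recent_history(user_id: str, limit: int = 20):
    rows = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE user_id = ? ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return list(reversed(rows))  # oldest first, ready for the context manager
```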
The Code (Open Source)
Want to see the actual implementation? I open-sourced the core components:
GitHub: github.com/Richardmsbr/atlas-ai-chat
Includes:
- Rate limiter with adaptive limits
- Response validator
- Context manager
- Fallback system
- Monitoring stack
Questions for You
I'm curious about your experience:
- What's your biggest AI agent production challenge?
- Have you dealt with hallucinations? How did you handle it?
- What's your monitoring strategy?
Drop your answers below - I respond to every comment.
Want the full deep dive? I wrote a complete Portuguese version with more code examples on my blog: blog.sakaguchi.ia.br/blog/ai-agents-producao-realidade
Connect with me:
- GitHub: @Richardmsbr
- Building AI agents at scale
- Solutions Architect focusing on production AI systems
Images: Unsplash