Richard Sakaguchi

Originally published at blog.sakaguchi.ia.br

I Built Production AI Agents That Handle 50K Messages/Month - Here's What the Tutorials Won't Tell You

AI Agent Dashboard

Three months ago, I deployed an AI agent to production. Today, it handles 50,000+ messages monthly with zero downtime. But here's the thing - none of the tutorials prepared me for what actually happened.

Everyone shows you the shiny "hello world" chatbot. Nobody shows you what happens when real users spam your API at 3 AM, or when your LLM decides to hallucinate customer data.

This is that story.

The Promise vs. The Reality

What tutorials show you:

# The "perfect" AI agent
agent = AIAgent(model="gpt-4")
response = agent.chat("Hello!")
print(response)  # Magic! ✨

What production looks like:

System Architecture

graph TB
    A[User Message] --> B{Rate Limiter}
    B -->|Allowed| C[Queue System]
    B -->|Blocked| D[429 Response]
    C --> E{Health Check}
    E -->|Healthy| F[AI Agent]
    E -->|Degraded| G[Fallback Handler]
    F --> H{Response Validator}
    H -->|Valid| I[User]
    H -->|Hallucination| J[Retry Logic]
    G --> I
    J --> F

Notice the difference? Production AI agents need six layers of protection that tutorials never mention.

The Five Hard Truths About Production AI Agents

1. Rate Limiting Isn't Optional - It's Survival

The tutorial way:

# YOLO approach
while True:
    message = get_message()
    response = ai_agent.process(message)

The production way:

from collections import defaultdict
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_limit=100):
        self.limits = defaultdict(lambda: {"count": 0, "reset": datetime.now()})
        self.base_limit = base_limit

    def check_limit(self, user_id: str, risk_score: float) -> bool:
        """Adaptive rate limiting based on user behavior"""
        limit_data = self.limits[user_id]

        # Reset window
        if datetime.now() > limit_data["reset"]:
            limit_data["count"] = 0
            limit_data["reset"] = datetime.now() + timedelta(hours=1)

        # Adjust limit based on risk
        adjusted_limit = int(self.base_limit * (1 - risk_score))

        if limit_data["count"] >= adjusted_limit:
            return False

        limit_data["count"] += 1
        return True

Rate Limiting Dashboard

Why it matters: In month one, I blocked 2,847 abuse attempts. Without rate limiting, that's $500+ in wasted API calls.
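
To make the flow concrete, here's roughly how the limiter sits in front of the agent. compute_risk_score and ai_agent below are placeholders for your own scoring logic and agent client, not code from the repo:

limiter = AdaptiveRateLimiter(base_limit=100)

def compute_risk_score(user_id: str) -> float:
    """Placeholder scorer: 0.0 for trusted users, up to 1.0 for suspected abusers."""
    return 0.0

def handle_message(user_id: str, message: str, ai_agent) -> dict:
    risk_score = compute_risk_score(user_id)

    if not limiter.check_limit(user_id, risk_score):
        # Matches the 429 branch in the architecture diagram above
        return {"status": 429, "body": "Too many requests - slow down."}

    return {"status": 200, "body": ai_agent.process(message)}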

2. LLMs Hallucinate - Always Validate Output

This one hurt. A user asked for their account balance. The AI agent confidently responded: "Your balance is $127,549.32"

Actual balance? $47.15

The fix:

import re
from typing import Optional

class ResponseValidator:
    def __init__(self):
        # Patterns that should NEVER appear in responses
        self.forbidden_patterns = [
            r'\$[\d,]+\.\d{2}',  # Dollar amounts
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',  # Emails
        ]

    def validate(self, response: str, user_context: dict) -> Optional[str]:
        """Validate AI response against business rules"""

        # Check for hallucinated data
        for pattern in self.forbidden_patterns:
            if re.search(pattern, response, re.IGNORECASE):
                return None  # Reject response

        # Verify facts against database
        if "balance" in response.lower():
            claimed_balance = self.extract_balance(response)  # parsing helper, sketched below
            actual_balance = user_context.get("balance")

            if claimed_balance is not None and actual_balance is not None \
                    and abs(claimed_balance - actual_balance) > 0.01:
                return None  # Hallucination detected

        return response
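
The extract_balance helper referenced above is just a regex parse. A minimal version (illustrative, not the exact production code) looks like this:

    def extract_balance(self, response: str) -> Optional[float]:
        """Pull the first dollar amount out of a response, if any."""
        match = re.search(r'\$([\d,]+\.\d{2})', response)
        if not match:
            return None
        return float(match.group(1).replace(",", ""))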

Result: Zero incidents of hallucinated financial data in production.

3. Context Window Management Is an Art

Context Window Visualization

Here's what nobody tells you: managing conversation context at scale is harder than building the agent itself.

from collections import deque
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str
    tokens: int
    importance: float  # 0-1 score

class SmartContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()

    def add_message(self, message: Message):
        self.messages.append(message)
        self._trim_context()

    def _trim_context(self):
        """Keep most important messages within token limit"""
        total_tokens = sum(m.tokens for m in self.messages)

        if total_tokens <= self.max_tokens:
            return

        # Sort by importance, keep system prompts
        sorted_msgs = sorted(
            [m for m in self.messages if m.role != "system"],
            key=lambda x: x.importance
        )

        # Remove least important until we fit
        while total_tokens > self.max_tokens and sorted_msgs:
            removed = sorted_msgs.pop(0)
            self.messages.remove(removed)
            total_tokens -= removed.tokens

This saved me ~$1,200/month in API costs by intelligently pruning conversation history.
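
The token counts and importance scores have to come from somewhere. Here's a rough sketch of how messages get fed in - the tiktoken counter and the scoring heuristic are illustrative, so tune them for your own traffic:

import tiktoken

# cl100k_base encoding covers the GPT-4 family
_enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def score_importance(role: str, content: str) -> float:
    """Crude heuristic: system prompts always stay, questions beat small talk."""
    if role == "system":
        return 1.0
    if "?" in content:
        return 0.8
    return 0.4

context = SmartContextManager(max_tokens=4000)

text = "What's my current plan?"
context.add_message(Message(
    role="user",
    content=text,
    tokens=count_tokens(text),
    importance=score_importance("user", text),
))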

4. Monitoring Needs to Be Obsessive

Metrics that actually matter:

pie title "What Breaks AI Agents in Production"
    "Rate Limit Abuse" : 35
    "LLM Timeouts" : 25
    "Hallucinations" : 20
    "Network Issues" : 15
    "Database Locks" : 5

My monitoring stack:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import logging

@dataclass
class AgentMetrics:
    timestamp: datetime
    response_time_ms: float
    tokens_used: int
    cost_usd: float
    user_satisfaction: float
    error_type: Optional[str]

    def log(self):
        logging.info(
            "agent_response",
            extra={
                "duration_ms": self.response_time_ms,
                "tokens": self.tokens_used,
                "cost": self.cost_usd,
                "satisfaction": self.user_satisfaction,
                "error": self.error_type
            }
        )

class AgentMonitor:
    def __init__(self):
        self.metrics = []
        self.alerts = {
            "high_latency": 2000,  # ms
            "low_satisfaction": 0.6,  # 0-1
            "error_rate": 0.05  # 5%
        }

    async def track_request(self, request_fn):
        start = datetime.now()
        error = None
        result = None
        satisfaction = 0.0  # default when the request fails before scoring

        try:
            result = await request_fn()
            satisfaction = self.calculate_satisfaction(result)
        except Exception as e:
            error = str(e)
            raise
        finally:
            duration = (datetime.now() - start).total_seconds() * 1000

            metric = AgentMetrics(
                timestamp=datetime.now(),
                response_time_ms=duration,
                tokens_used=getattr(result, 'tokens', 0),
                cost_usd=self.calculate_cost(result) if result is not None else 0.0,
                user_satisfaction=satisfaction if error is None else 0,
                error_type=error
            )

            metric.log()
            self.check_alerts(metric)
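
The check_alerts method isn't shown above. A minimal per-request version looks like this - the 5% error-rate alert needs a rolling window and is omitted here:

    def check_alerts(self, metric: AgentMetrics):
        """Fire a warning whenever a single request crosses a threshold."""
        if metric.response_time_ms > self.alerts["high_latency"]:
            logging.warning("High latency: %.0f ms", metric.response_time_ms)
        if metric.user_satisfaction < self.alerts["low_satisfaction"]:
            logging.warning("Low satisfaction score: %.2f", metric.user_satisfaction)
        if metric.error_type is not None:
            logging.warning("Agent error: %s", metric.error_type)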

Monitoring Dashboard

5. Fallbacks Save Your Reputation

The moment of truth: Your AI provider goes down at 2 AM. What happens?

Bad approach:

# Hope and pray
response = openai.ChatCompletion.create(...)

Production approach:

from typing import List, Callable
import asyncio
import logging

class AIAgentWithFallbacks:
    def __init__(self):
        self.providers = [
            self.primary_ai,      # OpenAI GPT-4
            self.secondary_ai,    # Anthropic Claude
            self.rule_based,      # Template responses
            self.human_handoff    # Last resort
        ]

    async def get_response(self, message: str, max_retries: int = 3) -> str:
        """Try providers in order until success"""

        for provider in self.providers:
            for attempt in range(max_retries):
                try:
                    response = await provider(message)
                    if self.is_valid_response(response):
                        return response
                except Exception as e:
                    logging.warning(f"{provider.__name__} failed: {e}")
                    await asyncio.sleep(2 ** attempt)  # Exponential backoff
                    continue

        # All providers failed
        return "I apologize, but I'm having technical difficulties. A human agent will assist you shortly."
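
The rule_based provider is deliberately boring: a handful of templates for intents that don't need an LLM at all. A sketch (the templates here are illustrative):

    async def rule_based(self, message: str) -> str:
        """Template answers for the few intents we can handle without an LLM."""
        text = message.lower()
        if "hours" in text or "open" in text:
            return "Our support team is available 24/7 through this chat."
        if "human" in text or "agent" in text:
            return "I'm connecting you with a human agent now."
        # Raising lets get_response fall through to the next provider
        raise ValueError("No matching template")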

Stats from production:

  • Primary provider uptime: 99.2%
  • Fallback triggers: 124 times/month
  • User complaints about downtime: 0

The Architecture That Actually Works

After three months of iteration, here's the stack:

graph LR
    A[User] --> B[Load Balancer]
    B --> C[API Gateway]
    C --> D{Rate Limiter}
    D --> E[Message Queue]
    E --> F[Agent Pool]
    F --> G[Primary AI]
    F --> H[Fallback AI]
    F --> I[Rules Engine]
    G --> J[Validator]
    H --> J
    I --> J
    J --> K[Response Cache]
    K --> A

    L[Monitor] -.-> F
    L -.-> G
    L -.-> H
    M[Database] -.-> F

Production Architecture

Key components:

  1. Load balancer - Distributes traffic
  2. Rate limiter - Protects against abuse
  3. Message queue - Handles spikes
  4. Agent pool - Scales horizontally
  5. Validator - Catches hallucinations
  6. Cache - Reduces costs 40% (see the sketch after this list)
  7. Monitor - Real-time alerts
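
The cache is the piece with the best effort-to-savings ratio. Here's the idea in its simplest form - production uses a persistent store, but the normalize-then-TTL pattern is the same (the details below are an illustrative sketch):

import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (response, expires_at)

    def _key(self, message: str) -> str:
        # Normalize so trivially different phrasings of the same FAQ collide
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, message: str):
        entry = self.store.get(self._key(message))
        if entry and entry[1] > time.time():
            return entry[0]
        return None

    def set(self, message: str, response: str):
        self.store[self._key(message)] = (response, time.time() + self.ttl)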

Real Numbers After 3 Months

| Metric | Value |
| ------ | ----- |
| Total messages | 52,847 |
| Avg response time | 847 ms |
| Uptime | 99.97% |
| Cost per message | $0.034 |
| User satisfaction | 4.6/5.0 |
| Hallucinations caught | 38 |
| Abuse attempts blocked | 2,847 |
| Fallback activations | 124 |

Results Dashboard

What I'd Do Differently

If I started over today:

  1. Start with rate limiting - Day 1, not day 30
  2. Build monitoring first - You can't fix what you can't see
  3. Plan for hallucinations - They WILL happen
  4. Design fallbacks early - Don't wait for an outage
  5. Cache aggressively - 40% cost reduction, zero effort

What worked perfectly:

  • SQLite for conversation history (yes, SQLite in production) - minimal schema sketched below
  • Bun for API server (3x faster than Node)
  • Simple rule-based fallbacks (saved my reputation twice)
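
The conversation-history table is nothing fancy; something along these lines (the column names here are illustrative):

import sqlite3

conn = sqlite3.connect("conversations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT NOT NULL,
        role TEXT NOT NULL,            -- 'system' | 'user' | 'assistant'
        content TEXT NOT NULL,
        tokens INTEGER NOT NULL DEFAULT 0,
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_messages_conversation ON messages (conversation_id)")
conn.commit()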

The Code (Open Source)

Want to see the actual implementation? I open-sourced the core components:

GitHub: github.com/Richardmsbr/atlas-ai-chat

Includes:

  • Rate limiter with adaptive limits
  • Response validator
  • Context manager
  • Fallback system
  • Monitoring stack

Questions for You

I'm curious about your experience:

  1. What's your biggest AI agent production challenge?
  2. Have you dealt with hallucinations? How did you handle it?
  3. What's your monitoring strategy?

Drop your answers below - I respond to every comment.


Want the full deep dive? I wrote a complete Portuguese version with more code examples on my blog: blog.sakaguchi.ia.br/blog/ai-agents-producao-realidade

Connect with me:

  • GitHub: @Richardmsbr
  • Building AI agents at scale
  • Solutions Architect focusing on production AI systems

Images: Unsplash
