Richard Sakaguchi

Originally published at blog.sakaguchi.ia.br

I Built Production AI Agents That Handle 50K Messages/Month - Here's What the Tutorials Won't Tell You

AI Agent Dashboard

Three months ago, I deployed an AI agent to production. Today, it handles 50,000+ messages monthly with zero downtime. But here's the thing - none of the tutorials prepared me for what actually happened.

Everyone shows you the shiny "hello world" chatbot. Nobody shows you what happens when real users spam your API at 3 AM, or when your LLM decides to hallucinate customer data.

This is that story.

The Promise vs. The Reality

What tutorials show you:

# The "perfect" AI agent
agent = AIAgent(model="gpt-4")
response = agent.chat("Hello!")
print(response)  # Magic! ✨

What production looks like:

System Architecture

graph TB
    A[User Message] --> B{Rate Limiter}
    B -->|Allowed| C[Queue System]
    B -->|Blocked| D[429 Response]
    C --> E{Health Check}
    E -->|Healthy| F[AI Agent]
    E -->|Degraded| G[Fallback Handler]
    F --> H{Response Validator}
    H -->|Valid| I[User]
    H -->|Hallucination| J[Retry Logic]
    G --> I
    J --> F

Notice the difference? Production AI agents need six layers of protection that tutorials never mention.

The Five Hard Truths About Production AI Agents

1. Rate Limiting Isn't Optional - It's Survival

The tutorial way:

# YOLO approach
while True:
    message = get_message()
    response = ai_agent.process(message)

The production way:

from collections import defaultdict
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_limit=100):
        self.limits = defaultdict(lambda: {"count": 0, "reset": datetime.now()})
        self.base_limit = base_limit

    def check_limit(self, user_id: str, risk_score: float) -> bool:
        """Adaptive rate limiting based on user behavior"""
        limit_data = self.limits[user_id]

        # Reset window
        if datetime.now() > limit_data["reset"]:
            limit_data["count"] = 0
            limit_data["reset"] = datetime.now() + timedelta(hours=1)

        # Adjust limit based on risk
        adjusted_limit = int(self.base_limit * (1 - risk_score))

        if limit_data["count"] >= adjusted_limit:
            return False

        limit_data["count"] += 1
        return True

Rate Limiting Dashboard

Why it matters: In month one, I blocked 2,847 abuse attempts. Without rate limiting, that's $500+ in wasted API calls.
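
To make the flow concrete, here's roughly how the limiter sits in front of the agent. compute_risk_score and ai_agent below are placeholders for your own scoring logic and agent client, not code from the repo:

limiter = AdaptiveRateLimiter(base_limit=100)

def compute_risk_score(user_id: str) -> float:
    """Placeholder scorer: 0.0 for trusted users, up to 1.0 for suspected abusers."""
    return 0.0

def handle_message(user_id: str, message: str, ai_agent) -> dict:
    risk_score = compute_risk_score(user_id)

    if not limiter.check_limit(user_id, risk_score):
        # Matches the 429 branch in the architecture diagram above
        return {"status": 429, "body": "Too many requests - slow down."}

    return {"status": 200, "body": ai_agent.process(message)}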

2. LLMs Hallucinate - Always Validate Output

This one hurt. A user asked for their account balance. The AI agent confidently responded: "Your balance is $127,549.32"

Actual balance? $47.15

The fix:

import re
from typing import Optional

class ResponseValidator:
    def __init__(self):
        # Patterns that should NEVER appear in responses
        self.forbidden_patterns = [
            r'\$[\d,]+\.\d{2}',  # Dollar amounts
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',  # Emails
        ]

    def validate(self, response: str, user_context: dict) -> Optional[str]:
        """Validate AI response against business rules"""

        # Check for hallucinated data
        for pattern in self.forbidden_patterns:
            if re.search(pattern, response, re.IGNORECASE):
                return None  # Reject response

        # Verify facts against database
        if "balance" in response.lower():
            claimed_balance = self.extract_balance(response)  # parsing helper, sketched below
            actual_balance = user_context.get("balance")

            if claimed_balance is not None and actual_balance is not None \
                    and abs(claimed_balance - actual_balance) > 0.01:
                return None  # Hallucination detected

        return response
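
The extract_balance helper referenced above is just a regex parse. A minimal version (illustrative, not the exact production code) looks like this:

    def extract_balance(self, response: str) -> Optional[float]:
        """Pull the first dollar amount out of a response, if any."""
        match = re.search(r'\$([\d,]+\.\d{2})', response)
        if not match:
            return None
        return float(match.group(1).replace(",", ""))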

Result: Zero incidents of hallucinated financial data in production.

3. Context Window Management Is an Art

Context Window Visualization

Here's what nobody tells you: managing conversation context at scale is harder than building the agent itself.

from collections import deque
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str
    tokens: int
    importance: float  # 0-1 score

class SmartContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()

    def add_message(self, message: Message):
        self.messages.append(message)
        self._trim_context()

    def _trim_context(self):
        """Keep most important messages within token limit"""
        total_tokens = sum(m.tokens for m in self.messages)

        if total_tokens <= self.max_tokens:
            return

        # Sort by importance, keep system prompts
        sorted_msgs = sorted(
            [m for m in self.messages if m.role != "system"],
            key=lambda x: x.importance
        )

        # Remove least important until we fit
        while total_tokens > self.max_tokens and sorted_msgs:
            removed = sorted_msgs.pop(0)
            self.messages.remove(removed)
            total_tokens -= removed.tokens

This saved me ~$1,200/month in API costs by intelligently pruning conversation history.
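
The token counts and importance scores have to come from somewhere. Here's a rough sketch of how messages get fed in - the tiktoken counter and the scoring heuristic are illustrative, so tune them for your own traffic:

import tiktoken

# cl100k_base encoding covers the GPT-4 family
_enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def score_importance(role: str, content: str) -> float:
    """Crude heuristic: system prompts always stay, questions beat small talk."""
    if role == "system":
        return 1.0
    if "?" in content:
        return 0.8
    return 0.4

context = SmartContextManager(max_tokens=4000)

text = "What's my current plan?"
context.add_message(Message(
    role="user",
    content=text,
    tokens=count_tokens(text),
    importance=score_importance("user", text),
))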

4. Monitoring Needs to Be Obsessive

Metrics that actually matter:

pie title "What Breaks AI Agents in Production"
    "Rate Limit Abuse" : 35
    "LLM Timeouts" : 25
    "Hallucinations" : 20
    "Network Issues" : 15
    "Database Locks" : 5

My monitoring stack:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import logging

@dataclass
class AgentMetrics:
    timestamp: datetime
    response_time_ms: float
    tokens_used: int
    cost_usd: float
    user_satisfaction: float
    error_type: Optional[str]

    def log(self):
        logging.info(
            "agent_response",
            extra={
                "duration_ms": self.response_time_ms,
                "tokens": self.tokens_used,
                "cost": self.cost_usd,
                "satisfaction": self.user_satisfaction,
                "error": self.error_type
            }
        )

class AgentMonitor:
    def __init__(self):
        self.metrics = []
        self.alerts = {
            "high_latency": 2000,  # ms
            "low_satisfaction": 0.6,  # 0-1
            "error_rate": 0.05  # 5%
        }

    async def track_request(self, request_fn):
        start = datetime.now()
        error = None
        result = None
        satisfaction = 0.0  # default when the request fails before scoring

        try:
            result = await request_fn()
            satisfaction = self.calculate_satisfaction(result)
        except Exception as e:
            error = str(e)
            raise
        finally:
            duration = (datetime.now() - start).total_seconds() * 1000

            metric = AgentMetrics(
                timestamp=datetime.now(),
                response_time_ms=duration,
                tokens_used=getattr(result, 'tokens', 0),
                cost_usd=self.calculate_cost(result) if result is not None else 0.0,
                user_satisfaction=satisfaction if error is None else 0,
                error_type=error
            )

            metric.log()
            self.check_alerts(metric)
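
The check_alerts method isn't shown above. A minimal per-request version looks like this - the 5% error-rate alert needs a rolling window and is omitted here:

    def check_alerts(self, metric: AgentMetrics):
        """Fire a warning whenever a single request crosses a threshold."""
        if metric.response_time_ms > self.alerts["high_latency"]:
            logging.warning("High latency: %.0f ms", metric.response_time_ms)
        if metric.user_satisfaction < self.alerts["low_satisfaction"]:
            logging.warning("Low satisfaction score: %.2f", metric.user_satisfaction)
        if metric.error_type is not None:
            logging.warning("Agent error: %s", metric.error_type)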

Monitoring Dashboard

5. Fallbacks Save Your Reputation

The moment of truth: Your AI provider goes down at 2 AM. What happens?

Bad approach:

# Hope and pray
response = openai.ChatCompletion.create(...)

Production approach:

from typing import List, Callable
import asyncio
import logging

class AIAgentWithFallbacks:
    def __init__(self):
        self.providers = [
            self.primary_ai,      # OpenAI GPT-4
            self.secondary_ai,    # Anthropic Claude
            self.rule_based,      # Template responses
            self.human_handoff    # Last resort
        ]

    async def get_response(self, message: str, max_retries: int = 3) -> str:
        """Try providers in order until success"""

        for provider in self.providers:
            for attempt in range(max_retries):
                try:
                    response = await provider(message)
                    if self.is_valid_response(response):
                        return response
                except Exception as e:
                    logging.warning(f"{provider.__name__} failed: {e}")
                    await asyncio.sleep(2 ** attempt)  # Exponential backoff
                    continue

        # All providers failed
        return "I apologize, but I'm having technical difficulties. A human agent will assist you shortly."
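
The rule_based provider is deliberately boring: a handful of templates for intents that don't need an LLM at all. A sketch (the templates here are illustrative):

    async def rule_based(self, message: str) -> str:
        """Template answers for the few intents we can handle without an LLM."""
        text = message.lower()
        if "hours" in text or "open" in text:
            return "Our support team is available 24/7 through this chat."
        if "human" in text or "agent" in text:
            return "I'm connecting you with a human agent now."
        # Raising lets get_response fall through to the next provider
        raise ValueError("No matching template")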

Stats from production:

  • Primary provider uptime: 99.2%
  • Fallback triggers: 124 times/month
  • User complaints about downtime: 0

The Architecture That Actually Works

After three months of iteration, here's the stack:

graph LR
    A[User] --> B[Load Balancer]
    B --> C[API Gateway]
    C --> D{Rate Limiter}
    D --> E[Message Queue]
    E --> F[Agent Pool]
    F --> G[Primary AI]
    F --> H[Fallback AI]
    F --> I[Rules Engine]
    G --> J[Validator]
    H --> J
    I --> J
    J --> K[Response Cache]
    K --> A

    L[Monitor] -.-> F
    L -.-> G
    L -.-> H
    M[Database] -.-> F

Production Architecture

Key components:

  1. Load balancer - Distributes traffic
  2. Rate limiter - Protects against abuse
  3. Message queue - Handles spikes
  4. Agent pool - Scales horizontally
  5. Validator - Catches hallucinations
  6. Cache - Reduces costs 40% (see the sketch after this list)
  7. Monitor - Real-time alerts
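
The cache is the piece with the best effort-to-savings ratio. Here's the idea in its simplest form - production uses a persistent store, but the normalize-then-TTL pattern is the same (the details below are an illustrative sketch):

import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (response, expires_at)

    def _key(self, message: str) -> str:
        # Normalize so trivially different phrasings of the same FAQ collide
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, message: str):
        entry = self.store.get(self._key(message))
        if entry and entry[1] > time.time():
            return entry[0]
        return None

    def set(self, message: str, response: str):
        self.store[self._key(message)] = (response, time.time() + self.ttl)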

Real Numbers After 3 Months

| Metric | Value |
| ------ | ----- |
| Total messages | 52,847 |
| Avg response time | 847 ms |
| Uptime | 99.97% |
| Cost per message | $0.034 |
| User satisfaction | 4.6/5.0 |
| Hallucinations caught | 38 |
| Abuse attempts blocked | 2,847 |
| Fallback activations | 124 |

Results Dashboard

What I'd Do Differently

If I started over today:

  1. Start with rate limiting - Day 1, not day 30
  2. Build monitoring first - You can't fix what you can't see
  3. Plan for hallucinations - They WILL happen
  4. Design fallbacks early - Don't wait for an outage
  5. Cache aggressively - 40% cost reduction, zero effort

What worked perfectly:

  • SQLite for conversation history (yes, SQLite in production) - minimal schema sketched below
  • Bun for API server (3x faster than Node)
  • Simple rule-based fallbacks (saved my reputation twice)
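
The conversation-history table is nothing fancy; something along these lines (the column names here are illustrative):

import sqlite3

conn = sqlite3.connect("conversations.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation_id TEXT NOT NULL,
        role TEXT NOT NULL,            -- 'system' | 'user' | 'assistant'
        content TEXT NOT NULL,
        tokens INTEGER NOT NULL DEFAULT 0,
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_messages_conversation ON messages (conversation_id)")
conn.commit()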

The Code (Open Source)

Want to see the actual implementation? I open-sourced the core components:

GitHub: github.com/Richardmsbr/atlas-ai-chat

Includes:

  • Rate limiter with adaptive limits
  • Response validator
  • Context manager
  • Fallback system
  • Monitoring stack

Questions for You

I'm curious about your experience:

  1. What's your biggest AI agent production challenge?
  2. Have you dealt with hallucinations? How did you handle it?
  3. What's your monitoring strategy?

Drop your answers below - I respond to every comment.


Want the full deep dive? I wrote a complete Portuguese version with more code examples on my blog: blog.sakaguchi.ia.br/blog/ai-agents-producao-realidade

Connect with me:

  • GitHub: @Richardmsbr
  • Building AI agents at scale
  • Solutions Architect focusing on production AI systems

Images: Unsplash
