DEV Community: Richard Sakaguchi

I Built Production AI Agents That Handle 50K Messages/Month - Here's What the Tutorials Won't Tell You

Richard Sakaguchi — Mon, 15 Dec 2025 21:54:46 +0000

Three months ago, I deployed an AI agent to production. Today, it handles 50,000+ messages monthly with zero downtime. But here's the thing - none of the tutorials prepared me for what actually happened.

Everyone shows you the shiny "hello world" chatbot. Nobody shows you what happens when real users spam your API at 3 AM, or when your LLM decides to hallucinate customer data.

This is that story.

The Promise vs. The Reality

What tutorials show you:

# The "perfect" AI agent
agent = AIAgent(model="gpt-4")
response = agent.chat("Hello!")
print(response)  # Magic! ✨

What production looks like:

graph TB
    A[User Message] --> B{Rate Limiter}
    B -->|Allowed| C[Queue System]
    B -->|Blocked| D[429 Response]
    C --> E{Health Check}
    E -->|Healthy| F[AI Agent]
    E -->|Degraded| G[Fallback Handler]
    F --> H{Response Validator}
    H -->|Valid| I[User]
    H -->|Hallucination| J[Retry Logic]
    G --> I
    J --> F

Notice the difference? Production AI agents need six layers of protection that tutorials never mention.

The Five Hard Truths About Production AI Agents

1. Rate Limiting Isn't Optional - It's Survival

The tutorial way:

# YOLO approach
while True:
    message = get_message()
    response = ai_agent.process(message)

The production way:

from collections import defaultdict
from datetime import datetime, timedelta

class AdaptiveRateLimiter:
    def __init__(self, base_limit=100):
        self.limits = defaultdict(lambda: {"count": 0, "reset": datetime.now()})
        self.base_limit = base_limit

    def check_limit(self, user_id: str, risk_score: float) -> bool:
        """Adaptive rate limiting based on user behavior"""
        limit_data = self.limits[user_id]

        # Reset window
        if datetime.now() > limit_data["reset"]:
            limit_data["count"] = 0
            limit_data["reset"] = datetime.now() + timedelta(hours=1)

        # Adjust limit based on risk
        adjusted_limit = int(self.base_limit * (1 - risk_score))

        if limit_data["count"] >= adjusted_limit:
            return False

        limit_data["count"] += 1
        return True

Why it matters: In month one, I blocked 2,847 abuse attempts. Without rate limiting, that's $500+ in wasted API calls.
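
To see how the risk score shrinks the allowance, here is a minimal sketch of just the limit calculation. The `effective_limit` helper and the clamping of the score to the 0-1 range are assumptions added for illustration, not part of the production code. It uses `round` rather than `int` because floating-point error can otherwise truncate a value such as 9.999... down to 9.

```python
def effective_limit(base_limit: int, risk_score: float) -> int:
    """Requests allowed per window for a risk score (0 = trusted, 1 = blocked)."""
    # Clamp so an out-of-range score can never raise the limit or make it negative.
    clamped = min(max(risk_score, 0.0), 1.0)
    return round(base_limit * (1 - clamped))


# Show how the allowance falls as the risk score rises.
for score in (0.0, 0.5, 0.9, 1.0):
    print(score, effective_limit(100, score))
```

A user with a risk score of 1.0 gets a limit of zero, so every request from that user is rejected until the score drops.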

2. LLMs Hallucinate - Always Validate Output

This one hurt. A user asked for their account balance. The AI agent confidently responded: "Your balance is $127,549.32"

Actual balance? $47.15

The fix:

import re
from typing import Optional

class ResponseValidator:
    def __init__(self):
        # Patterns that should NEVER appear in responses
        self.forbidden_patterns = [
            r'\$[\d,]+\.\d{2}',  # Dollar amounts
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',  # Emails
        ]

    def validate(self, response: str, user_context: dict) -> Optional[str]:
        """Validate AI response against business rules"""

        # Check for hallucinated data
        for pattern in self.forbidden_patterns:
            if re.search(pattern, response, re.IGNORECASE):
                return None  # Reject response

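        # Verify facts against database. The dollar-amount pattern above already
        # rejects fully formatted amounts, so this is a second check for what it misses.
        if "balance" in response.lower():
            claimed_balance = self.extract_balance(response)  # helper not shown here
            actual_balance = user_context.get("balance")

            # Compare against None explicitly so a claimed balance of 0 is still checked
            if claimed_balance is not None and (
                actual_balance is None
                or abs(claimed_balance - actual_balance) > 0.01
            ):
                return None  # Hallucination detected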

        return response

Result: Zero incidents of hallucinated financial data in production.
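
The validator calls `self.extract_balance`, which isn't shown. Here is a minimal sketch of what such a helper could look like, assuming balances appear as US-style dollar amounts. The regex and the function body are illustrative, not the production implementation.

```python
import re
from typing import Optional

# Matches amounts such as "$47", "$47.15" or "$127,549.32" (cents are optional).
_AMOUNT = re.compile(r"\$\s?(\d+(?:,\d{3})*)(?:\.(\d{2}))?")


def extract_balance(response: str) -> Optional[float]:
    """Return the first dollar amount found in the text, or None if there is none."""
    match = _AMOUNT.search(response)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # drop thousands separators
    cents = match.group(2) or "00"
    return float(f"{whole}.{cents}")


print(extract_balance("Your balance is $127,549.32"))
print(extract_balance("Thanks for reaching out!"))
```

Returning `None` when nothing is found lets the caller tell "no amount claimed" apart from a claimed balance of zero.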

3. Context Window Management Is an Art

Here's what nobody tells you: managing conversation context at scale is harder than building the agent itself.

from collections import deque
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    role: str
    content: str
    tokens: int
    importance: float  # 0-1 score

class SmartContextManager:
    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = deque()

    def add_message(self, message: Message):
        self.messages.append(message)
        self._trim_context()

    def _trim_context(self):
        """Keep most important messages within token limit"""
        total_tokens = sum(m.tokens for m in self.messages)

        if total_tokens <= self.max_tokens:
            return

        # Sort by importance, keep system prompts
        sorted_msgs = sorted(
            [m for m in self.messages if m.role != "system"],
            key=lambda x: x.importance
        )

        # Remove least important until we fit
        while total_tokens > self.max_tokens and sorted_msgs:
            removed = sorted_msgs.pop(0)
            self.messages.remove(removed)
            total_tokens -= removed.tokens

This saved me ~$1,200/month in API costs by intelligently pruning conversation history.
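
The same pruning idea can be written as a pure function, which makes the behaviour easy to check. This is a sketch, not the production code: `trim_by_importance` and the sample history are made up for the example. It tracks messages by index, so two messages with identical content are never confused, and it keeps the survivors in their original order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str
    content: str
    tokens: int
    importance: float  # 0-1 score


def trim_by_importance(messages: List[Message], max_tokens: int) -> List[Message]:
    """Drop the least important non-system messages until the total fits."""
    total = sum(m.tokens for m in messages)
    # Indices eligible for removal, least important first (ties: oldest first).
    candidates = sorted(
        (i for i, m in enumerate(messages) if m.role != "system"),
        key=lambda i: (messages[i].importance, i),
    )
    dropped = set()
    for i in candidates:
        if total <= max_tokens:
            break
        dropped.add(i)
        total -= messages[i].tokens
    # Keep the original conversation order for whatever survives.
    return [m for i, m in enumerate(messages) if i not in dropped]


history = [
    Message("system", "You are a support agent.", 10, 1.0),
    Message("user", "hi", 5, 0.1),
    Message("user", "My order never arrived", 20, 0.9),
    Message("assistant", "Let me check that order.", 15, 0.6),
]
kept = trim_by_importance(history, max_tokens=45)
print([m.content for m in kept])
```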

4. Monitoring Needs to Be Obsessive

Metrics that actually matter:

pie title "What Breaks AI Agents in Production"
    "Rate Limit Abuse" : 35
    "LLM Timeouts" : 25
    "Hallucinations" : 20
    "Network Issues" : 15
    "Database Locks" : 5

My monitoring stack:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import logging

@dataclass
class AgentMetrics:
    timestamp: datetime
    response_time_ms: float
    tokens_used: int
    cost_usd: float
    user_satisfaction: float
    error_type: Optional[str]

    def log(self):
        logging.info(
            "agent_response",
            extra={
                "duration_ms": self.response_time_ms,
                "tokens": self.tokens_used,
                "cost": self.cost_usd,
                "satisfaction": self.user_satisfaction,
                "error": self.error_type
            }
        )

class AgentMonitor:
    def __init__(self):
        self.metrics = []
        self.alerts = {
            "high_latency": 2000,  # ms
            "low_satisfaction": 0.6,  # 0-1
            "error_rate": 0.05  # 5%
        }

    async def track_request(self, request_fn):
        start = datetime.now()
        error = None
        result = None  # stays None if request_fn raises
        satisfaction = 0.0

        try:
            result = await request_fn()
            satisfaction = self.calculate_satisfaction(result)
            return result
        except Exception as e:
            error = str(e)
            raise
        finally:
            duration = (datetime.now() - start).total_seconds() * 1000

            metric = AgentMetrics(
                timestamp=datetime.now(),
                response_time_ms=duration,
                tokens_used=getattr(result, 'tokens', 0),
                # No result means there is no completion to price
                cost_usd=self.calculate_cost(result) if result is not None else 0.0,
                user_satisfaction=satisfaction,
                error_type=error
            )

            metric.log()
            self.check_alerts(metric)
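
The monitor calls `check_alerts`, which isn't shown. Here is a rough sketch of a per-request check against the latency and satisfaction thresholds. The function signature and the alert names it returns are assumptions. The 5% error-rate threshold needs a window of recent requests, so it is left out here.

```python
from typing import Dict, List, Optional

ALERT_THRESHOLDS: Dict[str, float] = {
    "high_latency": 2000,     # ms
    "low_satisfaction": 0.6,  # 0-1
}


def check_alerts(
    response_time_ms: float,
    user_satisfaction: float,
    error_type: Optional[str],
    thresholds: Dict[str, float] = ALERT_THRESHOLDS,
) -> List[str]:
    """Return the names of the alerts that a single request triggers."""
    triggered = []
    if response_time_ms > thresholds["high_latency"]:
        triggered.append("high_latency")
    # Failed requests carry no real satisfaction score, so report them separately.
    if error_type is not None:
        triggered.append("error")
    elif user_satisfaction < thresholds["low_satisfaction"]:
        triggered.append("low_satisfaction")
    return triggered


print(check_alerts(2500, 0.9, None))
print(check_alerts(300, 0.4, None))
```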

5. Fallbacks Save Your Reputation

The moment of truth: Your AI provider goes down at 2 AM. What happens?

Bad approach:

# Hope and pray
response = openai.ChatCompletion.create(...)

Production approach:

import asyncio
import logging

class AIAgentWithFallbacks:
    def __init__(self):
        self.providers = [
            self.primary_ai,      # OpenAI GPT-4
            self.secondary_ai,    # Anthropic Claude
            self.rule_based,      # Template responses
            self.human_handoff    # Last resort
        ]

    async def get_response(self, message: str, max_retries: int = 3) -> str:
        """Try providers in order until success"""

        for provider in self.providers:
            for attempt in range(max_retries):
                try:
                    response = await provider(message)
                    if self.is_valid_response(response):
                        return response
                except Exception as e:
                    logging.warning(f"{provider.__name__} failed: {e}")
                    await asyncio.sleep(2 ** attempt)  # Exponential backoff
                    continue

        # All providers failed
        return "I apologize, but I'm having technical difficulties. A human agent will assist you shortly."
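
To try the fallback chain without real API keys, the providers can be swapped for stand-ins. This sketch strips the idea down to "first provider that doesn't raise wins". `first_successful`, `flaky_primary` and `rule_based` are hypothetical names, and retries and backoff are omitted for brevity.

```python
import asyncio
from typing import Awaitable, Callable, List

Provider = Callable[[str], Awaitable[str]]


async def first_successful(providers: List[Provider], message: str) -> str:
    """Return the reply from the first provider that does not raise."""
    for provider in providers:
        try:
            return await provider(message)
        except Exception as exc:
            print(f"{provider.__name__} failed: {exc}")
    # Every provider failed, so fall back to a fixed message.
    return "A human agent will assist you shortly."


async def flaky_primary(message: str) -> str:
    # Stand-in for an LLM call during a provider outage.
    raise ConnectionError("provider unavailable")


async def rule_based(message: str) -> str:
    # Stand-in for a template response.
    return f"Thanks for your message: {message!r}. We'll get back to you."


reply = asyncio.run(first_successful([flaky_primary, rule_based], "Where is my order?"))
print(reply)
```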

Stats from production:

Primary provider uptime: 99.2%
Fallback triggers: 124 times/month
User complaints about downtime: 0

The Architecture That Actually Works

After three months of iteration, here's the stack:

graph LR
    A[User] --> B[Load Balancer]
    B --> C[API Gateway]
    C --> D{Rate Limiter}
    D --> E[Message Queue]
    E --> F[Agent Pool]
    F --> G[Primary AI]
    F --> H[Fallback AI]
    F --> I[Rules Engine]
    G --> J[Validator]
    H --> J
    I --> J
    J --> K[Response Cache]
    K --> A

    L[Monitor] -.-> F
    L -.-> G
    L -.-> H
    M[Database] -.-> F

Key components:

Load balancer - Distributes traffic
Rate limiter - Protects against abuse
Message queue - Handles spikes
Agent pool - Scales horizontally
Validator - Catches hallucinations
Cache - Reduces costs 40%
Monitor - Real-time alerts
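
The cache is credited with a 40% cost reduction but isn't shown anywhere. Here is a minimal in-memory sketch of the idea. The class name, the TTL and the normalisation rule are all assumptions, and a real deployment would more likely use Redis or similar.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class ResponseCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(message: str) -> str:
        # Normalise case and whitespace so trivially different messages share a key.
        normalised = " ".join(message.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get(self, message: str) -> Optional[str]:
        entry = self._store.get(self._key(message))
        if entry is None:
            return None
        stored_at, response = entry
        if time.monotonic() - stored_at > self.ttl_seconds:
            return None  # Entry has expired
        return response

    def put(self, message: str, response: str) -> None:
        self._store[self._key(message)] = (time.monotonic(), response)


cache = ResponseCache()
cache.put("What are your hours?", "We're open 9 to 6.")
print(cache.get("  what ARE your hours? "))
```

Caching by message text is only safe for answers that don't depend on who is asking, such as opening hours or return policy. Anything touching account data should bypass it.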

Real Numbers After 3 Months

Metric	Value
Total messages	52,847
Avg response time	847ms
Uptime	99.97%
Cost per message	$0.034
User satisfaction	4.6/5.0
Hallucinations caught	38
Abuse attempts blocked	2,847
Fallback activations	124

What I'd Do Differently

If I started over today:

✅ Start with rate limiting - Day 1, not day 30
✅ Build monitoring first - You can't fix what you can't see
✅ Plan for hallucinations - They WILL happen
✅ Design fallbacks early - Don't wait for an outage
✅ Cache aggressively - 40% cost reduction, zero effort

What worked perfectly:

SQLite for conversation history (yes, SQLite in production)
Bun for API server (3x faster than Node)
Simple rule-based fallbacks (saved my reputation twice)

The Code (Open Source)

Want to see the actual implementation? I open-sourced the core components:

GitHub: github.com/Richardmsbr/atlas-ai-chat

Includes:

Rate limiter with adaptive limits
Response validator
Context manager
Fallback system
Monitoring stack

Questions for You

I'm curious about your experience:

What's your biggest AI agent production challenge?
Have you dealt with hallucinations? How did you handle it?
What's your monitoring strategy?

Drop your answers below - I respond to every comment.

Want the full deep dive? I wrote a complete Portuguese version with more code examples on my blog: blog.sakaguchi.ia.br/blog/ai-agents-producao-realidade

Connect with me:

GitHub: @Richardmsbr
Building AI agents at scale
Solutions Architect focusing on production AI systems

Images: Unsplash

Fine-Tuning LLMs on Consumer GPUs: A Practical Guide to QLoRA

Richard Sakaguchi — Wed, 10 Dec 2025 02:20:42 +0000

No A100. No cloud credits. Just a 3090 and determination.

The Myth

"You need $10,000+ in cloud compute to fine-tune an LLM."

Reality: I fine-tuned Mistral-7B on a single RTX 3090 for $0.

What is QLoRA?

QLoRA = Quantized Low-Rank Adaptation

Quantization: Compress model weights from 32-bit to 4-bit
LoRA: Train small adapter layers instead of full model
Result: 7B model fits in ~6GB VRAM instead of 28GB+
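
The arithmetic behind those numbers is simple enough to check. This sketch counts only the weights. The gap between the 3.5GB it prints for 4-bit and the ~6GB quoted above is presumably runtime overhead such as quantization constants and the CUDA context.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # a "7B" model, rounded
print(weight_memory_gb(params, 32))  # full precision
print(weight_memory_gb(params, 4))   # 4-bit quantized
```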

The Setup

Hardware

GPU: RTX 3090 (24GB) - a 3080 can work too, with a smaller batch size and sequence length
RAM: 32GB (16GB minimum)
Storage: 50GB free space

Software Stack

pip install torch transformers peft bitsandbytes trl datasets

The Code

1. Load Model in 4-bit

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

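model_name = "mistralai/Mistral-7B-Instruct-v0.3"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# The tokenizer is needed later by the trainer and for inference
tokenizer = AutoTokenizer.from_pretrained(model_name)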

2. Configure LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Alpha scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
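# Prints trainable vs. total parameter counts. With r=16 on these seven
# projections of Mistral-7B the adapters come to 41,943,040 trainable
# parameters, well under 1% of the model (20,971,520 would correspond to r=8).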
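
The trainable-parameter count can be checked by hand. LoRA adds two matrices per targeted layer, A (r x in) and B (out x r), so each layer contributes r x (in + out) parameters. The sketch below assumes Mistral-7B's published shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers). The constant and function names are made up for the example.

```python
# Shapes (in_features, out_features) of the targeted linear layers,
# assuming Mistral-7B's published configuration.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32

TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every targeted layer."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in TARGETS.values())
    return per_block * LAYERS


print(lora_param_count(16))
print(lora_param_count(8))
```

By this count, rank 16 on all seven projections gives 41,943,040 trainable parameters, and rank 8 gives half that.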

3. Prepare Dataset

from datasets import load_dataset

dataset = load_dataset(
    "RichardSakaguchiMS/brazilian-customer-service-conversations"
)

def format_example(example):
    return {
        "text": f"""<s>[INST] {example['input']} [/INST]
{example['output']}</s>"""
    }

dataset = dataset.map(format_example)

4. Train

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)

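# Keyword arguments as accepted by older TRL releases; newer releases move
# max_seq_length and dataset_text_field into SFTConfig.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text"
)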

trainer.train()
trainer.save_model("./output")  # write the final adapter where the inference step looks for it

Training Stats

Metric	Value
Dataset	10,000 examples
Epochs	3
Batch Size	4 x 4 (gradient acc)
Training Time	~4 hours
VRAM Peak	18GB
Final Loss	0.82
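
The table implies an effective batch size and a step count that are worth working out, since they decide how often `logging_steps=10` fires and roughly how long each step takes. This sketch assumes a single GPU, all 10,000 examples in the training split, and no sequence packing.

```python
import math

examples = 10_000
epochs = 3
per_device_batch = 4
grad_accum_steps = 4
training_hours = 4  # approximate, from the table above

effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = training_hours * 3600 / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(seconds_per_step, 1))
```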

Tips and Tricks

1. Gradient Checkpointing

model.gradient_checkpointing_enable()

Saves VRAM at cost of ~20% slower training.

2. Flash Attention 2

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2"
)

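Up to 2x faster, with less VRAM. Requires the flash-attn package and a GPU from the Ampere generation or newer (the 3090 qualifies).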

3. Data Quality > Quantity

1,000 high-quality examples > 100,000 noisy examples
Clean your data!
Validate format consistency
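
A cleaning pass doesn't have to be elaborate. Here is a minimal sketch that drops blank and duplicate pairs. It assumes the `input` / `output` field names used by `format_example` above. `clean_examples` and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def clean_examples(examples: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop malformed, empty, and duplicate input/output pairs."""
    seen = set()
    cleaned = []
    for example in examples:
        text_in = (example.get("input") or "").strip()
        text_out = (example.get("output") or "").strip()
        if not text_in or not text_out:
            continue  # Missing or blank field
        key = (text_in.lower(), text_out.lower())
        if key in seen:
            continue  # Duplicate, ignoring case and surrounding whitespace
        seen.add(key)
        cleaned.append({"input": text_in, "output": text_out})
    return cleaned


raw = [
    {"input": "Oi", "output": "Olá! Como posso ajudar?"},
    {"input": "oi ", "output": "Olá! Como posso ajudar?"},  # duplicate
    {"input": "", "output": "x"},                            # blank input
]
print(len(clean_examples(raw)))
```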

4. Monitor Loss Curve

If loss plateaus: increase learning rate
If loss spikes: decrease learning rate
If loss oscillates: increase the effective batch size or decrease the learning rate

Inference

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./output")

# "Customer: I want to know about my order"
prompt = "[INST] Cliente: Quero saber do meu pedido [/INST]"
# Put the input ids on the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

When NOT to Fine-Tune

Simple prompt engineering works
You have < 1,000 examples
Task is too generic (use base model)
Budget for API calls is acceptable

When TO Fine-Tune

Specific domain language (legal, medical, regional)
Consistent output format required
Privacy requirements (no cloud APIs)
Cost optimization at scale

Open Source

Model and dataset used in this guide:

Questions? Drop them in the comments!

sakaguchi.ia.br | GitHub

WhatsApp AI Bot in Production: 3 Months, 50K Messages, Zero Downtime

Richard Sakaguchi — Wed, 10 Dec 2025 02:19:57 +0000

The Challenge

My client had a problem: 200+ WhatsApp messages per day, 2 people answering, and still losing customers because response time was 2+ hours during peak times.

Their ask: "Can you make a bot that actually works?"

The Stack

WhatsApp Business API
        |
    Evolution API (self-hosted)
        |
    FastAPI Backend
        |
    Yoshii IA (Brazilian Portuguese LLM)
        |
    PostgreSQL + Redis

What Makes It Different

1. It Actually Understands Portuguese

Not translated English. Native Brazilian Portuguese.

Customer: "ce tem a blusa azul em P?"
(Informal: "u got the blue shirt in S?")

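Bot: "Temos sim! A blusa azul ta disponivel em P, M e G.
     Quer que eu reserve pra voce?"
(Translation: "Yes, we do! The blue blouse is available in S, M and L.
     Want me to reserve it for you?")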

2. Smart Handoff

Bot handles 80% of queries. Complex cases go to humans with full context:

if sentiment_score < 0.3 or is_complaint:
    handoff_to_human(
        conversation=conv,
        reason="frustrated_customer",
        context=summary
    )
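
The snippet above always reports `frustrated_customer`, even when the trigger was a complaint. Here is a small sketch of a decision function that keeps the two reasons apart. `HandoffDecision`, `decide_handoff` and the `"complaint"` label are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HandoffDecision:
    handoff: bool
    reason: Optional[str] = None


def decide_handoff(
    sentiment_score: float,
    is_complaint: bool,
    threshold: float = 0.3,
) -> HandoffDecision:
    """Decide whether a conversation should go to a human, and why."""
    if is_complaint:
        return HandoffDecision(True, "complaint")
    if sentiment_score < threshold:
        return HandoffDecision(True, "frustrated_customer")
    return HandoffDecision(False)


print(decide_handoff(sentiment_score=0.1, is_complaint=False))
print(decide_handoff(sentiment_score=0.8, is_complaint=False))
```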

3. Business Hours Awareness

def get_response(message):
    if not is_business_hours():
        return BOT_RESPONSE  # Full automation
    elif human_available():
        return HYBRID_MODE   # Bot + human
    else:
        return BOT_RESPONSE  # Fallback to bot
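
`is_business_hours` is called above but not defined. Here is a sketch of one way to write it, with the current time passed in so it can be tested. The 9:00-18:00 weekday schedule is a made-up example, not the client's actual hours.

```python
from datetime import datetime, time, timedelta, timezone

# Brazil has not observed daylight saving time since 2019, so a fixed
# UTC-3 offset is a reasonable stand-in for São Paulo local time.
SAO_PAULO = timezone(timedelta(hours=-3))
OPEN_AT, CLOSE_AT = time(9, 0), time(18, 0)  # hypothetical opening hours


def is_business_hours(now: datetime) -> bool:
    """True on weekdays between OPEN_AT (inclusive) and CLOSE_AT (exclusive)."""
    local = now.astimezone(SAO_PAULO)
    return local.weekday() < 5 and OPEN_AT <= local.time() < CLOSE_AT


print(is_business_hours(datetime.now(timezone.utc)))
```

Pass a timezone-aware `datetime`. A naive one would be interpreted in the server's local timezone, which is rarely what you want on a VPS.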

The Numbers (Real Data)

Metric	Before	After
Avg Response Time	2h 15min	12 seconds
Messages/day handled	80	200+
Staff needed	2	0.5 (oversight)
Customer satisfaction	65%	89%
Operating cost	$2,500/mo	$400/mo

Lessons Learned the Hard Way

1. Rate Limiting is Real

WhatsApp will ban you if you send too many messages too fast.

import asyncio

rate_limiter = asyncio.Semaphore(1)  # One send at a time

async def send_message(to, text):
    async with rate_limiter:
        await asyncio.sleep(1)  # Minimum delay between sends
        return await api.send(to, text)
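
A fixed sleep inside the lock works, but it also delays the very first message. Here is a sketch of a limiter object that only waits when sends arrive too close together. `MinIntervalLimiter` is a hypothetical class, and the interval is shortened in the demo so it runs quickly.

```python
import asyncio
import time
from typing import Optional


class MinIntervalLimiter:
    """Async context manager that spaces entries at least `interval` seconds apart."""

    def __init__(self, interval: float = 1.0):
        self.interval = interval
        self._lock = asyncio.Lock()
        self._last_entry: Optional[float] = None  # monotonic time of previous send

    async def __aenter__(self):
        await self._lock.acquire()
        try:
            if self._last_entry is not None:
                wait = self.interval - (time.monotonic() - self._last_entry)
                if wait > 0:
                    await asyncio.sleep(wait)
            self._last_entry = time.monotonic()
        except BaseException:
            # Don't leave the lock held if the wait is cancelled.
            self._lock.release()
            raise
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._lock.release()


async def demo() -> float:
    limiter = MinIntervalLimiter(interval=0.05)

    async def send(i: int) -> None:
        async with limiter:
            pass  # the real API call would go here

    start = time.monotonic()
    await asyncio.gather(*(send(i) for i in range(3)))
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"3 sends took {elapsed:.2f}s")
```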

2. Media Handling is Tricky

Customers send voice messages, images, videos. You need to handle all of them:

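async def handle_message(message):
    match message.type:
        case "text":
            return process_text(message)
        case "audio":
            text = await whisper_transcribe(message.audio)
            return process_text(text)
        case "image":
            return "Got your image! Let me take a look..."
        case _:
            # Videos, stickers, documents and anything else not handled yet
            return "Sorry, I can't read this kind of message yet."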

3. Context is Everything

Store conversation history. Customers hate repeating themselves:

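import json

raw = redis.get(f"conv:{phone_number}")  # bytes, or None for a new customer
history = json.loads(raw) if raw else []  # assumes history is stored as a JSON list
last_messages = history[-5:]  # Last 5 messages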

response = llm.generate(
    system="You are a helpful assistant...",
    context=last_messages,
    user_message=new_message
)

4. Graceful Degradation

LLM down? Have fallbacks:

try:
    response = await yoshii_api.generate(prompt)
except TimeoutError:
    response = FALLBACK_RESPONSES.get(
        detect_intent(message),
        "Sorry, having issues. Human will respond soon!"
    )
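
`detect_intent` and `FALLBACK_RESPONSES` aren't shown above. A keyword lookup is about the simplest thing that could work when the LLM is down. The keyword lists and canned replies below are invented for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword lists and canned replies, for illustration only.
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "order_status": ("pedido", "entrega", "rastreio"),  # order, delivery, tracking
    "opening_hours": ("horário", "abre", "fecha"),      # hours, opens, closes
}

FALLBACK_RESPONSES: Dict[str, str] = {
    "order_status": "We're checking your order and will reply shortly.",
    "opening_hours": "We're open Monday to Friday, 9:00 to 18:00.",
}

DEFAULT_REPLY = "Sorry, having issues. Human will respond soon!"


def detect_intent(message: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the message, if any."""
    text = message.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(word in text for word in keywords):
            return intent
    return None


def fallback_reply(message: str) -> str:
    return FALLBACK_RESPONSES.get(detect_intent(message), DEFAULT_REPLY)


print(fallback_reply("Quero saber do meu pedido"))
print(fallback_reply("oi"))
```

Substring matching is crude (a keyword can match inside a longer word), but it has no dependencies and fails safe. Anything unrecognised gets the default reply.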

The Architecture

+-------------+     +----------------+     +----------+
| WhatsApp    |---->| Evolution API  |---->| Webhook  |
| Cloud API   |     | (self-hosted)  |     | Handler  |
+-------------+     +----------------+     +----------+
                                                |
                    +---------------------------+
                    |
              +-----v-----+     +---------+
              | Message   |---->| Yoshii  |
              | Processor |     | LLM API |
              +-----------+     +---------+
                    |
              +-----v-----+
              | Response  |
              | Generator |
              +-----------+
                    |
              +-----v-----+
              | Queue     |-----> Send via WhatsApp
              +-----------+

Cost Breakdown

Item	Monthly Cost
WhatsApp Business API	$50
VPS (4GB RAM)	$20
LLM Inference (self-hosted)	$0
Redis Cloud	$0 (free tier)
PostgreSQL	$0 (same VPS)
Total	$70/month

Open Source

The LLM powering this is open source:

Model: yoshii-ai/Yoshii-7B-BR
Dataset: brazilian-customer-service-conversations

What's Next

Voice message processing (Whisper integration)
Proactive messaging (order status updates)
Multi-language support
Analytics dashboard

Building something similar? Happy to help in the comments!

sakaguchi.ia.br | WhatsApp

I Built a Brazilian Portuguese LLM from Scratch - Here's What I Learned

Richard Sakaguchi — Wed, 10 Dec 2025 02:18:28 +0000

The Problem

Most AI models are trained on English data. Like, 90%+ English. Portuguese? Less than 2%. Brazilian Portuguese? Even rarer.

This creates real problems:

Customer: "Tô de boa, só quero dar uma olhada"
(Translation: "I'm cool, just browsing")

GPT: "I don't understand. Could you rephrase?"

Or worse:

Customer: "Vocês aceitam PIX?"
(PIX = Brazil's instant payment system, used by 150M+ people)

GPT: "What is PIX?"

The Solution: Yoshii IA

I fine-tuned Mistral-7B on real Brazilian customer service conversations to create a model that actually understands:

🇧🇷 Brazilian slang and expressions
💳 Local context (PIX, CPF, CEP)
🗣️ Natural conversational Portuguese
🤝 Customer service best practices

The Technical Journey

1. Data Collection

I built a dataset of 10,000+ real customer service conversations in Brazilian Portuguese:

WhatsApp business chats
Support tickets
E-commerce interactions
Various industries (healthcare, retail, restaurants)

Dataset is open source: HuggingFace Dataset

2. Training Setup

Used QLoRA (Quantized LoRA) to fine-tune on consumer hardware:

import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization + LoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05
)

Hardware: Single RTX 3090 (24GB VRAM)
Training time: ~4 hours
VRAM usage: ~6GB with 4-bit quantization

3. Results

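Before (GPT-4):

User: "E aí, tudo certo?"
(Translation: "Hey, all good?")
Bot: "Olá! Como posso ajudá-lo hoje?"
(Translation: "Hello! How may I help you today?")

After (Yoshii IA):

User: "E aí, tudo certo?"
Bot: "E aí! Tudo certinho sim! E você? 😊 Em que posso ajudar?"
(Translation: "Hey! All good here! And you? 😊 How can I help?")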

Open Source Everything

🤗 Model: yoshii-ai/Yoshii-7B-BR
📊 Dataset: brazilian-customer-service-conversations
🌐 Live Demo: yoshii.sakaguchi.ia.br
💬 WhatsApp Bot: Production-ready, handling real customers

What's Next

📢 Voice support (STT + TTS)
📊 Sentiment analysis for Portuguese
🔮 Predictive customer support
🤖 Multi-agent orchestration

Lessons Learned

Data quality > quantity - 10K well-curated samples beat 100K messy ones
Cultural context matters - PIX, CPF, CEP aren't just words, they're institutions
QLoRA is magic - Fine-tuning 7B models on consumer GPUs? Yes please
Portuguese isn't one language - Brazilian Portuguese ≠ European Portuguese

Try It

# Quick inference
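from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "yoshii-ai/Yoshii-7B-BR",
    torch_dtype=torch.bfloat16,  # half the memory of the fp32 default
    device_map="auto"            # needs the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained("yoshii-ai/Yoshii-7B-BR")

# "Customer: Hi, I want to know about my order / Agent:"
prompt = "Cliente: Oi, quero saber do meu pedido\nAtendente:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Without max_new_tokens, generate() stops after a very short default length
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))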

Building for your local market? Open source your work. The community will thank you. 🙌

Questions? Comments? Let me know below!

📍 São Paulo, Brazil
🔗 sakaguchi.ia.br
💬 WhatsApp