TL;DR
I asked ChatGPT to generate rate limiting code. It gave me 50 lines that looked perfect.
Three weeks later, it cost us 14 engineering hours and nearly took down production.
The AI-generated code had 5 critical flaws:
- Memory leak (infinite growth)
- Lost state on restart
- Multi-server failure (12x limit bypass)
- Wrong client identification (load balancer IP)
- No observability
Lesson: Review AI code with production context in mind, not just syntax. Use the framework at the end of this post.
I've been working with LLMs in production for the past few years, and I've seen both the power and the pitfalls of AI-assisted development. Last month, we had a production incident that perfectly illustrates why you can't just copy-paste AI-generated code—even when it looks perfect.
Here's what happened, what we learned, and how you can avoid the same mistakes.
The Problem
Our API was getting hammered. What looked like a DDoS attack was actually legitimate traffic, but we needed rate limiting fast. I asked ChatGPT to generate a rate limiter, and it gave me code that looked perfect:
```python
from flask import Flask, request
from functools import wraps
from time import time

app = Flask(__name__)

# Simple rate limiter
rate_limit_store = {}

def rate_limit(max_requests=100, window=60):
    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            client_ip = request.remote_addr
            current_time = time()

            if client_ip not in rate_limit_store:
                rate_limit_store[client_ip] = []

            # Clean old requests
            rate_limit_store[client_ip] = [
                req_time for req_time in rate_limit_store[client_ip]
                if current_time - req_time < window
            ]

            if len(rate_limit_store[client_ip]) >= max_requests:
                return {"error": "Rate limit exceeded"}, 429

            rate_limit_store[client_ip].append(current_time)
            return f(*args, **kwargs)
        return decorated_function
    return decorator

@app.route('/api/chat')
@rate_limit(max_requests=100, window=60)
def chat_endpoint():
    # Handle chat request
    pass
```
The code was clean, well-structured, and followed best practices. It had error handling, clear logic, and would pass code review in many organizations. I reviewed it quickly and merged it.
The code worked fine for the first few days—we had low traffic, and the rate limiter seemed to be functioning. But as traffic increased and we scaled to more servers, we started seeing issues. Three weeks after deployment, we had production incidents that required 14 hours of debugging to resolve.
What Went Wrong
After 14 hours of debugging, we discovered the AI-generated code had 5 critical flaws that weren't obvious from reading it:
1. Memory Leak
The rate_limit_store dictionary grows infinitely. Every unique IP address creates a new entry that never gets cleaned up. After a week of production traffic, we'd have millions of entries consuming gigabytes of RAM.
```python
# This grows forever - no cleanup mechanism
rate_limit_store = {}
```
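To see the leak in isolation, here's a small, hypothetical simulation (not our production code). It reproduces the pattern above: old timestamps inside each list get pruned, but the per-client keys themselves are never removed, so the dictionary only ever grows.

```python
from time import time

rate_limit_store = {}

def record_request(client_id, window=60):
    now = time()
    bucket = rate_limit_store.setdefault(client_id, [])
    # Old timestamps are pruned, but the key itself is never deleted
    bucket[:] = [t for t in bucket if now - t < window]
    bucket.append(now)

# Simulate one million distinct clients hitting the API once each
for i in range(1_000_000):
    record_request(f"client-{i}")

print(len(rate_limit_store))  # 1000000 entries, and nothing will ever evict them
```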
2. Lost State on Restart
The in-memory store means rate limits reset every time we deploy. An attacker could simply wait for our daily deployment window and bypass all limits.
3. Multi-Server Failure
We run 12 API servers behind a load balancer. Each server has its own rate_limit_store, so a client can make 1,200 requests per minute (100 × 12 servers) instead of 100.
4. Wrong Client Identification
request.remote_addr gives you the load balancer's IP, not the client's IP. Every request looked like it came from the same source, so the limiter couldn't distinguish one client from another, which defeats the purpose of per-client rate limiting.
```python
# This gets the load balancer IP, not the real client
client_ip = request.remote_addr
```
5. No Observability
Zero logging, zero metrics. We had no way to know if legitimate users were hitting limits or if the rate limiter was even working.
The Fix
Once we understood the actual constraints, we rebuilt the rate limiter. Here are the key changes:
1. Shared State with Redis
Before: In-memory dictionary (lost on restart, separate per server)
```python
rate_limit_store = {}  # Problem: separate per server, lost on restart
```
After: Redis with TTL (shared across servers, persists through deploys)
```python
import redis

redis_client = redis.Redis(host='redis', decode_responses=True)

key = f"rate_limit:{identifier}:{limits['tier']}"
pipe = redis_client.pipeline()
pipe.incr(key)
pipe.expire(key, limits['window'])  # Auto-cleanup via TTL
current_requests, _ = pipe.execute()
```
2. Correct Client Identification
Before: Gets load balancer IP
```python
client_ip = request.remote_addr  # Wrong: gets load balancer IP
```
After: Extracts real client IP from header
```python
def get_client_identifier():
    """Extract true client IP from X-Forwarded-For header"""
    if request.headers.get('X-Forwarded-For'):
        return request.headers.get('X-Forwarded-For').split(',')[0].strip()
    return request.remote_addr
```
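One caveat: X-Forwarded-For is a client-supplied header, so it should only be trusted when your own load balancer sets or appends it. If you sit behind a single trusted proxy, Werkzeug's ProxyFix middleware is an alternative that rewrites request.remote_addr for you. The sketch below assumes exactly one proxy hop; adjust it to your setup.

```python
from werkzeug.middleware.proxy_fix import ProxyFix

# Trust X-Forwarded-For from exactly one proxy hop (our load balancer)
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)

# From here on, request.remote_addr is the real client IP
```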
3. Tiered Rate Limiting
Before: Same limit for everyone
```python
max_requests=100  # Same for all users
```
After: Different limits based on customer tier
```python
def get_rate_limit_tier(api_key):
    if not api_key:
        return {'requests': 100, 'window': 60, 'tier': 'anonymous'}

    tier_info = get_customer_tier(api_key)
    return {
        'requests': tier_info.get('rate_limit', 1000),  # Paid customers get more
        'window': 60,
        'tier': tier_info.get('plan', 'paid')
    }
```
4. Observability
Before: No logging or metrics
```python
# Silent failure - no way to know what's happening
```
After: Structured logging for monitoring
```python
logger.warning(
    "Rate limit exceeded",
    extra={
        'identifier': identifier,
        'tier': limits['tier'],
        'requests': current_requests,
        'limit': limits['requests'],
        'endpoint': request.path
    }
)
```
Summary of changes:
- Redis for state: Shared state across all servers, persists through deploys, automatic cleanup via TTL
- Tiered rate limiting: Different limits for anonymous users, paid customers, enterprise clients
- Correct client identification: X-Forwarded-For header with fallback
- Atomic operations: Redis pipeline ensures race conditions can't bypass limits
- Observability: Structured logging for monitoring and alerting
- Better error messages: Include `retry_after` in the 429 response so clients know when to retry (see the sketch below)
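To show how these pieces fit together, here's a minimal sketch of the resulting decorator. It's illustrative rather than our exact production code: it assumes the `redis_client`, `get_client_identifier`, and `get_rate_limit_tier` helpers from above, a module-level `logger`, and an `X-API-Key` header for identifying customers (the header name is just an example).

```python
import logging
from functools import wraps
from flask import request

logger = logging.getLogger(__name__)

def rate_limit(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Identify the real client and look up its tier (helpers defined above)
        identifier = get_client_identifier()
        limits = get_rate_limit_tier(request.headers.get('X-API-Key'))

        # One counter per client and tier; INCR + EXPIRE run in a single pipeline
        key = f"rate_limit:{identifier}:{limits['tier']}"
        pipe = redis_client.pipeline()
        pipe.incr(key)
        pipe.expire(key, limits['window'])
        current_requests, _ = pipe.execute()

        if current_requests > limits['requests']:
            retry_after = redis_client.ttl(key)  # seconds until the window resets
            logger.warning(
                "Rate limit exceeded",
                extra={
                    'identifier': identifier,
                    'tier': limits['tier'],
                    'requests': current_requests,
                    'limit': limits['requests'],
                    'endpoint': request.path
                }
            )
            return {"error": "Rate limit exceeded", "retry_after": retry_after}, 429

        return f(*args, **kwargs)
    return decorated_function

@app.route('/api/chat')
@rate_limit
def chat_endpoint():
    ...
```

Note that in this version the decorator no longer takes per-endpoint limits; they come from the customer's tier lookup instead.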
💡 Key Takeaway: AI generates code that works in isolation but fails in production because it doesn't understand your infrastructure, constraints, or operational requirements.
What We Learned
This experience taught us three critical lessons:
1. AI Generates Code That Looks Correct But Misses Production Realities
The AI-generated code followed patterns, had error handling, and looked professional. But it didn't understand:
- Our infrastructure (12 servers behind a load balancer)
- Our operational requirements (persistence, observability)
- Our business constraints (different limits for different customers)
The code worked in isolation but failed in production.
2. The "Boring" Code Is Where Bugs Live
Error handling, resource cleanup, observability—these are the "boring" parts that AI generates quickly. But they're also where most production bugs live. When you delegate this to AI without careful review, you're offloading the most critical parts.
3. You Still Need Deep Systems Understanding
LLMs can accelerate implementation, but they can't replace the reasoning that prevents disasters. You need to:
- Understand your infrastructure deeply
- Anticipate failure modes
- Design for production, not just correctness
A Framework for Reviewing AI-Generated Code
After this incident, we developed a systematic checklist for reviewing AI-generated code:
Security Checklist
- [ ] Input validation (can users inject malicious data?)
- [ ] Authentication/authorization (are permissions checked?)
- [ ] Sensitive data handling (is logging exposing secrets?)
- [ ] SQL injection / XSS vulnerabilities
Production Readiness
- [ ] Error handling (what happens when things fail?)
- [ ] Resource cleanup (memory leaks, connection pools)
- [ ] Observability (logging, metrics, tracing)
- [ ] Performance (will this scale? N+1 queries?)
Context & Integration
- [ ] Matches existing patterns (or breaks conventions?)
- [ ] Uses correct libraries (or suggests deprecated ones?)
- [ ] Fits architecture (or creates technical debt?)
- [ ] Handles edge cases (null values, empty arrays, etc.)
Understanding
- [ ] Do I understand every line?
- [ ] Can I explain why this approach was chosen?
- [ ] Do I know what will break if this changes?
The rule: If you can't confidently answer "yes" to all of these, don't merge it.
The Bottom Line
LLMs are powerful tools. They can generate syntactically correct code in seconds. But they can't understand your infrastructure, your constraints, or your operational requirements.
Use AI to accelerate implementation, but maintain ownership of the reasoning.
The code that looks perfect might have subtle bugs that only become obvious when you understand systems deeply. The "boring" code deserves extra scrutiny, not less.
And most importantly: if you can't explain why the code works, don't ship it.
What's your framework for reviewing AI-generated code? Share it in the comments below!
If you found this useful, I've documented this and 8 other production case studies in my book, Being a Software Developer After LLMs. It covers frameworks for working with LLMs strategically while maintaining your core engineering skills.