TL;DR
I asked ChatGPT to generate rate limiting code. It gave me 50 lines that looked perfect.
Three weeks later, it cost us 14 engineering hours and nearly took down production.
The AI-generated code had 5 critical flaws:
- Memory leak (infinite growth)
- Lost state on restart
- Multi-server failure (12x limit bypass)
- Wrong client identification (load balancer IP)
- No observability
Lesson: Review AI code with production context in mind, not just syntax. Use the framework at the end of this post.
I've been working with LLMs in production for the past few years, and I've seen both the power and the pitfalls of AI-assisted development. Last month, we had a production incident that perfectly illustrates why you can't just copy-paste AI-generated code—even when it looks perfect.
Here's what happened, what we learned, and how you can avoid the same mistakes.
The Problem
Our API was getting hammered. What looked like a DDoS attack was actually legitimate traffic, but we needed rate limiting fast. I asked ChatGPT to generate a rate limiter, and it gave me code that looked perfect:
```python
from flask import Flask, request
from functools import wraps
from time import time

app = Flask(__name__)

# Simple rate limiter
rate_limit_store = {}

def rate_limit(max_requests=100, window=60):
    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            client_ip = request.remote_addr
            current_time = time()

            if client_ip not in rate_limit_store:
                rate_limit_store[client_ip] = []

            # Clean old requests
            rate_limit_store[client_ip] = [
                req_time for req_time in rate_limit_store[client_ip]
                if current_time - req_time < window
            ]

            if len(rate_limit_store[client_ip]) >= max_requests:
                return {"error": "Rate limit exceeded"}, 429

            rate_limit_store[client_ip].append(current_time)
            return f(*args, **kwargs)
        return decorated_function
    return decorator

@app.route('/api/chat')
@rate_limit(max_requests=100, window=60)
def chat_endpoint():
    # Handle chat request
    pass
```
The code was clean, well-structured, and followed best practices. It had error handling, clear logic, and would pass code review in many organizations. I reviewed it quickly and merged it.
The code worked fine for the first few days—we had low traffic, and the rate limiter seemed to be functioning. But as traffic increased and we scaled to more servers, we started seeing issues. Three weeks after deployment, we had production incidents that required 14 hours of debugging to resolve.
What Went Wrong
After 14 hours of debugging, we discovered the AI-generated code had 5 critical flaws that weren't obvious from reading it:
1. Memory Leak
The rate_limit_store dictionary grows infinitely. Every unique IP address creates a new entry that never gets cleaned up. After a week of production traffic, we'd have millions of entries consuming gigabytes of RAM.
```python
# This grows forever - no cleanup mechanism
rate_limit_store = {}
```
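To see the leak in isolation, here's a small, hypothetical simulation (not our production code). It reproduces the pattern above: old timestamps inside each list get pruned, but the per-client keys themselves are never removed, so the dictionary only ever grows.

```python
from time import time

rate_limit_store = {}

def record_request(client_id, window=60):
    now = time()
    bucket = rate_limit_store.setdefault(client_id, [])
    # Old timestamps are pruned, but the key itself is never deleted
    bucket[:] = [t for t in bucket if now - t < window]
    bucket.append(now)

# Simulate one million distinct clients hitting the API once each
for i in range(1_000_000):
    record_request(f"client-{i}")

print(len(rate_limit_store))  # 1000000 entries, and nothing will ever evict them
```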
2. Lost State on Restart
The in-memory store means rate limits reset every time we deploy. An attacker could simply wait for our daily deployment window and bypass all limits.
3. Multi-Server Failure
We run 12 API servers behind a load balancer. Each server has its own rate_limit_store, so a client can make 1,200 requests per minute (100 × 12 servers) instead of 100.
4. Wrong Client Identification
request.remote_addr gives you the load balancer's IP, not the client's IP. Every request looked like it came from the same source, so the limiter couldn't distinguish one client from another, which defeats the purpose of per-client rate limiting.
```python
# This gets the load balancer IP, not the real client
client_ip = request.remote_addr
```
5. No Observability
Zero logging, zero metrics. We had no way to know if legitimate users were hitting limits or if the rate limiter was even working.
The Fix
Once we understood the actual constraints, we rebuilt the rate limiter. Here are the key changes:
1. Shared State with Redis
Before: In-memory dictionary (lost on restart, separate per server)
```python
rate_limit_store = {}  # Problem: separate per server, lost on restart
```
After: Redis with TTL (shared across servers, persists through deploys)
```python
import redis

redis_client = redis.Redis(host='redis', decode_responses=True)

key = f"rate_limit:{identifier}:{limits['tier']}"
pipe = redis_client.pipeline()
pipe.incr(key)
pipe.expire(key, limits['window'])  # Auto-cleanup via TTL
current_requests, _ = pipe.execute()
```
2. Correct Client Identification
Before: Gets load balancer IP
```python
client_ip = request.remote_addr  # Wrong: gets load balancer IP
```
After: Extracts real client IP from header
```python
def get_client_identifier():
    """Extract true client IP from X-Forwarded-For header"""
    if request.headers.get('X-Forwarded-For'):
        return request.headers.get('X-Forwarded-For').split(',')[0].strip()
    return request.remote_addr
```
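One caveat: X-Forwarded-For is a client-supplied header, so it should only be trusted when your own load balancer sets or appends it. If you sit behind a single trusted proxy, Werkzeug's ProxyFix middleware is an alternative that rewrites request.remote_addr for you. The sketch below assumes exactly one proxy hop; adjust it to your setup.

```python
from werkzeug.middleware.proxy_fix import ProxyFix

# Trust X-Forwarded-For from exactly one proxy hop (our load balancer)
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1)

# From here on, request.remote_addr is the real client IP
```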
3. Tiered Rate Limiting
Before: Same limit for everyone
```python
max_requests=100  # Same for all users
```
After: Different limits based on customer tier
```python
def get_rate_limit_tier(api_key):
    if not api_key:
        return {'requests': 100, 'window': 60, 'tier': 'anonymous'}

    tier_info = get_customer_tier(api_key)
    return {
        'requests': tier_info.get('rate_limit', 1000),  # Paid customers get more
        'window': 60,
        'tier': tier_info.get('plan', 'paid')
    }
```
4. Observability
Before: No logging or metrics
```python
# Silent failure - no way to know what's happening
```
After: Structured logging for monitoring
```python
logger.warning(
    "Rate limit exceeded",
    extra={
        'identifier': identifier,
        'tier': limits['tier'],
        'requests': current_requests,
        'limit': limits['requests'],
        'endpoint': request.path
    }
)
```
Summary of changes:
- Redis for state: Shared state across all servers, persists through deploys, automatic cleanup via TTL
- Tiered rate limiting: Different limits for anonymous users, paid customers, enterprise clients
- Correct client identification: X-Forwarded-For header with fallback
- Atomic operations: Redis pipeline ensures race conditions can't bypass limits
- Observability: Structured logging for monitoring and alerting
- Better error messages: Include `retry_after` in the 429 response so clients know when to retry (see the sketch below)
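To show how these pieces fit together, here's a minimal sketch of the resulting decorator. It's illustrative rather than our exact production code: it assumes the `redis_client`, `get_client_identifier`, and `get_rate_limit_tier` helpers from above, a module-level `logger`, and an `X-API-Key` header for identifying customers (the header name is just an example).

```python
import logging
from functools import wraps
from flask import request

logger = logging.getLogger(__name__)

def rate_limit(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Identify the real client and look up its tier (helpers defined above)
        identifier = get_client_identifier()
        limits = get_rate_limit_tier(request.headers.get('X-API-Key'))

        # One counter per client and tier; INCR + EXPIRE run in a single pipeline
        key = f"rate_limit:{identifier}:{limits['tier']}"
        pipe = redis_client.pipeline()
        pipe.incr(key)
        pipe.expire(key, limits['window'])
        current_requests, _ = pipe.execute()

        if current_requests > limits['requests']:
            retry_after = redis_client.ttl(key)  # seconds until the window resets
            logger.warning(
                "Rate limit exceeded",
                extra={
                    'identifier': identifier,
                    'tier': limits['tier'],
                    'requests': current_requests,
                    'limit': limits['requests'],
                    'endpoint': request.path
                }
            )
            return {"error": "Rate limit exceeded", "retry_after": retry_after}, 429

        return f(*args, **kwargs)
    return decorated_function

@app.route('/api/chat')
@rate_limit
def chat_endpoint():
    ...
```

Note that in this version the decorator no longer takes per-endpoint limits; they come from the customer's tier lookup instead.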
💡 Key Takeaway: AI generates code that works in isolation but fails in production because it doesn't understand your infrastructure, constraints, or operational requirements.
What We Learned
This experience taught us three critical lessons:
1. AI Generates Code That Looks Correct But Misses Production Realities
The AI-generated code followed patterns, had error handling, and looked professional. But it didn't understand:
- Our infrastructure (12 servers behind a load balancer)
- Our operational requirements (persistence, observability)
- Our business constraints (different limits for different customers)
The code worked in isolation but failed in production.
2. The "Boring" Code Is Where Bugs Live
Error handling, resource cleanup, observability—these are the "boring" parts that AI generates quickly. But they're also where most production bugs live. When you delegate this to AI without careful review, you're offloading the most critical parts.
3. You Still Need Deep Systems Understanding
LLMs can accelerate implementation, but they can't replace the reasoning that prevents disasters. You need to:
- Understand your infrastructure deeply
- Anticipate failure modes
- Design for production, not just correctness
A Framework for Reviewing AI-Generated Code
After this incident, we developed a systematic checklist for reviewing AI-generated code:
Security Checklist
- [ ] Input validation (can users inject malicious data?)
- [ ] Authentication/authorization (are permissions checked?)
- [ ] Sensitive data handling (is logging exposing secrets?)
- [ ] SQL injection / XSS vulnerabilities
Production Readiness
- [ ] Error handling (what happens when things fail?)
- [ ] Resource cleanup (memory leaks, connection pools)
- [ ] Observability (logging, metrics, tracing)
- [ ] Performance (will this scale? N+1 queries?)
Context & Integration
- [ ] Matches existing patterns (or breaks conventions?)
- [ ] Uses correct libraries (or suggests deprecated ones?)
- [ ] Fits architecture (or creates technical debt?)
- [ ] Handles edge cases (null values, empty arrays, etc.)
Understanding
- [ ] Do I understand every line?
- [ ] Can I explain why this approach was chosen?
- [ ] Do I know what will break if this changes?
The rule: If you can't confidently answer "yes" to all of these, don't merge it.
The Bottom Line
LLMs are powerful tools. They can generate syntactically correct code in seconds. But they can't understand your infrastructure, your constraints, or your operational requirements.
Use AI to accelerate implementation, but maintain ownership of the reasoning.
The code that looks perfect might have subtle bugs that only become obvious when you understand systems deeply. The "boring" code deserves extra scrutiny, not less.
And most importantly: if you can't explain why the code works, don't ship it.
What's your framework for reviewing AI-generated code? Share it in the comments below!
If you found this useful, I've documented this and 8 other production case studies in my book, Being a Software Developer After LLMs. It covers frameworks for working with LLMs strategically while maintaining your core engineering skills.