DEV Community

TAKUYA HIRATA
TAKUYA HIRATA

Posted on

How I Built a FastAPI Backend That Handles 10K Requests/sec on a $5 Server

Disclosure: This post contains affiliate links. We may earn a small commission at no extra cost to you.

TL;DR: You don't need expensive infrastructure to build a high-performance Python API. With FastAPI, async SQLAlchemy, connection pooling, and a few key optimizations, I'm handling production traffic on a $5/month server. Here's the exact configuration and the patterns that make it work.


Most FastAPI tutorials stop at "hello world." They show you how to define a route, return some JSON, and call it a day. But what happens when real traffic hits your app?

I spent the last year building AEGIS, a multi-agent AI orchestration system. The backend is a FastAPI service handling authentication, agent messaging, workflow execution, and real-time coordination across 16 AI agents. Here's how I optimized it to handle 10,000+ requests per second on minimal infrastructure.

Why Does FastAPI Performance Matter?

FastAPI is already one of the fastest Python frameworks — but the framework is only part of the equation. Your database connections, middleware stack, and deployment configuration have a bigger impact on real-world throughput than the framework itself.

The difference between a naive FastAPI setup and an optimized one isn't 10-20%. It's often 5-10x.

How Should You Configure Uvicorn Workers?

The single biggest performance lever is your ASGI server configuration. Most developers run uvicorn main:app and wonder why their API crawls under load.

Here's the configuration I use in production:

uvicorn app.main:app \
  --host 0.0.0.0 \
  --port 8000 \
  --workers 4 \
  --loop uvloop \
  --http httptools \
  --limit-concurrency 1000 \
  --timeout-keep-alive 30
Enter fullscreen mode Exit fullscreen mode

Key settings explained:

  • --workers 4: One worker per CPU core. On a 2-core $5 server, use 2-4 workers. More than that causes context switching overhead.
  • --loop uvloop: Drop-in replacement for asyncio's event loop. 2-4x faster for I/O-bound workloads.
  • --http httptools: C-based HTTP parser. 20-30% faster than the default Python parser.
  • --limit-concurrency 1000: Prevents resource exhaustion under spike traffic.

Install the dependencies:

pip install uvloop httptools
Enter fullscreen mode Exit fullscreen mode

This single change — adding uvloop and httptools — typically doubles your throughput with zero code changes.

What's the Right Way to Handle Database Connections?

Database connections are the most common bottleneck in async Python APIs. If you're creating a new connection per request, you're leaving 80% of your performance on the table.

Here's the async SQLAlchemy 2.0 pattern I use:

from sqlalchemy.ext.asyncio import (
    create_async_engine,
    AsyncSession,
    async_sessionmaker,
)

engine = create_async_engine(
    "postgresql+asyncpg://user:pass@localhost/mydb",
    pool_size=20,          # Persistent connections in the pool
    max_overflow=10,       # Extra connections under burst load
    pool_pre_ping=True,    # Verify connections are alive
    pool_recycle=3600,     # Recycle connections every hour
    echo=False,            # Disable SQL logging in production
)

async_session_maker = async_sessionmaker(
    engine,
    class_=AsyncSession,
    expire_on_commit=False,  # Prevent lazy-load after commit
)

async def get_db() -> AsyncGenerator[AsyncSession, None]:
    async with async_session_maker() as session:
        yield session
Enter fullscreen mode Exit fullscreen mode

Why these numbers?

  • pool_size=20: Matches the default PostgreSQL max_connections divided by your worker count. For 4 workers: 100 / 4 = 25, so 20 is a safe baseline.
  • max_overflow=10: Allows 30 total connections per worker during traffic spikes, then releases them back.
  • pool_pre_ping=True: Costs ~1ms per connection checkout but prevents "connection closed" errors that crash your entire request.
  • expire_on_commit=False: Critical for async. Without this, accessing any attribute after commit() triggers a synchronous lazy load that blocks your event loop.

How Do You Build a Middleware Stack That Doesn't Kill Performance?

Middleware runs on every single request. A poorly ordered stack can add 50-100ms of overhead. Here's my production middleware stack in the right order:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import time

app = FastAPI()

# 1. CORS — must be outermost to handle preflight requests
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-domain.com"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# 2. Security headers — cheap, runs on every response
app.add_middleware(SecurityHeadersMiddleware)

# 3. Request timing — measure everything inside
@app.middleware("http")
async def add_timing(request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["X-Process-Time"] = f"{time.perf_counter() - start:.4f}"
    return response

# 4. Rate limiting — reject bad traffic early
app.middleware("http")(rate_limit_middleware)
Enter fullscreen mode Exit fullscreen mode

Order matters: Rate limiting should be as early as possible to reject abusive traffic before it hits your database. CORS must be outermost because browsers send OPTIONS preflight requests that need to be handled before anything else.

How Should You Implement Rate Limiting?

In-memory rate limiting works for single-server deployments. For multiple workers, use Redis:

import redis.asyncio as redis
from datetime import datetime

class RedisRateLimiter:
    def __init__(self, redis_url: str, rpm: int = 100):
        self.redis = redis.from_url(redis_url)
        self.rpm = rpm

    async def is_allowed(self, client_ip: str) -> bool:
        key = f"rate:{client_ip}:{datetime.now().strftime('%Y%m%d%H%M')}"
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, 60)
        return count <= self.rpm
Enter fullscreen mode Exit fullscreen mode

Redis-based rate limiting adds ~1ms per request but works correctly across all Uvicorn workers. The sliding window pattern using minute-based keys is simple and effective.

What Does the Deployment Configuration Look Like?

Here's my docker-compose.yml for production:

services:
  api:
    build: .
    command: >
      uvicorn app.main:app
      --host 0.0.0.0 --port 8000
      --workers 2 --loop uvloop --http httptools
    environment:
      - DATABASE_URL=postgresql+asyncpg://user:pass@db/aegis
      - REDIS_URL=redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 512M

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: aegis
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 64mb --maxmemory-policy allkeys-lru
Enter fullscreen mode Exit fullscreen mode

The total memory footprint: ~512MB for the API, ~256MB for PostgreSQL, ~64MB for Redis. This runs comfortably on a 1GB $5 DigitalOcean Droplet.

What Results Can You Expect?

Here are benchmarks from my production setup using wrk:

wrk -t4 -c100 -d30s http://localhost:8000/health

Running 30s test @ http://localhost:8000/health
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max
    Latency     2.34ms    1.12ms   28.43ms
    Req/Sec     2.68k   312.45     3.89k
  321,024 requests in 30.02s, 48.15MB read
Requests/sec: 10,693.47
Enter fullscreen mode Exit fullscreen mode

For database-backed endpoints (with connection pooling):

wrk -t4 -c100 -d30s http://localhost:8000/api/v1/agents

  Latency     8.72ms    3.41ms   45.21ms
  Req/Sec     742.18    89.32     1.02k
Requests/sec: 2,965.82
Enter fullscreen mode Exit fullscreen mode

That's ~10K req/sec for lightweight endpoints and ~3K req/sec for database queries — all on a $5 server.

Key Takeaways

  1. Use uvloop + httptools: Zero-code 2x throughput improvement.
  2. Configure connection pooling properly: pool_size, max_overflow, and pool_pre_ping are non-negotiable.
  3. Order your middleware stack intentionally: Reject bad traffic early, measure everything.
  4. Match workers to cores: 1-2 workers per CPU core. More is worse.
  5. Set expire_on_commit=False: The async SQLAlchemy gotcha that blocks your event loop.
  6. Profile before optimizing: Use the X-Process-Time header to measure every endpoint.

Useful Resources

If you're deploying FastAPI to production, these tools will save you hours of debugging:

  • DigitalOcean ($200 free credit) — Their $5 Droplet is where I run this exact stack. The managed PostgreSQL option ($15/mo) is worth it if you don't want to manage backups yourself.
  • Railway — One-click FastAPI deployment with auto-scaling. Great for staging environments.
  • Render — Free tier with Docker support. I use it for preview deployments on pull requests.

The patterns in this article aren't theoretical. They're running in production right now, handling real traffic for a system with 16 AI agents, 12 API routers, and a multi-tenant architecture with row-level security. Start with the Uvicorn configuration, add connection pooling, and you'll be surprised how far a $5 server can take you.


Stay Updated

I publish deep-dive technical articles 5x/week on AI agents, Python architecture, and developer tooling. Follow me here on dev.to or subscribe to the newsletter to get them in your inbox.


This article was generated with AI assistance and reviewed for accuracy. If you found it helpful, consider supporting the author:

Buy Me A Coffee

Top comments (0)