Introduction
When most developers think of Python web performance, they think "slow." Frameworks like Flask and Django are beloved for developer experience, but rarely win benchmarking contests. FastPySGI-WSGI challenges that assumption entirely.
In the HttpArena benchmark suite -- a standardized HTTP framework benchmark platform running on dedicated 64-core hardware with 18 test profiles -- FastPySGI-WSGI delivers numbers that rival Rust and Go implementations. We're talking 1.3 million RPS on baseline tests and 707K RPS while processing JSON.
Let's break down how it works, why it's fast, and what lessons we can take away.
What Is FastPySGI?
FastPySGI is an ultra-fast WSGI/ASGI server for Python built on top of libuv -- the same C-based event loop that powers Node.js. Unlike traditional Python servers (Gunicorn, Uvicorn), FastPySGI bypasses Python's asyncio entirely and handles networking at the C level.
The "WSGI" variant specifically uses the standard WSGI interface, meaning it's synchronous Python code running on an asynchronous C event loop. This is a critical architectural choice: you get libuv's raw networking speed without requiring async/await in your application code.
Repository: https://github.com/remittor/fastpysgi
The Architecture: Fewer Layers, More Speed
Minimal Dependencies
The entire dependency list fits on a sticky note:
```text
fastpysgi==0.4
orjson==3.10.15
psycopg[binary]==3.2.4
psycopg_pool==3.2.6
```
Four packages. Compare that to FastAPI's 20+ transitive dependencies or Django's sprawling ecosystem. Every layer you remove is latency you eliminate.
Single-File Application
The entire benchmark implementation is a single 349-line Python file. No framework overhead, no middleware chains, no decorator magic. Just a WSGI callable:
```python
def app(env, start_response):
    method = env["REQUEST_METHOD"]
    path = env["PATH_INFO"]

    if method not in ("GET", "POST"):
        return respond_405(start_response)

    if path == "/pipeline":
        return respond_ok(start_response)
    elif path == "/baseline11":
        return handle_baseline(env, start_response)
    # ... more routes
```
Routing is a simple if/elif chain. No regex compilation, no route tree traversal, no parameter extraction framework. For a benchmark, this is the right call -- every nanosecond in routing overhead gets multiplied by millions of requests.
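For contrast, here is a hedged sketch of the dict-dispatch alternative the author chose not to use. The handler names and bodies are illustrative, not from the benchmark source:

```python
# Hypothetical dict-dispatch variant of the same routing. A dict lookup
# is O(1) in the number of routes, while an if/elif chain is O(n) --
# though for a handful of routes the chain can win by skipping the hash.
def respond_ok(start_response):
    body = b"OK"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

def respond_404(start_response):
    body = b"Not Found"
    start_response("404 Not Found", [("Content-Type", "text/plain"),
                                     ("Content-Length", str(len(body)))])
    return [body]

ROUTES = {
    "/pipeline": respond_ok,
    # ... map each remaining path to its handler
}

def app(env, start_response):
    return ROUTES.get(env["PATH_INFO"], respond_404)(start_response)
```

With only a dozen-odd routes and millions of requests per second, the if/elif chain is a defensible micro-optimization; the dict wins as route counts grow.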
Multi-Worker Model
FastPySGI spawns one worker per available CPU core:
```python
WRK_COUNT = min(len(os.sched_getaffinity(0)), 128)
```
Each worker runs its own libuv event loop, and the OS distributes connections across them. On the benchmark's 64-core machine, that's 64 workers hammering through requests in parallel.
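The OS-level mechanism behind that distribution can be sketched in plain Python: with SO_REUSEPORT, each worker binds its own listening socket on the same port and the Linux kernel load-balances accepted connections across them. This is an illustrative, Linux-only sketch, not FastPySGI's actual C implementation:

```python
import os
import socket

# Linux-only sketch of the prefork mechanism: every worker opens its
# own listening socket on the same port with SO_REUSEPORT, and the
# kernel spreads incoming connections across them. FastPySGI's real
# worker setup lives in C/libuv; this only shows the OS side.

WRK_COUNT = min(len(os.sched_getaffinity(0)), 128)

def make_listener(port: int = 8000) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(16 * 1024)  # matches the benchmark's 16K backlog
    return sock

# Each of WRK_COUNT forked or spawned workers would call
# make_listener() and then run its own accept loop.
```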
Performance-Critical Design Choices
1. Pre-Loaded Static Files
Static files aren't read from disk on each request. They're loaded entirely into memory at startup:
STATIC_DIR = "/data/static"
static_files = {}
for fname in os.listdir(STATIC_DIR):
fpath = os.path.join(STATIC_DIR, fname)
with open(fpath, "rb") as f:
content = f.read()
mime = get_mime_type(fname)
static_files["/" + fname] = (content, mime)
When a request hits /static/main.css, serving it is a dictionary lookup that returns a reference to bytes already resident in memory. No disk I/O, no filesystem syscalls.
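A minimal sketch of that request-time path; the sample entry and handler name are illustrative, not from the benchmark source:

```python
# Illustrative sketch of memory-resident static serving: one dict
# lookup, then a reference to an already-built bytes object.
static_files = {
    "/main.css": (b"body { margin: 0 }", "text/css"),
}

def serve_static(path, start_response):
    entry = static_files.get(path)
    if entry is None:
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]
    content, mime = entry
    start_response("200 OK", [("Content-Type", mime),
                              ("Content-Length", str(len(content)))])
    return [content]  # the bytes never touch the disk at request time
```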
2. Fast JSON with orjson
The standard library json module, even with its C accelerator, is far slower than modern alternatives. FastPySGI uses orjson, a Rust-based JSON serializer that's typically 3-10x faster:
```python
import orjson

body = orjson.dumps(result)
```
For the JSON benchmark profile, this choice alone could account for hundreds of thousands of additional RPS.
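You can check that speedup claim against your own payloads with a quick timeit comparison; this sketch degrades gracefully when orjson isn't installed:

```python
import json
import timeit

# Micro-benchmark sketch: serialize a small payload 10,000 times with
# the stdlib, then with orjson when available. Exact ratios vary with
# payload shape; the 3-10x figure is a rough range, not a guarantee.
data = {"id": 1, "name": "widget", "tags": ["a", "b", "c"], "active": True}

stdlib_time = timeit.timeit(lambda: json.dumps(data), number=10_000)

try:
    import orjson  # Rust-based; note it returns bytes, not str
    orjson_time = timeit.timeit(lambda: orjson.dumps(data), number=10_000)
    print(f"stdlib/orjson speed ratio: {stdlib_time / orjson_time:.1f}x")
except ImportError:
    print("orjson not installed; stdlib time:", stdlib_time)
```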
3. Pre-Compressed Responses
For the compression test, the large JSON dataset is compressed once at startup, not on every request:
```python
large_buf = orjson.dumps(large_dataset)
compressed = zlib.compress(large_buf, level=1)
```
Every request gets the cached compressed buffer. No CPU burned on repeated compression.
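Here is a self-contained sketch of the pattern, with stdlib json standing in for orjson and an illustrative dataset. One detail worth noting: zlib.compress produces the zlib ("deflate") wire format, so the cached response needs a matching Content-Encoding header:

```python
import json
import zlib

# Startup: build and compress the payload exactly once.
# (large_dataset is an illustrative stand-in for the benchmark's data.)
large_dataset = [{"id": i, "value": i * i} for i in range(1000)]
large_buf = json.dumps(large_dataset).encode()
COMPRESSED = zlib.compress(large_buf, level=1)  # zlib ("deflate") format

HEADERS = [
    ("Content-Type", "application/json"),
    ("Content-Encoding", "deflate"),  # must match the compression format
    ("Content-Length", str(len(COMPRESSED))),
]

def serve_compressed(start_response):
    # Request time: hand back the cached buffer; no per-request CPU.
    start_response("200 OK", HEADERS)
    return [COMPRESSED]
```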
4. Tuned Server Parameters
Socket backlog and read buffers are explicitly tuned for high throughput:
```python
fastpysgi.server.backlog = 16 * 1024          # 16K pending connections
fastpysgi.server.read_buffer_size = 256000    # 256 KB read buffer
```
These aren't arbitrary numbers -- they're sized for the benchmark's connection patterns (up to 16,384 concurrent connections).
5. Thread-Local SQLite with MMAP
Database tests use thread-local SQLite connections with memory-mapped I/O:
```python
db_local = threading.local()

def get_db():
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect("/data/benchmark.db")
        conn.execute(f"PRAGMA mmap_size={256 * 1024 * 1024}")  # 256 MiB
        db_local.conn = conn
    return db_local.conn
```
MMAP lets SQLite read database pages directly out of the OS page cache via memory mapping, skipping read() syscalls and buffer copies and significantly reducing query latency.
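A usage sketch of the thread-local pattern; the ":memory:" path, the users table, and the query are illustrative stand-ins for the benchmark's /data/benchmark.db and its schema:

```python
import sqlite3
import threading

# Each worker thread lazily opens its own connection, so there is no
# lock contention on a shared connection object.
DB_PATH = ":memory:"  # illustrative; the benchmark uses /data/benchmark.db
db_local = threading.local()

def get_db() -> sqlite3.Connection:
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect(DB_PATH)
        conn.execute("PRAGMA mmap_size=268435456")  # 256 MiB; no-op for :memory:
        db_local.conn = conn
    return db_local.conn

def fetch_user(user_id: int):
    # Subsequent calls on the same thread reuse the cached connection.
    return get_db().execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
```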
6. PostgreSQL Connection Pooling
For async database tests, a bounded connection pool prevents connection storm overhead:
```python
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    conninfo="host=...",
    min_size=2,
    max_size=3,
)
```
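The mechanism is worth seeing in miniature. This generic, illustrative pool (not psycopg_pool's implementation) shows why bounding helps: connections are created once and recycled, so heavy traffic reuses at most max_size sockets instead of opening one per request:

```python
import queue

# Generic bounded pool sketch. Callers block when every connection is
# busy, rather than opening new ones -- that back-pressure is what
# prevents a connection storm against the database.
class BoundedPool:
    def __init__(self, factory, max_size: int):
        self._q = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._q.put(factory())  # eager creation, like min_size == max_size

    def acquire(self):
        return self._q.get()  # blocks until a connection is free

    def release(self, conn) -> None:
        self._q.put(conn)
```

psycopg_pool packages the same idea behind `with pool.connection() as conn:`, which checks a connection out and returns it to the pool automatically.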
The Benchmark Numbers
All benchmarks run on identical 64-core dedicated hardware via Docker containers, using h2load as the load generator with 64 threads. Duration: 5 seconds per run, best of 3 kept.
Baseline (Simple Response)
| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 512 | 1,301,932 | 392us | 2.00ms | 408 MiB |
| 4,096 | 1,371,836 | 2.99ms | 33.80ms | 922 MiB |
| 16,384 | 1,324,561 | 11.93ms | 60.10ms | 2.5 GiB |
Over 1.3 million requests per second on a simple response. Latency stays sub-millisecond at 512 connections.
JSON Processing
| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 707,282 | 4.56ms | 17.20ms | 5.63 GB/s |
| 16,384 | 670,914 | 21.88ms | 67.70ms | 5.34 GB/s |
707K RPS while parsing and serializing JSON, pushing 5.6 GB/s of bandwidth. The orjson investment pays off massively here.
Static File Serving
| Connections | RPS | Avg Latency | Bandwidth |
|---|---|---|---|
| 4,096 | 724,526 | 3.00ms | 10.91 GB/s |
Nearly 11 GB/s of throughput from pre-loaded static files. Memory-resident serving eliminates disk I/O entirely.
Async Database (PostgreSQL)
| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 1,024 | 79,200 | 12.16ms | 31.10ms | 1.0 GiB |
Even with real PostgreSQL queries over the network, it sustains 79K RPS with reasonable latency.
Mixed Workload (Realistic Traffic)
| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 53,005 | 72.87ms | 658.60ms | 1.70 GB/s |
| 16,384 | 48,546 | 312.08ms | 2.09s | 1.56 GB/s |
The mixed test blends baseline, JSON, database, upload, and compression requests -- a more realistic workload. Still delivers 53K RPS.
How Does It Compare to Other Python Frameworks?
While exact apples-to-apples comparisons depend on the specific benchmark run, the architectural differences are telling:
| Aspect | FastPySGI-WSGI | FastAPI | Flask | Django |
|---|---|---|---|---|
| Server | Built-in (libuv) | Uvicorn (asyncio) | Gunicorn (prefork) | Gunicorn (prefork) |
| Event Loop | libuv (C) | asyncio / uvloop (Cython + libuv) | None | None |
| Dependencies | 4 | 20+ | 8+ | 15+ |
| Routing | if/elif chain | Decorator + Starlette | Decorator + Werkzeug | URL patterns + ORM |
| JSON | orjson (Rust) | stdlib json | stdlib json | stdlib json |
The key insight: FastPySGI removes Python from the hot path of networking. The event loop, connection handling, and buffer management all happen in C (libuv). Python only runs for application logic -- routing, data processing, response building.
Lessons for Your Own Projects
You probably shouldn't rewrite your production Flask app as a raw WSGI handler. But there are transferable lessons:
1. Know Your Bottleneck
FastPySGI proves that Python application code isn't usually the bottleneck -- it's the layers between the OS and your code. If you're I/O bound, the event loop implementation matters more than your language choice.
2. Pre-compute What You Can
Pre-loading static files, pre-compressing responses, and pre-serializing datasets at startup are techniques that work in any framework. If data doesn't change per-request, don't process it per-request.
3. Choose Your Serializer Wisely
Swapping json for orjson is a one-line change in most Python projects and can yield 3-10x faster serialization. For API-heavy services, this is low-hanging fruit.
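One hedged way to make the swap without touching every call site is a small shim that prefers orjson and falls back to the stdlib. The name json_dumps is illustrative; note that orjson returns bytes:

```python
# Drop-in shim: use orjson when installed, stdlib json otherwise.
# Both branches return compact bytes so callers see one interface.
try:
    import orjson

    def json_dumps(obj) -> bytes:
        return orjson.dumps(obj)
except ImportError:
    import json

    def json_dumps(obj) -> bytes:
        return json.dumps(obj, separators=(",", ":")).encode()
```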
4. Tune Your Server Parameters
Most developers never touch socket backlog, buffer sizes, or connection pool bounds. The defaults are conservative. If you know your traffic patterns, tuning these can unlock significant performance.
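As an illustration, Gunicorn exposes these knobs in an ordinary config file. The values below are example starting points to adapt to your own traffic, not recommendations:

```python
# gunicorn.conf.py -- illustrative values, sized for discussion only.
# All four are documented Gunicorn settings.
backlog = 16 * 1024        # pending-connection queue (Gunicorn default: 2048)
workers = 8                # commonly tied to CPU core count
worker_connections = 1000  # per-worker cap for async worker classes
keepalive = 5              # seconds to hold idle keep-alive connections
```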
5. Fewer Dependencies = Fewer Layers
Every middleware, every abstraction, every framework feature adds overhead. When performance matters, audit your dependency tree and question whether each layer is earning its keep.
Conclusion
FastPySGI-WSGI demonstrates that Python can compete at the highest levels of HTTP performance when you strip away the abstractions and let C do what C does best. By building on libuv, minimizing dependencies, and making smart caching decisions, it achieves numbers that most developers would associate with Rust or Go.
The HttpArena project (https://www.http-arena.com/) provides a fascinating lens into how different frameworks and languages approach the same problems. FastPySGI-WSGI stands out not because it reinvents Python, but because it strategically removes Python from the parts of the stack where it's slowest.
Whether you're building the next high-performance Python server or just optimizing your existing API, the principles behind FastPySGI's design are worth studying.
All benchmark data from HttpArena, run on dedicated 64-core hardware with standardized Docker containers. Results reflect framework performance under controlled conditions.
Check out the HttpArena repository on GitHub to explore how 78+ frameworks compare: https://github.com/MDA2AV/HttpArena