DEV Community

Benny

FastPySGI-WSGI: How a Libuv-Powered Python Server Hits 7.5 Million Requests Per Second

Introduction

When most developers think of Python web performance, they think "slow." Frameworks like Flask and Django are beloved for developer experience, but rarely win benchmarking contests. FastPySGI-WSGI challenges that assumption entirely.

In the HttpArena benchmark suite -- a standardized HTTP framework benchmark platform running on dedicated 64-core hardware with 18 test profiles -- FastPySGI-WSGI delivers numbers that rival Rust and Go implementations. We're talking 1.3 million RPS on baseline tests and 707K RPS while processing JSON.

Let's break down how it works, why it's fast, and what lessons we can take away.


What Is FastPySGI?

FastPySGI is an ultra-fast WSGI/ASGI server for Python built on top of libuv -- the same C-based event loop that powers Node.js. Unlike traditional Python servers (Gunicorn, Uvicorn), FastPySGI bypasses Python's asyncio entirely and handles networking at the C level.

The "WSGI" variant specifically uses the standard WSGI interface, meaning it's synchronous Python code running on an asynchronous C event loop. This is a critical architectural choice: you get libuv's raw networking speed without requiring async/await in your application code.
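To make the interface concrete, here is a minimal WSGI application (the names are illustrative, not from the benchmark file). It is plain synchronous Python with no async/await anywhere, yet any WSGI server, FastPySGI included, can drive it from an asynchronous event loop:

```python
def app(environ, start_response):
    # environ is a plain dict describing the request (CGI-style keys);
    # start_response sends the status line and response headers.
    body = b"Hello, WSGI"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    # A WSGI app returns an iterable of byte strings.
    return [body]
```

For a quick local smoke test without any third-party server, the stdlib reference server can run the same callable: `wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever()`.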

Repository: https://github.com/remittor/fastpysgi


The Architecture: Fewer Layers, More Speed

Minimal Dependencies

The entire dependency list fits on a sticky note:

```
fastpysgi==0.4
orjson==3.10.15
psycopg[binary]==3.2.4
psycopg_pool==3.2.6
```

Four packages. Compare that to FastAPI's 20+ transitive dependencies or Django's sprawling ecosystem. Every layer you remove is latency you eliminate.

Single-File Application

The entire benchmark implementation is a single 349-line Python file. No framework overhead, no middleware chains, no decorator magic. Just a WSGI callable:

```python
def app(env, start_response):
    method = env["REQUEST_METHOD"]
    path = env["PATH_INFO"]

    if method not in ("GET", "POST"):
        return respond_405(start_response)

    if path == "/pipeline":
        return respond_ok(start_response)
    elif path == "/baseline11":
        return handle_baseline(env, start_response)
    # ... more routes
```

Routing is a simple if/elif chain. No regex compilation, no route tree traversal, no parameter extraction framework. For a benchmark, this is the right call -- every nanosecond in routing overhead gets multiplied by millions of requests.
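An if/elif chain scans routes linearly, which is fine for a handful of paths. If route counts grow, the same zero-framework spirit can be kept with a dict dispatch; a sketch (handler names here are hypothetical, not from the benchmark file):

```python
def handle_pipeline(env, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

def handle_not_found(env, start_response):
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

# One hash lookup per request instead of a linear scan over routes.
ROUTES = {
    "/pipeline": handle_pipeline,
}

def app(env, start_response):
    handler = ROUTES.get(env["PATH_INFO"], handle_not_found)
    return handler(env, start_response)
```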

Multi-Worker Model

FastPySGI spawns one worker per available CPU core:

```python
WRK_COUNT = min(len(os.sched_getaffinity(0)), 128)
```

Each worker runs its own libuv event loop, and the OS distributes connections across them. On the benchmark's 64-core machine, that's 64 workers hammering through requests in parallel.
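One caveat: os.sched_getaffinity is Linux-only. A portable version of the same worker-count calculation might look like this (a sketch; the 128 cap mirrors the benchmark code):

```python
import os

def worker_count(cap=128):
    # Prefer the CPUs this process may actually run on (this respects
    # container/cgroup pinning); fall back to the machine-wide count
    # on platforms without sched_getaffinity.
    try:
        cpus = len(os.sched_getaffinity(0))
    except AttributeError:
        cpus = os.cpu_count() or 1
    return min(cpus, cap)
```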


Performance-Critical Design Choices

1. Pre-Loaded Static Files

Static files aren't read from disk on each request. They're loaded entirely into memory at startup:

```python
STATIC_DIR = "/data/static"
static_files = {}

for fname in os.listdir(STATIC_DIR):
    fpath = os.path.join(STATIC_DIR, fname)
    with open(fpath, "rb") as f:
        content = f.read()
    mime = get_mime_type(fname)
    static_files["/" + fname] = (content, mime)
```

When a request hits /static/main.css, it's a dictionary lookup and a pointer return. Zero disk I/O, zero syscalls.
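The pattern is easy to reproduce with only the standard library. A sketch, with mimetypes.guess_type standing in for the article's get_mime_type helper:

```python
import mimetypes
import os

def preload_static(static_dir):
    # Startup: read every file once and keep it in memory.
    cache = {}
    for fname in os.listdir(static_dir):
        fpath = os.path.join(static_dir, fname)
        with open(fpath, "rb") as f:
            content = f.read()
        mime = mimetypes.guess_type(fname)[0] or "application/octet-stream"
        cache["/" + fname] = (content, mime)
    return cache

def serve_static(cache, path):
    # Request time: a single dict lookup, no disk I/O.
    return cache.get(path)
```

The obvious trade-off is memory and staleness: this only works when the static set fits in RAM and doesn't change while the server runs.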

2. Fast JSON with orjson

The standard library's json module is flexible but slow on hot paths. FastPySGI uses orjson, a Rust-based JSON serializer that is typically 3-10x faster:

```python
import orjson

body = orjson.dumps(result)
```

For the JSON benchmark profile, this choice alone could account for hundreds of thousands of additional RPS.

3. Pre-Compressed Responses

For the compression test, the large JSON dataset is compressed once at startup, not on every request:

```python
large_buf = orjson.dumps(large_dataset)
compressed = zlib.compress(large_buf, level=1)
```

Every request gets the cached compressed buffer; no CPU is burned re-compressing the same payload on every hit.
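The same compress-once pattern works with the standard library alone. A sketch (the payload is made up; level=1 trades compression ratio for speed, as in the benchmark code):

```python
import json
import zlib

# Startup: serialize and compress the payload exactly once.
large_dataset = {"rows": list(range(10_000))}
large_buf = json.dumps(large_dataset).encode()
COMPRESSED = zlib.compress(large_buf, level=1)

def handle_compressed(start_response):
    # Request time: hand back the cached buffer; zero compression work.
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Encoding", "deflate"),
        ("Content-Length", str(len(COMPRESSED))),
    ])
    return [COMPRESSED]
```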

4. Tuned Server Parameters

Socket backlog and read buffers are explicitly tuned for high throughput:

```python
fastpysgi.server.backlog = 16 * 1024        # 16K pending connections
fastpysgi.server.read_buffer_size = 256000  # 256KB read buffer
```

These aren't arbitrary numbers -- they're sized for the benchmark's connection patterns (up to 16,384 concurrent connections).

5. Thread-Local SQLite with MMAP

Database tests use thread-local SQLite connections with memory-mapped I/O:

```python
import sqlite3
import threading

db_local = threading.local()

def get_db():
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect("/data/benchmark.db")
        conn.execute(f"PRAGMA mmap_size={268*1024*1024}")  # 268MB
        db_local.conn = conn
    return db_local.conn
```

Memory-mapped I/O lets SQLite read database pages straight from the OS page cache instead of issuing read() syscalls and copying buffers, which meaningfully reduces query latency.
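Here is a self-contained version of the pattern you can run as-is (a sketch using a throwaway temporary database instead of the benchmark's /data/benchmark.db; the mmap_size value mirrors the benchmark's):

```python
import os
import sqlite3
import tempfile
import threading

# A throwaway database file just for this sketch.
fd, DB_PATH = tempfile.mkstemp(suffix=".db")
os.close(fd)

db_local = threading.local()

def get_db():
    # One connection per thread, created lazily on first use;
    # by default sqlite3 connections must not be shared across threads.
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect(DB_PATH)
        # Ask SQLite to memory-map up to ~268 MB of the database file.
        conn.execute(f"PRAGMA mmap_size={268 * 1024 * 1024}")
        db_local.conn = conn
    return db_local.conn
```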

6. PostgreSQL Connection Pooling

For async database tests, a bounded connection pool prevents connection storm overhead:

```python
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    conninfo="host=...",
    min_size=2,
    max_size=3,
)
```

The Benchmark Numbers

All benchmarks run on identical 64-core dedicated hardware via Docker containers, using h2load as the load generator with 64 threads. Duration: 5 seconds per run, best of 3 kept.

Baseline (Simple Response)

| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 512 | 1,301,932 | 392µs | 2.00ms | 408 MiB |
| 4,096 | 1,371,836 | 2.99ms | 33.80ms | 922 MiB |
| 16,384 | 1,324,561 | 11.93ms | 60.10ms | 2.5 GiB |

Over 1.3 million requests per second on a simple response, with average latency staying sub-millisecond at 512 connections.

JSON Processing

| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 707,282 | 4.56ms | 17.20ms | 5.63 GB/s |
| 16,384 | 670,914 | 21.88ms | 67.70ms | 5.34 GB/s |

707K RPS while parsing and serializing JSON, pushing 5.6 GB/s of bandwidth. The orjson investment pays off massively here.

Static File Serving

| Connections | RPS | Avg Latency | Bandwidth |
|---|---|---|---|
| 4,096 | 724,526 | 3.00ms | 10.91 GB/s |

Nearly 11 GB/s of throughput from pre-loaded static files. Memory-resident serving eliminates disk I/O entirely.

Async Database (PostgreSQL)

| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 1,024 | 79,200 | 12.16ms | 31.10ms | 1.0 GiB |

Even with real PostgreSQL queries over the network, it sustains 79K RPS with reasonable latency.

Mixed Workload (Realistic Traffic)

| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 53,005 | 72.87ms | 658.60ms | 1.70 GB/s |
| 16,384 | 48,546 | 312.08ms | 2.09s | 1.56 GB/s |

The mixed test blends baseline, JSON, database, upload, and compression requests -- a more realistic workload. Still delivers 53K RPS.


How Does It Compare to Other Python Frameworks?

While exact apples-to-apples comparisons depend on the specific benchmark run, the architectural differences are telling:

| Aspect | FastPySGI-WSGI | FastAPI | Flask | Django |
|---|---|---|---|---|
| Server | Built-in (libuv) | Uvicorn (asyncio) | Gunicorn (prefork) | Gunicorn (prefork) |
| Event Loop | libuv (C) | uvloop (Cython over libuv) | None | None |
| Dependencies | 4 | 20+ | 8+ | 15+ |
| Routing | if/elif chain | Decorator + Starlette | Decorator + Werkzeug | URL patterns + ORM |
| JSON | orjson (Rust) | stdlib json | stdlib json | stdlib json |

The key insight: FastPySGI removes Python from the hot path of networking. The event loop, connection handling, and buffer management all happen in C (libuv). Python only runs for application logic -- routing, data processing, response building.


Lessons for Your Own Projects

You probably shouldn't rewrite your production Flask app as a raw WSGI handler. But there are transferable lessons:

1. Know Your Bottleneck

FastPySGI proves that Python application code isn't usually the bottleneck -- it's the layers between the OS and your code. If you're I/O bound, the event loop implementation matters more than your language choice.

2. Pre-compute What You Can

Pre-loading static files, pre-compressing responses, and pre-serializing datasets at startup are techniques that work in any framework. If data doesn't change per-request, don't process it per-request.

3. Choose Your Serializer Wisely

Swapping json for orjson is a one-line change in most Python projects and can yield 3-10x faster serialization. For API-heavy services, this is low-hanging fruit.
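In practice the swap is near drop-in rather than literally one line, because orjson.dumps returns bytes where json.dumps returns str. A small shim (a sketch) keeps call sites identical and degrades gracefully when orjson isn't installed:

```python
try:
    import orjson

    def dumps(obj) -> bytes:
        return orjson.dumps(obj)
except ImportError:
    import json

    def dumps(obj) -> bytes:
        # Compact separators roughly match orjson's output style.
        return json.dumps(obj, separators=(",", ":")).encode()
```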

4. Tune Your Server Parameters

Most developers never touch socket backlog, buffer sizes, or connection pool bounds. The defaults are conservative. If you know your traffic patterns, tuning these can unlock significant performance.
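At the raw-socket level, the knobs in question look like this (a sketch; the values echo the article's, and on Linux the effective limits are also capped by kernel settings such as net.core.somaxconn and net.core.rmem_max):

```python
import socket

def make_listener(host="127.0.0.1", port=0, backlog=16 * 1024):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Request a larger kernel receive buffer (the kernel may clamp it).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256_000)
    s.bind((host, port))
    # backlog caps pending, not-yet-accepted connections.
    s.listen(backlog)
    return s
```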

5. Fewer Dependencies = Fewer Layers

Every middleware, every abstraction, every framework feature adds overhead. When performance matters, audit your dependency tree and question whether each layer is earning its keep.


Conclusion

FastPySGI-WSGI demonstrates that Python can compete at the highest levels of HTTP performance when you strip away the abstractions and let C do what C does best. By building on libuv, minimizing dependencies, and making smart caching decisions, it achieves numbers that most developers would associate with Rust or Go.

The HttpArena project (https://www.http-arena.com/) provides a fascinating lens into how different frameworks and languages approach the same problems. FastPySGI-WSGI stands out not because it reinvents Python, but because it strategically removes Python from the parts of the stack where it's slowest.

Whether you're building the next high-performance Python server or just optimizing your existing API, the principles behind FastPySGI's design are worth studying.


All benchmark data from HttpArena, run on dedicated 64-core hardware with standardized Docker containers. Results reflect framework performance under controlled conditions.

Check out the HttpArena repository on GitHub to explore how 78+ frameworks compare: https://github.com/MDA2AV/HttpArena
