Introduction
When most developers think of Python web performance, they think "slow." Frameworks like Flask and Django are beloved for developer experience, but rarely win benchmarking contests. FastPySGI-WSGI challenges that assumption entirely.
In the HttpArena benchmark suite -- a standardized HTTP framework benchmark platform running on dedicated 64-core hardware with 18 test profiles -- FastPySGI-WSGI delivers numbers that rival Rust and Go implementations. We're talking 1.3 million RPS on baseline tests and 707K RPS while processing JSON.
Let's break down how it works, why it's fast, and what lessons we can take away.
What Is FastPySGI?
FastPySGI is an ultra-fast WSGI/ASGI server for Python built on top of libuv -- the same C-based event loop that powers Node.js. Unlike traditional Python servers (Gunicorn, Uvicorn), FastPySGI bypasses Python's asyncio entirely and handles networking at the C level.
The "WSGI" variant specifically uses the standard WSGI interface, meaning it's synchronous Python code running on an asynchronous C event loop. This is a critical architectural choice: you get libuv's raw networking speed without requiring async/await in your application code.
Repository: https://github.com/remittor/fastpysgi
The Architecture: Fewer Layers, More Speed
Minimal Dependencies
The entire dependency list fits on a sticky note:
```text
fastpysgi==0.4
orjson==3.10.15
psycopg[binary]==3.2.4
psycopg_pool==3.2.6
```
Four packages. Compare that to FastAPI's 20+ transitive dependencies or Django's sprawling ecosystem. Every layer you remove is latency you eliminate.
Single-File Application
The entire benchmark implementation is a single 349-line Python file. No framework overhead, no middleware chains, no decorator magic. Just a WSGI callable:
```python
def app(env, start_response):
    method = env["REQUEST_METHOD"]
    path = env["PATH_INFO"]

    if method not in ("GET", "POST"):
        return respond_405(start_response)

    if path == "/pipeline":
        return respond_ok(start_response)
    elif path == "/baseline11":
        return handle_baseline(env, start_response)
    # ... more routes
```
Routing is a simple if/elif chain. No regex compilation, no route tree traversal, no parameter extraction framework. For a benchmark, this is the right call -- every nanosecond in routing overhead gets multiplied by millions of requests.
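For contrast, here is a hedged sketch of the dict-dispatch alternative the author chose not to use. The handler names and bodies are illustrative, not from the benchmark source:

```python
# Hypothetical dict-dispatch variant of the same routing. A dict lookup
# is O(1) in the number of routes, while an if/elif chain is O(n) --
# though for a handful of routes the chain can win by skipping the hash.
def respond_ok(start_response):
    body = b"OK"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

def respond_404(start_response):
    body = b"Not Found"
    start_response("404 Not Found", [("Content-Type", "text/plain"),
                                     ("Content-Length", str(len(body)))])
    return [body]

ROUTES = {
    "/pipeline": respond_ok,
    # ... map each remaining path to its handler
}

def app(env, start_response):
    return ROUTES.get(env["PATH_INFO"], respond_404)(start_response)
```

With only a dozen-odd routes and millions of requests per second, the if/elif chain is a defensible micro-optimization; the dict wins as route counts grow.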
Multi-Worker Model
FastPySGI spawns one worker per available CPU core:
```python
WRK_COUNT = min(len(os.sched_getaffinity(0)), 128)
```
Each worker runs its own libuv event loop, and the OS distributes connections across them. On the benchmark's 64-core machine, that's 64 workers hammering through requests in parallel.
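The OS-level mechanism behind that distribution can be sketched in plain Python: with SO_REUSEPORT, each worker binds its own listening socket on the same port and the Linux kernel load-balances accepted connections across them. This is an illustrative, Linux-only sketch, not FastPySGI's actual C implementation:

```python
import os
import socket

# Linux-only sketch of the prefork mechanism: every worker opens its
# own listening socket on the same port with SO_REUSEPORT, and the
# kernel spreads incoming connections across them. FastPySGI's real
# worker setup lives in C/libuv; this only shows the OS side.

WRK_COUNT = min(len(os.sched_getaffinity(0)), 128)

def make_listener(port: int = 8000) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", port))
    sock.listen(16 * 1024)  # matches the benchmark's 16K backlog
    return sock

# Each of WRK_COUNT forked or spawned workers would call
# make_listener() and then run its own accept loop.
```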
Performance-Critical Design Choices
1. Pre-Loaded Static Files
Static files aren't read from disk on each request. They're loaded entirely into memory at startup:
STATIC_DIR = "/data/static"
static_files = {}
for fname in os.listdir(STATIC_DIR):
fpath = os.path.join(STATIC_DIR, fname)
with open(fpath, "rb") as f:
content = f.read()
mime = get_mime_type(fname)
static_files["/" + fname] = (content, mime)
When a request hits /static/main.css, serving it is a dictionary lookup that returns a reference to bytes already resident in memory. No disk I/O, no filesystem syscalls.
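A minimal sketch of that request-time path; the sample entry and handler name are illustrative, not from the benchmark source:

```python
# Illustrative sketch of memory-resident static serving: one dict
# lookup, then a reference to an already-built bytes object.
static_files = {
    "/main.css": (b"body { margin: 0 }", "text/css"),
}

def serve_static(path, start_response):
    entry = static_files.get(path)
    if entry is None:
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]
    content, mime = entry
    start_response("200 OK", [("Content-Type", mime),
                              ("Content-Length", str(len(content)))])
    return [content]  # the bytes never touch the disk at request time
```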
2. Fast JSON with orjson
The standard library json module, even with its C accelerator, is far slower than modern alternatives. FastPySGI uses orjson, a Rust-based JSON serializer that's typically 3-10x faster:
```python
import orjson

body = orjson.dumps(result)
```
For the JSON benchmark profile, this choice alone could account for hundreds of thousands of additional RPS.
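You can check that speedup claim against your own payloads with a quick timeit comparison; this sketch degrades gracefully when orjson isn't installed:

```python
import json
import timeit

# Micro-benchmark sketch: serialize a small payload 10,000 times with
# the stdlib, then with orjson when available. Exact ratios vary with
# payload shape; the 3-10x figure is a rough range, not a guarantee.
data = {"id": 1, "name": "widget", "tags": ["a", "b", "c"], "active": True}

stdlib_time = timeit.timeit(lambda: json.dumps(data), number=10_000)

try:
    import orjson  # Rust-based; note it returns bytes, not str
    orjson_time = timeit.timeit(lambda: orjson.dumps(data), number=10_000)
    print(f"stdlib/orjson speed ratio: {stdlib_time / orjson_time:.1f}x")
except ImportError:
    print("orjson not installed; stdlib time:", stdlib_time)
```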
3. Pre-Compressed Responses
For the compression test, the large JSON dataset is compressed once at startup, not on every request:
```python
large_buf = orjson.dumps(large_dataset)
compressed = zlib.compress(large_buf, level=1)
```
Every request gets the cached compressed buffer. No CPU burned on repeated compression.
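Here is a self-contained sketch of the pattern, with stdlib json standing in for orjson and an illustrative dataset. One detail worth noting: zlib.compress produces the zlib ("deflate") wire format, so the cached response needs a matching Content-Encoding header:

```python
import json
import zlib

# Startup: build and compress the payload exactly once.
# (large_dataset is an illustrative stand-in for the benchmark's data.)
large_dataset = [{"id": i, "value": i * i} for i in range(1000)]
large_buf = json.dumps(large_dataset).encode()
COMPRESSED = zlib.compress(large_buf, level=1)  # zlib ("deflate") format

HEADERS = [
    ("Content-Type", "application/json"),
    ("Content-Encoding", "deflate"),  # must match the compression format
    ("Content-Length", str(len(COMPRESSED))),
]

def serve_compressed(start_response):
    # Request time: hand back the cached buffer; no per-request CPU.
    start_response("200 OK", HEADERS)
    return [COMPRESSED]
```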
4. Tuned Server Parameters
Socket backlog and read buffers are explicitly tuned for high throughput:
```python
fastpysgi.server.backlog = 16 * 1024          # 16K pending connections
fastpysgi.server.read_buffer_size = 256000    # 256 KB read buffer
```
These aren't arbitrary numbers -- they're sized for the benchmark's connection patterns (up to 16,384 concurrent connections).
5. Thread-Local SQLite with MMAP
Database tests use thread-local SQLite connections with memory-mapped I/O:
```python
db_local = threading.local()

def get_db():
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect("/data/benchmark.db")
        conn.execute(f"PRAGMA mmap_size={256 * 1024 * 1024}")  # 256 MiB
        db_local.conn = conn
    return db_local.conn
```
MMAP lets SQLite read database pages directly out of the OS page cache via memory mapping, skipping read() syscalls and buffer copies and significantly reducing query latency.
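A usage sketch of the thread-local pattern; the ":memory:" path, the users table, and the query are illustrative stand-ins for the benchmark's /data/benchmark.db and its schema:

```python
import sqlite3
import threading

# Each worker thread lazily opens its own connection, so there is no
# lock contention on a shared connection object.
DB_PATH = ":memory:"  # illustrative; the benchmark uses /data/benchmark.db
db_local = threading.local()

def get_db() -> sqlite3.Connection:
    if not hasattr(db_local, "conn"):
        conn = sqlite3.connect(DB_PATH)
        conn.execute("PRAGMA mmap_size=268435456")  # 256 MiB; no-op for :memory:
        db_local.conn = conn
    return db_local.conn

def fetch_user(user_id: int):
    # Subsequent calls on the same thread reuse the cached connection.
    return get_db().execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
```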
6. PostgreSQL Connection Pooling
For async database tests, a bounded connection pool prevents connection storm overhead:
```python
from psycopg_pool import ConnectionPool

pool = ConnectionPool(
    conninfo="host=...",
    min_size=2,
    max_size=3,
)
```
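The mechanism is worth seeing in miniature. This generic, illustrative pool (not psycopg_pool's implementation) shows why bounding helps: connections are created once and recycled, so heavy traffic reuses at most max_size sockets instead of opening one per request:

```python
import queue

# Generic bounded pool sketch. Callers block when every connection is
# busy, rather than opening new ones -- that back-pressure is what
# prevents a connection storm against the database.
class BoundedPool:
    def __init__(self, factory, max_size: int):
        self._q = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._q.put(factory())  # eager creation, like min_size == max_size

    def acquire(self):
        return self._q.get()  # blocks until a connection is free

    def release(self, conn) -> None:
        self._q.put(conn)
```

psycopg_pool packages the same idea behind `with pool.connection() as conn:`, which checks a connection out and returns it to the pool automatically.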
The Benchmark Numbers
All benchmarks run on identical 64-core dedicated hardware via Docker containers, using h2load as the load generator with 64 threads. Duration: 5 seconds per run, best of 3 kept.
Baseline (Simple Response)
| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 512 | 1,301,932 | 392us | 2.00ms | 408 MiB |
| 4,096 | 1,371,836 | 2.99ms | 33.80ms | 922 MiB |
| 16,384 | 1,324,561 | 11.93ms | 60.10ms | 2.5 GiB |
Over 1.3 million requests per second on a simple response. Latency stays sub-millisecond at 512 connections.
JSON Processing
| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 707,282 | 4.56ms | 17.20ms | 5.63 GB/s |
| 16,384 | 670,914 | 21.88ms | 67.70ms | 5.34 GB/s |
707K RPS while parsing and serializing JSON, pushing 5.6 GB/s of bandwidth. The orjson investment pays off massively here.
Static File Serving
| Connections | RPS | Avg Latency | Bandwidth |
|---|---|---|---|
| 4,096 | 724,526 | 3.00ms | 10.91 GB/s |
Nearly 11 GB/s of throughput from pre-loaded static files. Memory-resident serving eliminates disk I/O entirely.
Async Database (PostgreSQL)
| Connections | RPS | Avg Latency | P99 Latency | Memory |
|---|---|---|---|---|
| 1,024 | 79,200 | 12.16ms | 31.10ms | 1.0 GiB |
Even with real PostgreSQL queries over the network, it sustains 79K RPS with reasonable latency.
Mixed Workload (Realistic Traffic)
| Connections | RPS | Avg Latency | P99 Latency | Bandwidth |
|---|---|---|---|---|
| 4,096 | 53,005 | 72.87ms | 658.60ms | 1.70 GB/s |
| 16,384 | 48,546 | 312.08ms | 2.09s | 1.56 GB/s |
The mixed test blends baseline, JSON, database, upload, and compression requests -- a more realistic workload. Still delivers 53K RPS.
How Does It Compare to Other Python Frameworks?
While exact apples-to-apples comparisons depend on the specific benchmark run, the architectural differences are telling:
| Aspect | FastPySGI-WSGI | FastAPI | Flask | Django |
|---|---|---|---|---|
| Server | Built-in (libuv) | Uvicorn (asyncio) | Gunicorn (prefork) | Gunicorn (prefork) |
| Event Loop | libuv (C) | asyncio / uvloop (Cython + libuv) | None | None |
| Dependencies | 4 | 20+ | 8+ | 15+ |
| Routing | if/elif chain | Decorator + Starlette | Decorator + Werkzeug | URL patterns + ORM |
| JSON | orjson (Rust) | stdlib json | stdlib json | stdlib json |
The key insight: FastPySGI removes Python from the hot path of networking. The event loop, connection handling, and buffer management all happen in C (libuv). Python only runs for application logic -- routing, data processing, response building.
Lessons for Your Own Projects
You probably shouldn't rewrite your production Flask app as a raw WSGI handler. But there are transferable lessons:
1. Know Your Bottleneck
FastPySGI proves that Python application code isn't usually the bottleneck -- it's the layers between the OS and your code. If you're I/O bound, the event loop implementation matters more than your language choice.
2. Pre-compute What You Can
Pre-loading static files, pre-compressing responses, and pre-serializing datasets at startup are techniques that work in any framework. If data doesn't change per-request, don't process it per-request.
3. Choose Your Serializer Wisely
Swapping json for orjson is a one-line change in most Python projects and can yield 3-10x faster serialization. For API-heavy services, this is low-hanging fruit.
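One hedged way to make the swap without touching every call site is a small shim that prefers orjson and falls back to the stdlib. The name json_dumps is illustrative; note that orjson returns bytes:

```python
# Drop-in shim: use orjson when installed, stdlib json otherwise.
# Both branches return compact bytes so callers see one interface.
try:
    import orjson

    def json_dumps(obj) -> bytes:
        return orjson.dumps(obj)
except ImportError:
    import json

    def json_dumps(obj) -> bytes:
        return json.dumps(obj, separators=(",", ":")).encode()
```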
4. Tune Your Server Parameters
Most developers never touch socket backlog, buffer sizes, or connection pool bounds. The defaults are conservative. If you know your traffic patterns, tuning these can unlock significant performance.
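As an illustration, Gunicorn exposes these knobs in an ordinary config file. The values below are example starting points to adapt to your own traffic, not recommendations:

```python
# gunicorn.conf.py -- illustrative values, sized for discussion only.
# All four are documented Gunicorn settings.
backlog = 16 * 1024        # pending-connection queue (Gunicorn default: 2048)
workers = 8                # commonly tied to CPU core count
worker_connections = 1000  # per-worker cap for async worker classes
keepalive = 5              # seconds to hold idle keep-alive connections
```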
5. Fewer Dependencies = Fewer Layers
Every middleware, every abstraction, every framework feature adds overhead. When performance matters, audit your dependency tree and question whether each layer is earning its keep.
Conclusion
FastPySGI-WSGI demonstrates that Python can compete at the highest levels of HTTP performance when you strip away the abstractions and let C do what C does best. By building on libuv, minimizing dependencies, and making smart caching decisions, it achieves numbers that most developers would associate with Rust or Go.
The HttpArena project (https://www.http-arena.com/) provides a fascinating lens into how different frameworks and languages approach the same problems. FastPySGI-WSGI stands out not because it reinvents Python, but because it strategically removes Python from the parts of the stack where it's slowest.
Whether you're building the next high-performance Python server or just optimizing your existing API, the principles behind FastPySGI's design are worth studying.
All benchmark data from HttpArena, run on dedicated 64-core hardware with standardized Docker containers. Results reflect framework performance under controlled conditions.
Check out the HttpArena repository on GitHub to explore how 78+ frameworks compare: https://github.com/MDA2AV/HttpArena