Over the last few days I’ve been stress-testing PAGI::Server, an async HTTP server built on IO::Async, Future::IO, and an EV event loop. The goal wasn’t to chase synthetic “requests per second” numbers, but to answer harder questions:
- Does streaming actually behave correctly?
- What happens under sustained concurrency?
- Do slow clients cause buffer bloat?
- Does backpressure really work?
- How does it fail under overload?
This post summarizes what I learned by pushing the server well past “normal” workloads and watching it misbehave — or not.
Architecture in one paragraph
PAGI::Server is:
- single-process, event-driven by default
- optionally multi-process via workers
- fully async end-to-end (no threads)
- streaming-first (responses are not buffered by default)
- designed to support ASGI-like semantics in Perl
The server accepts connections in a main event loop and routes work through IO::Async::Stream objects. Application code produces responses via a $send callback that emits protocol events (http.response.start, http.response.body, etc.).
That last part matters, because it’s where most async servers quietly cheat.
An important note: although PAGI::Server is built on IO::Async, that is an implementation detail, not a core requirement of the PAGI specification. That said, I built PAGI::Server on IO::Async because it is mature, battle-tested, and fully featured.
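For readers who haven't seen one, here is a rough sketch of the shape of a PAGI application. The ASGI-like ($scope, $receive, $send) signature, the header layout, and the assumption that $send returns a Future are my reading of the description above, not an excerpt from the spec:

```perl
# Rough sketch of a minimal PAGI-style app (signature and keys assumed, ASGI-like).
use strict;
use warnings;
use Future::AsyncAwait;

my $app = async sub {
    my ($scope, $receive, $send) = @_;

    # Describe the response head as a protocol event.
    await $send->({
        type    => 'http.response.start',
        status  => 200,
        headers => [ [ 'content-type' => 'text/plain' ] ],
    });

    # Emit the body; more_body => 0 signals the end of the response.
    await $send->({
        type      => 'http.response.body',
        body      => "hello\n",
        more_body => 0,
    });
};
```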
Baseline: fast responses
Before touching streaming, I tested a trivial HTTP handler.
On an older 16-core MacBook Pro:
- ~22–23k requests/sec
- p50 latency ~20ms
- p99 < 50ms
- CPU mostly idle
Nothing surprising here — just confirming there’s no accidental serialization or blocking in the accept loop.
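For reference, a baseline run like this can be driven with a load generator such as hey (the same tool used later in the post). The concurrency, duration, and URL below are illustrative, not the exact settings from my runs:

hey -z 30s -c 100 http://localhost:5000/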
To set a comparison point, I also tested a similar, simple PSGI application under Starman, one of the most popular choices for serving PSGI apps. It topped out at around 16k requests/sec, with a significantly wider latency spread and dropped connections.
Streaming test: 2-second responses, 500 concurrent clients
Next, I switched to a streaming handler that sends chunks over ~2 seconds.
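The handler was along these lines: a loop of body events with a short pause between chunks, stretched over roughly two seconds. This is a sketch, not the exact test code; the chunk count, delay, and the use of Future::IO->sleep are stand-ins (and under PAGI::Server the IO::Async implementation of Future::IO is assumed to be loaded):

```perl
# Sketch of the streaming handler: ~2 seconds of chunked output per request.
use strict;
use warnings;
use Future::AsyncAwait;
use Future::IO;

my $streaming_app = async sub {
    my ($scope, $receive, $send) = @_;

    await $send->({
        type    => 'http.response.start',
        status  => 200,
        headers => [ [ 'content-type' => 'text/plain' ] ],
    });

    # 20 chunks x 100ms ≈ 2 seconds of streaming.
    for my $i (1 .. 20) {
        await $send->({
            type      => 'http.response.body',
            body      => "chunk $i\n",
            more_body => 1,
        });
        await Future::IO->sleep(0.1);
    }

    # Final empty body event closes the response.
    await $send->({
        type      => 'http.response.body',
        body      => '',
        more_body => 0,
    });
};
```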
With 500 concurrent connections, single process:
- ~242–245 requests/sec
- p50 ≈ 2.00s
- p99 ≈ 2.3–2.4s
- CPU ~10–20% busy
- memory stable
This matters because the math checks out:
500 concurrent / 2 seconds ≈ 250 rps theoretical max
Hitting ~97–98% of theoretical max strongly suggests:
- no head-of-line blocking
- no per-connection threads
- no accidental buffering
The server is doing what an async server should do: waiting, not working.
A gotcha: TTY logging will ruin your benchmark
One early run looked worse than expected. The cause was embarrassingly simple:
Access logging was printing to a terminal (TTY).
At ~20k+ log lines/sec, the terminal became the bottleneck.
Once logging was disabled or redirected:
- CPU usage dropped dramatically
- latency tails tightened
- throughput returned to theoretical limits
Lesson: never benchmark with synchronous TTY logging enabled.
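In practice that means disabling access logging for the run, or redirecting it away from the terminal, for example (the launch command here is a placeholder for however you start your server):

perl my-server.pl > access.log 2>&1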
Overload behavior: what happens at 2500 concurrent streams?
With ulimit -n = 2048, single-process PAGI::Server started rejecting connections around ~2000 open sockets.
That rejection path returned:
503 Service Unavailable
Retry-After: 5
Connection: close
This is good behavior.
Instead of accepting everything and melting down, the server applied admission control based on available file descriptors.
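I'm not reproducing the server's actual implementation here, but the general shape of that kind of admission control looks something like the sketch below. The connection limit, response text, and bookkeeping variable are all illustrative:

```perl
# Illustrative sketch of admission control on accept (not the actual
# PAGI::Server code): reject with 503 + Retry-After once we approach
# the file-descriptor limit, instead of accepting and melting down.
use strict;
use warnings;

my $max_connections = 2000;   # derived from ulimit -n minus headroom (assumed)
my $active          = 0;

sub on_accept {
    my ($socket) = @_;

    if ($active >= $max_connections) {
        # Fail fast: tell the client to back off, then close the socket.
        $socket->syswrite(
            "HTTP/1.1 503 Service Unavailable\r\n"
          . "Retry-After: 5\r\n"
          . "Connection: close\r\n"
          . "Content-Length: 0\r\n\r\n"
        );
        $socket->close;
        return;
    }

    $active++;
    # ... hand the socket to an IO::Async::Stream and decrement
    # $active when that stream is eventually closed ...
}
```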
The client (hey) complained with “unsolicited response” warnings — a known artifact when clients aggressively reuse keep-alive connections under churn. The server was behaving correctly.
Scaling out: 16 workers
With 16 worker processes, the same workload:
- sustained ~1200 requests/sec at 2500 concurrency
- p50 ≈ 2.00s
- p99 ≈ 3.0s
- all responses returned HTTP 200, no rejections
Again, the math checks:
2500 concurrent / 2 seconds ≈ 1250 rps theoretical
Result: ~96% of theoretical max.
This validated that PAGI’s worker model increases capacity cleanly without breaking streaming semantics.
The real test: slow clients and backpressure
Throughput benchmarks are easy. Backpressure is not.
To test this, I used a stress app that streams tens of megabytes as fast as possible, then ran clients with artificial throttling.
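The stress endpoint is essentially the streaming handler from earlier with the delay removed and much larger chunks. Again a sketch: the chunk size and count are hard-coded for illustration, whereas the real /stream/50 path presumably selects the size:

```perl
# Sketch of the stress endpoint: blast ~50 MB as fast as the loop allows.
use strict;
use warnings;
use Future::AsyncAwait;

my $blast_app = async sub {
    my ($scope, $receive, $send) = @_;

    await $send->({
        type    => 'http.response.start',
        status  => 200,
        headers => [ [ 'content-type' => 'application/octet-stream' ] ],
    });

    my $chunk = 'x' x 65536;              # 64 KB per write
    for (1 .. 800) {                      # 800 x 64 KB = 50 MB total
        await $send->({ type => 'http.response.body', body => $chunk, more_body => 1 });
    }
    await $send->({ type => 'http.response.body', body => '', more_body => 0 });
};
```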
Slow client test
curl --limit-rate 1M http://localhost:5000/stream/50
Observed:
- ~50MB transferred in ~49 seconds
- download rate stayed near 1MB/s
- server did not buffer the entire response
- no memory blow-up
This is the critical signal.
If backpressure were broken, the server would buffer 50MB instantly and curl would “catch up” later. That did not happen.
High-throughput clients
Unthrottled clients on localhost achieved:
- ~75–80 MB/s per connection
- linear scaling until loopback bandwidth became the limit
Again, no red flags.
Concurrency + large bodies
Using hey with 50 concurrent clients streaming 10–50MB bodies:
- high aggregate throughput
- wide latency distribution
- no errors
- no collapse
At this point, the bottleneck was clearly:
- kernel socket buffers
- memory copy bandwidth
- client tooling limits
Not the server itself.
The subtle risk we identified (and addressed)
While reviewing the server code, one important detail surfaced:
- writes use IO::Async::Stream->write
- writes enqueue data into an outgoing buffer
- there is no implicit send-side backpressure unless implemented explicitly
This is fine for polite producers (like most apps), but dangerous if:
- an app writes huge chunks rapidly
- the client reads slowly
To address this, we added send-side backpressure:
- track queued outgoing bytes
- define high / low watermarks
- pause $send->(...) futures when buffers exceed the high watermark
- resume when the buffer drains
This change makes slow-client behavior safe by default.
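For the curious, the idea is simple even if the real code differs in detail. The sketch below shows the shape of it; the watermark values and bookkeeping are illustrative, while IO::Async::Stream->write returning a flush Future is real behaviour of that module:

```perl
# Sketch of send-side backpressure with high/low watermarks.
use strict;
use warnings;
use Future;

my $HIGH_WATERMARK = 1 * 1024 * 1024;   # pause producers above 1 MB queued
my $LOW_WATERMARK  = 256 * 1024;        # resume once drained below 256 KB

my $queued  = 0;      # bytes handed to the stream but not yet flushed
my @waiters;          # pending futures for paused $send calls

sub send_body {
    my ($stream, $data) = @_;

    $queued += length $data;

    # IO::Async::Stream->write returns a Future that completes once the
    # data has actually been flushed; use it to track drain progress.
    $stream->write($data)->on_done(sub {
        $queued -= length $data;
        if ($queued <= $LOW_WATERMARK and @waiters) {
            # Buffer drained: let paused producers continue.
            $_->done for splice @waiters;
        }
    });

    # Below the high watermark the producer may continue immediately;
    # above it, hand back a pending future so the app awaits the drain.
    return Future->done if $queued < $HIGH_WATERMARK;

    push @waiters, my $f = Future->new;
    return $f;
}
```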
What I did not see (important)
Across all tests, I did not observe:
- unbounded RSS growth
- runaway CPU
- increasing latency under steady load
- dropped connections without errors
- event-loop starvation
Those are the usual failure modes of async servers. None appeared.
Final assessment
Based on these tests:
- PAGI::Server handles streaming correctly
- backpressure works (and is now explicit)
- overload fails fast and cleanly
- worker scaling behaves predictably
- performance matches theoretical limits
There’s still more to do — especially long-duration soak tests and production-grade observability — but the fundamentals are solid.
Most importantly, the server behaves honestly. It doesn’t fake async by buffering everything and hoping for the best.