Writing an HTTP Load Tester That Doesn't Lie About p99
http-bench: a small Rust CLI that fires HTTP requests at a target for a fixed duration or count, reports RPS, latency percentiles, and error breakdown. About 900 lines, five dependencies, 11.6 MB container.
There are already several good HTTP load testers: wrk, hey, oha, vegeta, bombardier. I built another one anyway, for reasons that will hopefully be interesting, and for one reason that is maybe just personal taste.
📦 GitHub: https://github.com/sen-ltd/http-bench
The gap I kept hitting
My "correct" answer for load testing is still wrk. It's fast, the LuaJIT scripting hook is very powerful, and it has earned the right to be the default. The problem is that it's written in C and is surprisingly painful to get onto a fresh machine: a brand-new Alpine container, a freshly reinstalled macOS, a Debian box I SSH'd into and don't own. I want a load tester I can docker run in under ten seconds.
hey solves the "easy to install" problem beautifully (it's a Go binary you can drop anywhere), but by default it does not report p99 latency, only average and slowest. For anything user-facing, the average is a lie and the max is a single outlier; I want the middle-of-the-long-tail number.
oha is actually very close to the thing I want. It's Rust, it uses HDR histograms, it's easy to install, and its output is rich. But it's also a full-screen TUI, and when I'm running a bench from tmux and then grepping the output, I don't want a TUI. I want a plain text dump or a single JSON blob.
So the thing I kept reaching for but couldn't find was:
- A binary I can get into a container in one line.
- Correct percentiles, not just averages.
- Plain text or JSON on stdout, nothing fancy.
- Small enough to read in one sitting, so I can modify it when I need to.
That's http-bench. Five dependencies (clap, tokio, reqwest with rustls, hdrhistogram, humantime), about 900 lines across five source files, 11.6 MB multi-stage Alpine image.
Why HDR histograms are not optional
The first thing I want to talk about is the one decision that moves this tool from "naive" to "actually correct": using an HDR histogram for latency instead of a Vec<Duration>.
Here's the naive approach to percentiles. Record every latency sample into a vector. At the end, sort the vector and pick `samples[(samples.len() as f64 * 0.99) as usize]` for p99. This works fine for small runs but has two problems at load-tester scale:
- Memory grows linearly with request count. At 50,000 RPS for 60 seconds that's three million `Duration`s, which isn't catastrophic but isn't free.
- The final sort is not free either. You pay O(n log n) for the quantile query, which would be annoying if you wanted multiple quantiles or live readouts.
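The naive version is worth seeing once, if only to appreciate what the histogram replaces. A minimal sketch (hypothetical helper, not from http-bench), using the nearest-rank definition of a quantile:

```rust
// Naive quantile: store everything, sort at the end, index in.
// O(n) memory, plus an O(n log n) sort every time you want a report.
fn naive_quantile(samples: &mut [u64], q: f64) -> u64 {
    samples.sort_unstable();
    // Nearest-rank: ceil(n * q) is the 1-based rank of the answer.
    let rank = ((samples.len() as f64) * q).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}
```

Fine for a one-off script; not fine inside a tool that wants live readouts or multiple quantiles per run.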
The HDR (High Dynamic Range) histogram trades a tiny amount of precision for constant-time record and query. It buckets values logarithmically, so it uses roughly the same amount of memory to represent a billion samples or a hundred, and you can ask for any quantile in O(bucket_count), which is basically instant.
The "tiny amount of precision" is configurable. In http-bench I set 3 significant figures, which means any value returned by a quantile query is within 0.1% of the real sample that landed in that bucket. That's way more precision than you actually need when measuring HTTP latency over an unreliable network.
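To build intuition for what "3 significant figures" buys, here is a toy quantizer (my own illustration, not the crate's actual bucketing scheme) that keeps only the top 11 bits of a value, which bounds the relative error below 1/1024, i.e. under 0.1%:

```rust
// Toy logarithmic quantizer: keep the top 11 bits (2^11 = 2048 > 10^3,
// enough resolution for 3 significant decimal figures) and zero the rest.
// Values below 2^11 are stored exactly. HdrHistogram's real scheme is
// similar in spirit but stores counts per bucket rather than values.
fn quantize_3sf(v: u64) -> u64 {
    if v < 2048 {
        return v; // small values fit exactly
    }
    let shift = 64 - 11 - v.leading_zeros();
    (v >> shift) << shift
}
```

The dropped low bits are strictly less than v / 1024, so every stored value sits within 0.1% of the original sample, and the number of distinct buckets stays bounded no matter how many samples you record.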
Here's what the wrapper looks like:
```rust
pub struct LatencyHistogram {
    h: Histogram<u64>,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        // 1 µs .. 60 s, 3 significant figures.
        let h = Histogram::<u64>::new_with_bounds(1, 60_000_000, 3)
            .expect("bounds/sigfig are valid constants");
        Self { h }
    }

    pub fn record(&mut self, d: Duration) {
        let us = d.as_micros().min(u64::MAX as u128) as u64;
        let us = us.clamp(1, 60_000_000);
        let _ = self.h.record(us);
    }

    pub fn quantile(&self, q: f64) -> Duration {
        if self.h.is_empty() {
            return Duration::ZERO;
        }
        Duration::from_micros(self.h.value_at_quantile(q))
    }

    pub fn merge(&mut self, other: &LatencyHistogram) {
        self.h.add(&other.h)
            .expect("merging two histograms with identical bounds cannot fail");
    }
}
```
The merge method is important because of how I structured the worker pool. Each worker owns its own private histogram and records into it without any locking. At the end of the run I fold them all into one global histogram via merge. Zero contention during the hot loop, correct aggregation at the end.
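The shape of that pattern, stripped of tokio and the real histogram, is just "private accumulator per worker, ship it once, fold at the end." A std-thread sketch with a stand-in four-bucket histogram (illustration only, not http-bench code):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the per-worker latency histogram: four fixed buckets.
#[derive(Default)]
struct Hist {
    buckets: [u64; 4],
}

impl Hist {
    fn record(&mut self, b: usize) {
        self.buckets[b] += 1;
    }
    fn merge(&mut self, other: &Hist) {
        for (a, b) in self.buckets.iter_mut().zip(other.buckets) {
            *a += b;
        }
    }
}

fn run_workers(n: usize) -> Hist {
    let (tx, rx) = mpsc::channel();
    for w in 0..n {
        let tx = tx.clone();
        thread::spawn(move || {
            let mut local = Hist::default(); // private: no locks in the hot loop
            for i in 0..100 {
                local.record((w + i) % 4);
            }
            tx.send(local).unwrap(); // ship the whole histogram once, at the end
        });
    }
    drop(tx); // so rx.iter() terminates once every worker has finished
    rx.iter().fold(Hist::default(), |mut acc, h| {
        acc.merge(&h);
        acc
    })
}
```

The real code does the same fold with tokio tasks and LatencyHistogram::merge; the zero-contention property is identical.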
Workers as tasks, not threads
The second decision worth explaining is that "concurrency" here means tokio tasks, not OS threads. When you pass --concurrency 1000, you get a thousand tokio tasks sharing the multi-thread runtime, not a thousand OS threads. A tokio task on the multi-thread runtime is something like a few kilobytes of heap; an OS thread is something like a megabyte of stack plus a scheduler slot. The task version scales to concurrency levels that would tip over a thread-per-worker design.
The hot loop in each task is basically:
```rust
loop {
    if stop.load(Ordering::Relaxed) {
        break;
    }
    let prev = remaining.fetch_sub(1, Ordering::Relaxed);
    if prev == 0 {
        remaining.fetch_add(1, Ordering::Relaxed);
        break;
    }
    let outcome = fire(&client, &cfg).await;
    state.record(outcome);
}
```
Two stop conditions. In duration mode an AtomicBool gets flipped to true by a sleeper task when the wall-clock deadline passes. In request-count mode an AtomicU64 counts down, and whichever worker hits zero first wins. The fetch_add(1) after we overshoot is a small bit of honesty: the counter reports what was actually fired, not what was requested.
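One subtlety with the count-down: fetch_sub on a counter that is already at zero wraps around, which opens a brief window where another worker can read a huge value, treat it as nonzero, and fire an extra request. If that bothers you, a compare-exchange loop closes the window. A sketch of the claim step (a hypothetical variant, not the shipped code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Claim one request slot, or report that the budget is exhausted.
// Unlike fetch_sub, this never drives the counter below zero, so no
// worker can observe a wrapped value and over-fire.
fn claim(remaining: &AtomicU64) -> bool {
    let mut cur = remaining.load(Ordering::Relaxed);
    loop {
        if cur == 0 {
            return false;
        }
        match remaining.compare_exchange_weak(cur, cur - 1, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return true,
            Err(seen) => cur = seen, // lost the race; retry with the fresh value
        }
    }
}
```

For a load tester the occasional extra request is usually harmless, which is a fair argument for keeping the simpler fetch_sub version.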
Each worker uses a shared reqwest::Client, which is cheap to clone because it's internally reference-counted. The client is configured with pool_max_idle_per_host(concurrency), which is roughly the right number for a steady-state keepalive scenario: one pooled connection per worker, reused for as long as the server is willing.
The coordinated omission problem (being honest)
Now for the part where I have to be honest, because pretending to solve this problem is how load testers lie.
Coordinated omission is the failure mode where the time a load tester spends stalled behind a slow response is quietly dropped from the latency distribution. The canonical scenario: you're trying to send 1000 requests per second from a single worker, i.e. one request per millisecond. Your server hiccups for 500 ms. During that hiccup the worker is stuck waiting on the in-flight request and fires zero new requests. Your p99 shows one request that took 500 ms, and you conclude "99% of users see under 2 ms, one unlucky user saw half a second." What actually happened is that several hundred would-be requests were never sent at all, and any real user who tried during that window would have waited up to the full 500 ms too. Those samples are missing from your distribution. The tool coordinated with the hiccup by omitting exactly the samples that would have hurt the most.
Gil Tene (the creator of HdrHistogram, as it happens) has written extensively about this. The rigorous fix is to schedule requests against an intended rate, not an "as fast as possible per worker" loop, and when a response comes back late, synthetically backfill the histogram with the virtual samples that would have been generated during the stall.
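The backfill step Tene describes fits in a few lines. This sketch mirrors what hdrhistogram's record_correct does, with a recording closure standing in for the histogram:

```rust
// Coordinated-omission backfill: alongside the measured latency, record the
// "virtual" samples a constant-rate client would have observed while this
// request was stalling. All times in microseconds.
fn record_corrected(record: &mut impl FnMut(u64), measured_us: u64, interval_us: u64) {
    record(measured_us);
    // Each request that should have gone out during the stall would have
    // finished one intended interval sooner than the last.
    let mut missing = measured_us.saturating_sub(interval_us);
    while missing >= interval_us {
        record(missing);
        missing -= interval_us;
    }
}
```

With an intended interval of 10 ms, a 35 ms stall records 35, 25, and 15 ms instead of just the lone 35 ms sample.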
http-bench does not do this. It runs the naive closed-loop pattern (each worker waits for its response before firing the next), the same pattern wrk runs by default. I call this out honestly in the README and in the article because a load tester that silently ignores coordinated omission while still printing a clean p99 is actively misleading.
What this means in practice: the tool is fine for "roughly how fast can this endpoint go and what's the latency shape at saturation" questions. It is not the right tool for "what would my users actually experience at a specific target RPS when the server has a 500 ms hiccup" questions. For that you want an open-loop, constant-throughput tool like wrk2, which schedules against an intended rate and corrects for coordinated omission.
I considered adding a --rate flag that would do the intended-rate math, but that's a pretty significant increase in complexity (you need a pacer per worker, a synthetic backfill step, and a way to decide what the intended send time was) and it didn't fit the "small enough to read in one sitting" constraint. Better to not ship a half-solution.
Per-worker RPS in the report is a small sanity check on this. If you see one worker at 50 RPS and another at 200 RPS, something is off. Usually it means a keepalive connection got stuck somewhere and the client isn't balancing load. That tiny field has saved me from bad conclusions twice already.
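That eyeball check is easy to automate. A hypothetical helper (not in http-bench, which just prints the per-worker numbers) that flags any worker more than 2x away from the mean:

```rust
// Return indices of workers whose RPS is less than half or more than
// double the mean: a crude but effective imbalance flag.
fn imbalanced_workers(per_worker_rps: &[f64]) -> Vec<usize> {
    let mean = per_worker_rps.iter().sum::<f64>() / per_worker_rps.len() as f64;
    per_worker_rps
        .iter()
        .enumerate()
        .filter(|&(_, &rps)| rps < mean / 2.0 || rps > mean * 2.0)
        .map(|(i, _)| i)
        .collect()
}
```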
The safety gate
The third intentional decision: by default, http-bench refuses to hit a target that isn't obviously yours. Private IP ranges (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16), *.local, *.localhost, and literal localhost are allowed. Everything else needs --allow-internet.
```rust
fn ipv4_is_private(v4: Ipv4Addr) -> bool {
    let [a, b, _, _] = v4.octets();
    v4.is_loopback()
        || v4.is_unspecified()
        || v4.is_link_local()
        || a == 10
        || (a == 172 && (16..=31).contains(&b))
        || (a == 192 && b == 168)
}

pub fn classify(host: &str) -> TargetKind {
    if host.eq_ignore_ascii_case("localhost") {
        return TargetKind::Private;
    }
    if let Ok(ip) = host.parse::<IpAddr>() {
        return match ip {
            IpAddr::V4(v4) if ipv4_is_private(v4) => TargetKind::Private,
            IpAddr::V6(v6) if v6.is_loopback() || v6.is_unspecified() => TargetKind::Private,
            _ => TargetKind::Public,
        };
    }
    if host.ends_with(".localhost") || host.ends_with(".local") {
        return TargetKind::Private;
    }
    TargetKind::Public
}
```
This is syntactic, not DNS-aware. A hostname like my-box.internal that happens to resolve to 127.0.0.1 still classifies as public; we don't want to consult DNS before deciding whether the user consented. If you own the host you can either use the IP literal in the URL or pass --allow-internet.
Why is this worth it? Because firing thousands of requests per second at someone else's server without permission can range from "annoying" to "your cloud provider's acceptable use policy explicitly forbids this" to "literal denial-of-service under your local computer misuse law". Making the refusal the default and the opt-in a single flag is cheap ergonomically and saves future-me from fat-fingering a URL.
(The .local suffix case is for mDNS users; .localhost is because RFC 6761 carves it out as a local-only TLD, and having it fail the safety gate would be silly.)
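To make the syntactic-only rule concrete, here are the edge cases as a spot check. The definitions are condensed from above so the snippet compiles on its own; the assertions are the interesting part:

```rust
use std::net::{IpAddr, Ipv4Addr};

#[derive(Debug, PartialEq)]
enum TargetKind {
    Private,
    Public,
}

fn ipv4_is_private(v4: Ipv4Addr) -> bool {
    let [a, b, _, _] = v4.octets();
    v4.is_loopback() || v4.is_unspecified() || v4.is_link_local()
        || a == 10 || (a == 172 && (16..=31).contains(&b)) || (a == 192 && b == 168)
}

fn classify(host: &str) -> TargetKind {
    if host.eq_ignore_ascii_case("localhost") {
        return TargetKind::Private;
    }
    if let Ok(ip) = host.parse::<IpAddr>() {
        return match ip {
            IpAddr::V4(v4) if ipv4_is_private(v4) => TargetKind::Private,
            IpAddr::V6(v6) if v6.is_loopback() || v6.is_unspecified() => TargetKind::Private,
            _ => TargetKind::Public,
        };
    }
    if host.ends_with(".localhost") || host.ends_with(".local") {
        return TargetKind::Private;
    }
    TargetKind::Public
}

fn spot_checks() {
    assert_eq!(classify("localhost"), TargetKind::Private);
    assert_eq!(classify("127.0.0.1"), TargetKind::Private);
    assert_eq!(classify("::1"), TargetKind::Private);
    assert_eq!(classify("10.2.3.4"), TargetKind::Private);
    assert_eq!(classify("172.20.0.1"), TargetKind::Private);
    assert_eq!(classify("printer.local"), TargetKind::Private);
    // Syntactic only: this needs --allow-internet even if DNS says loopback.
    assert_eq!(classify("my-box.internal"), TargetKind::Public);
    assert_eq!(classify("example.com"), TargetKind::Public);
}
```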
Try it in 30 seconds
```shell
git clone https://github.com/sen-ltd/http-bench.git
cd http-bench
docker build -t http-bench .

# Safe-by-default against a public target.
docker run --rm http-bench https://example.com
# exit 2, friendly message pointing at --allow-internet

# Short run against a public target with the explicit opt-in.
docker run --rm http-bench https://example.com \
    --allow-internet --duration 2s --concurrency 4 --timeout 3s
```
The second command prints the full text report shown in the screenshot at the top: RPS, per-worker RPS, status distribution, and latency percentiles from HDR histograms.
Tradeoffs and what this tool is not
Real talk:
- No coordinated-omission correction. Discussed above. Use wrk2 if you care.
- No HTTP/2 multiplexing. reqwest can do HTTP/2, but each worker owns one logical connection at a time. If you want to saturate an HTTP/2 endpoint with one logical request per stream, this isn't the right shape.
- No streaming bodies. `--body`/`--body-file` load the whole thing into memory and clone a `Vec<u8>` per request. Fine for JSON payloads, bad for uploading multi-gigabyte blobs.
- No TLS session resumption check. Every request goes through the reqwest connection pool's existing session; there's no visibility into whether a handshake happened or got reused.
- No Lua scripting. wrk's killer feature. Not present here, and not planned.
What this tool is: a small, readable, honest-about-its-limits load tester I can run from docker run on any box, with correct percentiles via HDR histograms and a safety gate so I don't accidentally ruin someone's day. Exactly the thing I'd leave running in a tmux pane while I'm debugging a local service.
Closing
Entry #172 in a 100+ portfolio series by SEN LLC. Related entries worth looking at:
- port-scanner: the same safety-gate pattern applied to a polite TCP port scanner.
- http-runner: execute `.http` files from the CLI, useful alongside this for "hit this real endpoint and bench it".
Feedback welcome. If you have strong opinions about coordinated omission, I already know.
