If you didn't run your system at 1M requests per second, it doesn't support 1M requests per second.
Running it at 100k and extrapolating the rest assumes linearity, and linearity is the most dangerous assumption in performance engineering.
Most performance benchmarks don't fail because the numbers are wrong.
They fail because we ask them the wrong question.
Low latency at low load doesn't predict high throughput at high load.
It predicts only one thing: the system wasn't busy yet.
This post explains why that assumption breaks, why queues quietly dominate system behavior, and why throughput and latency are inseparable once you leave the comfort of idle benchmarks.
The intuition trap: "fast = scalable"
You run a benchmark and see:
- p99 latency: 1 microsecond
- test duration: a few minutes
- CPU looks idle
- no errors
The natural conclusion: "At 1 microsecond per request, this system can do ~1M requests per second."
This assumes linearity: double the load, double the work, latency stays flat, resources scale smoothly.
Real systems don't work like that.
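Written out, the extrapolation is a single division, and every assumption it smuggles in becomes visible. A minimal sketch using the numbers from the benchmark above:

```python
# Naive extrapolation: treat per-request latency as the only cost.
latency_s = 1e-6                  # 1 microsecond per request, from the benchmark

naive_capacity = 1 / latency_s
print(f"{naive_capacity:,.0f} requests/second")   # 1,000,000

# Hidden assumptions in that one division:
#   - the server handles requests one at a time, back to back, never idle
#   - requests never overlap, so nothing ever waits
#   - latency stays at 1 microsecond no matter how busy the system gets
```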
A real-life analogy: the coffee shop
A coffee machine makes a coffee in 1 second. Does that mean the shop can serve 60 people per minute?
Only if customers arrive one at a time, nobody queues, and nobody orders at the same moment.
The moment two people arrive together, one waits. The coffee machine is still "fast", but latency increases because of waiting.
Your server works the same way.
The one rule you actually need
There's a fundamental relationship that applies to coffee shops, highways, and servers:
Average concurrency = throughput × latency
This is Little's Law.
The implication: throughput is capped by concurrency divided by latency. To push more throughput, more requests must be in flight at once, and once those requests share CPUs, NICs, and locks, being in flight together means waiting.
You don't get infinite throughput just because one request is fast.
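In code, the law is a single multiplication. A minimal sketch, reusing the coffee shop numbers from above:

```python
def avg_concurrency(throughput_per_s: float, latency_s: float) -> float:
    """Little's Law: average number of items in the system (being served or waiting)."""
    return throughput_per_s * latency_s

# Quiet shop: 1 customer served per second, 1 second per coffee.
print(avg_concurrency(throughput_per_s=1.0, latency_s=1.0))    # 1.0 customer inside

# Rush hour: the machine still serves 1 customer per second,
# but each customer now spends 30 seconds queueing plus being served.
print(avg_concurrency(throughput_per_s=1.0, latency_s=30.0))   # 30.0 customers inside
```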
Applying this to the "1M RPS" claim
Simple math:
- Average latency: 1 microsecond
- Target throughput: 1,000,000 requests/second
Little's Law says: on average, the system has exactly 1 request in flight.
One. No overlap. No waiting. No queue.
No real server operates that way.
Requests arrive in bursts, the OS schedules work, network packets queue, memory is shared.
So what really happens?
Requests wait. Queues form. Latency grows.
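Run the same arithmetic for the server. The 1 millisecond figure below is an assumption for illustration, not a number from the benchmark, but it shows how quickly the required concurrency grows once latency is realistic:

```python
target_rps = 1_000_000

# The claim as stated: 1 microsecond of latency at 1M requests/second.
print(target_rps * 1e-6)   # 1.0 request in flight, on average: no overlap, no queue

# The same throughput with a more realistic 1 millisecond of latency under load:
print(target_rps * 1e-3)   # 1,000 requests in flight, all sharing CPUs, NIC queues, and locks
```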
Why linear extrapolation fails
The core mistake: "It works at 100k RPS, so it should work at 1M RPS."
This ignores:
- Queue growth
- Contention
- OS limits
- Network saturation
- Scheduling delays
These effects are non-linear. They appear near the limit, not at half load. Systems often look fine right up until they collapse.
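The post doesn't depend on any particular queueing model, but even the simplest one, a single server with random arrivals (M/M/1), shows the shape of the curve. A sketch, assuming a server that could do 1M requests/second if nothing ever waited:

```python
# M/M/1 queue: average time in system W = 1 / (mu - lam)
mu = 1_000_000   # service rate: requests/second the server can do with zero waiting

for utilization in (0.10, 0.50, 0.90, 0.99, 0.999):
    lam = utilization * mu                # arrival rate (offered load)
    avg_latency_us = 1e6 / (mu - lam)     # seconds converted to microseconds
    print(f"{utilization:6.1%} load -> {avg_latency_us:8.1f} us average latency")

#  10.0% load ->      1.1 us
#  50.0% load ->      2.0 us
#  90.0% load ->     10.0 us
#  99.0% load ->    100.0 us
#  99.9% load ->   1000.0 us
```

Half load costs almost nothing; the last few percent of headroom cost everything. That is what non-linear means here.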
What short benchmarks actually measure
Short benchmarks typically run for a few minutes with light traffic, no queues, and warm caches.
What they really measure: "How fast is my code when nothing is waiting?"
That's useful, but it's not capacity. It's like timing how fast a cashier scans one item and assuming the store can handle Black Friday.
The invisible enemy: queues
As load increases, requests start waiting, queues grow, latency increases, and throughput stops scaling.
Nothing crashes. No error appears. No bug is introduced. The system just gets slower.
Another analogy: highways
A highway at night is empty, fast, and smooth. The same highway at rush hour (same road, same speed limit) develops traffic jams.
Why? Once the road is almost full, small increases in traffic cause huge delays. Servers behave exactly like this.
Network limits show up before your code
At high request rates with realistic payloads (roughly 1 KB per request):
- Network cards queue packets
- Kernel buffers fill
- Interrupts pile up
Latency increases before your application code even runs. A microbenchmark won't show this. A real load test will.
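The bandwidth arithmetic is worth doing before any load test. A rough sketch, using the ~1 KB payload mentioned above and ignoring TCP/IP and Ethernet framing overhead entirely:

```python
rps = 1_000_000
payload_bytes = 1_000        # ~1 KB per request, before protocol overhead

gbits_per_s = rps * payload_bytes * 8 / 1e9
print(f"{gbits_per_s:.1f} Gbit/s")   # 8.0 Gbit/s

# That is most of a 10 Gbit/s NIC for request payloads alone.
# Headers, ACKs, and responses come on top, so the wire starts
# queueing before your application code is ever the bottleneck.
```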
Ports and connections also saturate
High request rates, especially over short-lived connections, run into hard limits:
- TCP port exhaustion
- Accept queue overflows
- File descriptor limits
- Sockets stuck in TIME_WAIT
These are operating system constraints, not application bugs. Short tests rarely reach them. Production traffic does.
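The connection math is just as unforgiving. A back-of-the-envelope sketch, assuming common Linux defaults (an ephemeral port range of 32768 to 60999 and roughly 60 seconds spent in TIME_WAIT; check your own kernel's settings):

```python
ephemeral_ports = 60999 - 32768 + 1   # ~28k usable source ports on a default Linux box
time_wait_s = 60                      # how long a closed socket lingers in TIME_WAIT

# Each (client IP, server IP, server port) combination can only recycle ports this fast:
max_new_connections_per_s = ephemeral_ports / time_wait_s
print(f"{max_new_connections_per_s:.0f} new connections/second")   # ~470

# Without keep-alive, connection pooling, or extra source IPs, a single client
# talking to a single endpoint tops out over three orders of magnitude
# below 1M requests/second.
```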
Why nanosecond benchmarks are misleading
Nanosecond benchmarks usually prove one thing: the data was already in cache.
They measure best-case conditions: hot CPU caches, no contention, no waiting. They don't measure queue buildup, cache eviction, real concurrency, or sustained pressure.
They're useful for understanding code-level performance, but dangerous if treated as capacity numbers.
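For contrast, this is roughly what a nanosecond-scale microbenchmark exercises: a hot loop over data that never leaves cache. A sketch using Python's timeit; the absolute number will differ wildly from a compiled benchmark, and the point is what the measurement does and does not include:

```python
import timeit

table = {i: i * 2 for i in range(1_000)}   # small enough to stay cache-resident

# One thread, one caller, the same hot key, nothing else competing for the CPU:
runs = 1_000_000
total_s = timeit.timeit("table[500]", globals={"table": table}, number=runs)
print(f"{total_s / runs * 1e9:.0f} ns per lookup")

# What this number does not include: concurrent callers, cache eviction by other
# work, lock contention, syscalls, or requests queueing behind each other.
```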
How to test honestly
If you want to know whether your system handles 1M requests per second:
- Run it at 1M RPS
- Run it long enough for queues to form (minutes, not seconds)
- Watch latency as load increases
- Expect non-linear behavior
- Stop assuming linearity
Anything less is guessing.
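What that can look like in practice: an open-loop load generator that fires requests on a fixed schedule whether or not earlier requests have finished, then reports latency per load level. This is an illustrative sketch, not a tool recommendation, and `send_request` is a stand-in for whatever your system's real client call is:

```python
import asyncio
import time

async def send_request() -> None:
    # Stand-in for a real client call (HTTP request, RPC, queue publish, ...).
    await asyncio.sleep(0.001)

async def open_loop_test(target_rps: int, duration_s: float) -> list[float]:
    """Fire requests at a fixed rate, independent of how fast they complete."""
    latencies: list[float] = []
    tasks: list[asyncio.Task] = []

    async def timed_call() -> None:
        start = time.perf_counter()
        await send_request()
        latencies.append(time.perf_counter() - start)

    interval = 1 / target_rps
    start = time.perf_counter()
    next_send = start
    while time.perf_counter() - start < duration_s:
        tasks.append(asyncio.create_task(timed_call()))
        next_send += interval
        await asyncio.sleep(max(0.0, next_send - time.perf_counter()))

    await asyncio.gather(*tasks)
    return latencies

async def main() -> None:
    # Step the load up and watch the latency curve; expect it to bend, not stay flat.
    for rps in (100, 500, 1_000, 2_000):
        latencies = sorted(await open_loop_test(target_rps=rps, duration_s=5))
        p99 = latencies[int(len(latencies) * 0.99)]
        print(f"{rps:>5} RPS -> p99 {p99 * 1000:.2f} ms over {len(latencies)} requests")

asyncio.run(main())
```

The open-loop schedule matters: a closed-loop harness that waits for each response before sending the next one hides queueing by construction, because it can never offer more load than the system absorbs.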
The takeaway
Latency isn't a property of a request. It's a property of a system under load.
If you push more traffic, latency must increase, or concurrency must increase, or the system saturates. There is no fourth option.
Once you internalize this, benchmarks stop lying — because you stop asking them the wrong questions.