If you didn't run your system at 1M requests per second, it doesn't support 1M requests per second.
Running it at 100k and extrapolating the rest assumes linearity, and linearity is the most dangerous assumption in performance engineering.
Most performance benchmarks don't fail because the numbers are wrong.
They fail because we ask them the wrong question.
Low latency at low load doesn't predict high throughput at high load.
It predicts only one thing: the system wasn't busy yet.
This post explains why that assumption breaks, why queues quietly dominate system behavior, and why throughput and latency are inseparable once you leave the comfort of idle benchmarks.
The intuition trap: "fast = scalable"
You run a benchmark and see:
- p99 latency: 1 microsecond
- test duration: a few minutes
- CPU looks idle
- no errors
The natural conclusion: "At 1 microsecond per request, this system can do ~1M requests per second."
This assumes linearity: double the load, double the work, latency stays flat, resources scale smoothly.
Real systems don't work like that.
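Written out, the extrapolation is a single division, and every assumption it smuggles in becomes visible. A minimal sketch using the numbers from the benchmark above:

```python
# Naive extrapolation: treat per-request latency as the only cost.
latency_s = 1e-6                  # 1 microsecond per request, from the benchmark

naive_capacity = 1 / latency_s
print(f"{naive_capacity:,.0f} requests/second")   # 1,000,000

# Hidden assumptions in that one division:
#   - the server handles requests one at a time, back to back, never idle
#   - requests never overlap, so nothing ever waits
#   - latency stays at 1 microsecond no matter how busy the system gets
```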
A real-life analogy: the coffee shop
A coffee machine makes a coffee in 1 second. Does that mean the shop can serve 60 people per minute?
Only if customers arrive one at a time, nobody queues, and nobody orders at the same moment.
The moment two people arrive together, one waits. The coffee machine is still "fast", but latency increases because of waiting.
Your server works the same way.
The one rule you actually need
There's a fundamental relationship that applies to coffee shops, highways, and servers:
Average concurrency = throughput × latency
This is Little's Law.
The implication: throughput is capped by concurrency divided by latency. To push more throughput, more requests must be in flight at once, and once those requests share CPUs, NICs, and locks, being in flight together means waiting.
You don't get infinite throughput just because one request is fast.
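In code, the law is a single multiplication. A minimal sketch, reusing the coffee shop numbers from above:

```python
def avg_concurrency(throughput_per_s: float, latency_s: float) -> float:
    """Little's Law: average number of items in the system (being served or waiting)."""
    return throughput_per_s * latency_s

# Quiet shop: 1 customer served per second, 1 second per coffee.
print(avg_concurrency(throughput_per_s=1.0, latency_s=1.0))    # 1.0 customer inside

# Rush hour: the machine still serves 1 customer per second,
# but each customer now spends 30 seconds queueing plus being served.
print(avg_concurrency(throughput_per_s=1.0, latency_s=30.0))   # 30.0 customers inside
```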
Applying this to the "1M RPS" claim
Simple math:
- Average latency: 1 microsecond
- Target throughput: 1,000,000 requests/second
Little's Law says: on average, the system has exactly 1 request in flight.
One. No overlap. No waiting. No queue.
No real server operates that way.
Requests arrive in bursts, the OS schedules work, network packets queue, memory is shared.
So what really happens?
Requests wait. Queues form. Latency grows.
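Run the same arithmetic for the server. The 1 millisecond figure below is an assumption for illustration, not a number from the benchmark, but it shows how quickly the required concurrency grows once latency is realistic:

```python
target_rps = 1_000_000

# The claim as stated: 1 microsecond of latency at 1M requests/second.
print(target_rps * 1e-6)   # 1.0 request in flight, on average: no overlap, no queue

# The same throughput with a more realistic 1 millisecond of latency under load:
print(target_rps * 1e-3)   # 1,000 requests in flight, all sharing CPUs, NIC queues, and locks
```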
Why linear extrapolation fails
The core mistake: "It works at 100k RPS, so it should work at 1M RPS."
This ignores:
- Queue growth
- Contention
- OS limits
- Network saturation
- Scheduling delays
These effects are non-linear. They appear near the limit, not at half load. Systems often look fine right up until they collapse.
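The post doesn't depend on any particular queueing model, but even the simplest one, a single server with random arrivals (M/M/1), shows the shape of the curve. A sketch, assuming a server that could do 1M requests/second if nothing ever waited:

```python
# M/M/1 queue: average time in system W = 1 / (mu - lam)
mu = 1_000_000   # service rate: requests/second the server can do with zero waiting

for utilization in (0.10, 0.50, 0.90, 0.99, 0.999):
    lam = utilization * mu                # arrival rate (offered load)
    avg_latency_us = 1e6 / (mu - lam)     # seconds converted to microseconds
    print(f"{utilization:6.1%} load -> {avg_latency_us:8.1f} us average latency")

#  10.0% load ->      1.1 us
#  50.0% load ->      2.0 us
#  90.0% load ->     10.0 us
#  99.0% load ->    100.0 us
#  99.9% load ->   1000.0 us
```

Half load costs almost nothing; the last few percent of headroom cost everything. That is what non-linear means here.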
What short benchmarks actually measure
Short benchmarks typically run for a few minutes with light traffic, no queues, and warm caches.
What they really measure: "How fast is my code when nothing is waiting?"
That's useful, but it's not capacity. It's like timing how fast a cashier scans one item and assuming the store can handle Black Friday.
The invisible enemy: queues
As load increases, requests start waiting, queues grow, latency increases, and throughput stops scaling.
Nothing crashes. No error appears. No bug is introduced. The system just gets slower.
Another analogy: highways
A highway at night is empty, fast, and smooth. The same highway at rush hour (same road, same speed limit) develops traffic jams.
Why? Once the road is almost full, small increases in traffic cause huge delays. Servers behave exactly like this.
Network limits show up before your code
At high request rates with realistic payloads (roughly 1 KB per request):
- Network cards queue packets
- Kernel buffers fill
- Interrupts pile up
Latency increases before your application code even runs. A microbenchmark won't show this. A real load test will.
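The bandwidth arithmetic is worth doing before any load test. A rough sketch, using the ~1 KB payload mentioned above and ignoring TCP/IP and Ethernet framing overhead entirely:

```python
rps = 1_000_000
payload_bytes = 1_000        # ~1 KB per request, before protocol overhead

gbits_per_s = rps * payload_bytes * 8 / 1e9
print(f"{gbits_per_s:.1f} Gbit/s")   # 8.0 Gbit/s

# That is most of a 10 Gbit/s NIC for request payloads alone.
# Headers, ACKs, and responses come on top, so the wire starts
# queueing before your application code is ever the bottleneck.
```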
Ports and connections also saturate
High request rates, especially over short-lived connections, run into hard limits:
- TCP port exhaustion
- Accept queue overflows
- File descriptor limits
- Sockets stuck in TIME_WAIT
These are operating system constraints, not application bugs. Short tests rarely reach them. Production traffic does.
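The connection math is just as unforgiving. A back-of-the-envelope sketch, assuming common Linux defaults (an ephemeral port range of 32768 to 60999 and roughly 60 seconds spent in TIME_WAIT; check your own kernel's settings):

```python
ephemeral_ports = 60999 - 32768 + 1   # ~28k usable source ports on a default Linux box
time_wait_s = 60                      # how long a closed socket lingers in TIME_WAIT

# Each (client IP, server IP, server port) combination can only recycle ports this fast:
max_new_connections_per_s = ephemeral_ports / time_wait_s
print(f"{max_new_connections_per_s:.0f} new connections/second")   # ~470

# Without keep-alive, connection pooling, or extra source IPs, a single client
# talking to a single endpoint tops out over three orders of magnitude
# below 1M requests/second.
```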
Why nanosecond benchmarks are misleading
Nanosecond benchmarks usually prove one thing: the data was already in cache.
They measure best-case conditions: hot CPU caches, no contention, no waiting. They don't measure queue buildup, cache eviction, real concurrency, or sustained pressure.
They're useful for understanding code-level performance, but dangerous if treated as capacity numbers.
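For contrast, this is roughly what a nanosecond-scale microbenchmark exercises: a hot loop over data that never leaves cache. A sketch using Python's timeit; the absolute number will differ wildly from a compiled benchmark, and the point is what the measurement does and does not include:

```python
import timeit

table = {i: i * 2 for i in range(1_000)}   # small enough to stay cache-resident

# One thread, one caller, the same hot key, nothing else competing for the CPU:
runs = 1_000_000
total_s = timeit.timeit("table[500]", globals={"table": table}, number=runs)
print(f"{total_s / runs * 1e9:.0f} ns per lookup")

# What this number does not include: concurrent callers, cache eviction by other
# work, lock contention, syscalls, or requests queueing behind each other.
```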
How to test honestly
If you want to know whether your system handles 1M requests per second:
- Run it at 1M RPS
- Run it long enough for queues to form (minutes, not seconds)
- Watch latency as load increases
- Expect non-linear behavior
- Stop assuming linearity
Anything less is guessing.
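What that can look like in practice: an open-loop load generator that fires requests on a fixed schedule whether or not earlier requests have finished, then reports latency per load level. This is an illustrative sketch, not a tool recommendation, and `send_request` is a stand-in for whatever your system's real client call is:

```python
import asyncio
import time

async def send_request() -> None:
    # Stand-in for a real client call (HTTP request, RPC, queue publish, ...).
    await asyncio.sleep(0.001)

async def open_loop_test(target_rps: int, duration_s: float) -> list[float]:
    """Fire requests at a fixed rate, independent of how fast they complete."""
    latencies: list[float] = []
    tasks: list[asyncio.Task] = []

    async def timed_call() -> None:
        start = time.perf_counter()
        await send_request()
        latencies.append(time.perf_counter() - start)

    interval = 1 / target_rps
    start = time.perf_counter()
    next_send = start
    while time.perf_counter() - start < duration_s:
        tasks.append(asyncio.create_task(timed_call()))
        next_send += interval
        await asyncio.sleep(max(0.0, next_send - time.perf_counter()))

    await asyncio.gather(*tasks)
    return latencies

async def main() -> None:
    # Step the load up and watch the latency curve; expect it to bend, not stay flat.
    for rps in (100, 500, 1_000, 2_000):
        latencies = sorted(await open_loop_test(target_rps=rps, duration_s=5))
        p99 = latencies[int(len(latencies) * 0.99)]
        print(f"{rps:>5} RPS -> p99 {p99 * 1000:.2f} ms over {len(latencies)} requests")

asyncio.run(main())
```

The open-loop schedule matters: a closed-loop harness that waits for each response before sending the next one hides queueing by construction, because it can never offer more load than the system absorbs.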
The takeaway
Latency isn't a property of a request. It's a property of a system under load.
If you push more traffic, latency must increase, or concurrency must increase, or the system saturates. There is no fourth option.
Once you internalize this, benchmarks stop lying — because you stop asking them the wrong questions.