Rajkiran

Posted on May 31

System Design - 5. Latency vs Throughput

Latency vs Throughput: Why "Average Response Time" Is the Biggest Lie in Engineering

#systemdesign #designsystem #software #softwaredevelopment

The Misleading Dashboard
Imagine you're an engineer at an e-commerce company. Your monitoring dashboard shows:

*Average API response time: 120ms* ✅

Looks healthy. You go home. Sleep well.

But tomorrow morning, 1% of your users are filing complaints. Their checkouts are taking 8–10 seconds. They're abandoning their carts. You're losing thousands of dollars per hour.

Your average is fine. Your tail latency is catastrophic.

This is one of the most common and costly blind spots in engineering. And it starts with a fundamental misunderstanding of what latency and throughput actually mean — and how to measure them properly.

Latency: What It Really Is
_Latency_ is the time it takes for a single request to travel from start to finish.

User sends request → [time passes] → User receives response

|←─────────── Latency ────────────→|

Simple enough. But here's where engineers go wrong: they think "latency" means "average latency."

Average latency hides the suffering of your slowest users.

If you have 100 requests and 99 take 100ms while 1 takes 9 seconds:

Average = (99 × 100ms + 1 × 9000ms) / 100 = 189ms

Your average looks fine (189ms). But 1 in every 100 users just waited 9 seconds. That's your tail.
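To make the arithmetic concrete, here is a minimal Python sketch of the same 100-request example. The variable names are illustrative, not from any particular monitoring tool. It shows how the mean barely moves, the median does not move at all, and only the maximum reveals the slow request.

```python
from statistics import mean, median

# 99 fast requests at 100 ms and one slow request at 9,000 ms, as in the example above.
latencies_ms = [100] * 99 + [9000]

avg_ms = mean(latencies_ms)       # nudged up only slightly by the single outlier
median_ms = median(latencies_ms)  # what the typical user sees
worst_ms = max(latencies_ms)      # what the unlucky user sees

print(f"mean={avg_ms} ms, median={median_ms} ms, max={worst_ms} ms")
```

The mean works out to 189 ms here, which is why a dashboard showing only averages can look healthy while one user waits 9 seconds.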

Tail Latency: The Metric That Actually Matters
Percentile latency (p50, p90, p99, p999) tells you the experience of every percentile of your users — not just the average.

p50 (median): 100ms → half your users are faster, half slower

p90: 200ms → 90% of users faster than this, 10% slower

p99: 800ms → 99% faster, 1% slower

p999: 5000ms → 99.9% faster, 0.1% slower

p99 is the standard production metric. It tells you what the 1-in-100 request experiences. At 1 million requests per day, the 1% above p99 is 10,000 requests slower than that threshold every day.

p999 matters for financial systems and health-critical applications — the 1-in-1000 case. At high volume, even 0.1% is a lot of users.
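As a rough illustration of how these numbers are derived, the sketch below computes percentiles with the nearest-rank method. The `percentile` helper and the synthetic `latencies_ms` data are assumptions made for this example, chosen so the output matches the figures listed above. Production systems usually rely on histograms or streaming estimators rather than sorting raw samples.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    a fraction q of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))           # 1-based rank
    return ordered[min(rank, len(ordered)) - 1]  # convert to 0-based index

# Synthetic data (an assumption for illustration): 1,000 requests.
latencies_ms = [100] * 600 + [200] * 350 + [800] * 45 + [5000] * 5

report = {
    name: percentile(latencies_ms, q)
    for name, q in [("p50", 0.50), ("p90", 0.90), ("p99", 0.99), ("p999", 0.999)]
}
print(report)
```

Only 5 of the 1,000 requests are slow enough to show up at p999, yet at a million requests a day that would be thousands of multi-second waits.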

A rule often attributed to Google: optimize for p99, not the average. The users who experience high latency are disproportionately likely to be your most active users, because they make more requests and so hit the tail more often.

Why Tail Latency Is Hard to Kill
The "slow server" problem:

In a distributed system, a single user request often fans out to multiple backend services:

User Request

├── Service A (50ms)

├── Service B (80ms)

├── Service C (200ms ← occasional GC pause)

└── Service D (60ms)

Total latency = max(50, 80, 200, 60) = 200ms (not the sum, the max)

The slowest dependency determines total latency. This is called the "long tail of latency in parallel calls" — and it means that even if each service has a 1% chance of being slow, a request that touches 100 services has a 63% chance of hitting at least one slow one.

P(at least one slow service) = 1 - (1 - 0.01)^100 ≈ 63%
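The two calculations above can be checked with a few lines of Python. This is a small sketch with hypothetical function names. It assumes the backend calls run fully in parallel and that each service is slow independently of the others.

```python
def fan_out_latency_ms(call_latencies_ms):
    """Parallel fan-out: the caller waits for the slowest dependency."""
    return max(call_latencies_ms)

def sequential_latency_ms(call_latencies_ms):
    """Same calls made one after another: the caller waits for the sum."""
    return sum(call_latencies_ms)

def p_at_least_one_slow(p_slow, n_services):
    """Chance that at least one of n independent services is slow."""
    return 1 - (1 - p_slow) ** n_services

calls_ms = [50, 80, 200, 60]  # Services A to D from the diagram
print(fan_out_latency_ms(calls_ms))     # parallel: bounded by Service C
print(sequential_latency_ms(calls_ms))  # sequential, for comparison
print(round(p_at_least_one_slow(0.01, 100), 2))
```

If the slow events are correlated (for example, a shared database under load), the independence assumption no longer holds and the real figure will differ.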

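The hedged request pattern (Google's solution): Send the same request to two servers. Use whichever responds first. Cancel the other.

Cost: up to 2x backend load if both copies are sent at once. A common refinement is to send the second copy only after the first has been outstanding longer than a high percentile of expected latency (say p95), which keeps the extra load to a few percent. Benefit: p99 latency drops dramatically, because you're no longer at the mercy of the slow tail.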
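One way to sketch a hedged request is with Python's `asyncio`. The `hedged_call` helper, the `hedge_delay_s` parameter, and the `fake_backend` stand-in are all assumptions for this example, not a description of Google's implementation. Setting `hedge_delay_s=0` sends both copies at essentially the same time. A positive delay sends the backup only if the primary is slow to answer, which reduces the extra load.

```python
import asyncio

async def hedged_call(make_call, primary, backup, hedge_delay_s):
    """Call `primary`; if it has not answered within hedge_delay_s,
    also call `backup`. Return the first result and cancel the loser."""
    first = asyncio.create_task(make_call(primary))
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # primary answered in time, no hedge needed

    second = asyncio.create_task(make_call(backup))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the slower copy
    return done.pop().result()

# Stand-in backend (hypothetical): sleeps, then returns the replica name.
async def fake_backend(replica):
    await asyncio.sleep(replica["delay_s"])
    return replica["name"]

slow = {"name": "replica-a", "delay_s": 0.5}
fast = {"name": "replica-b", "delay_s": 0.01}
winner = asyncio.run(hedged_call(fake_backend, slow, fast, hedge_delay_s=0.05))
print(winner)
```

This sketch is only safe for idempotent requests, since the same request may be executed on both servers before the cancellation takes effect.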

Throughput: The Other Half of the Equation
Throughput is how many requests your system handles per unit of time.

Throughput = requests per second (RPS) or queries per second (QPS)

If latency is about speed for one user, throughput is about capacity for all users simultaneously.

A restaurant analogy:

- Latency = how long your meal takes from order to table
- Throughput = how many meals the restaurant serves per hour

A fine dining restaurant might have 45-minute latency and low throughput (30 meals/hour). A fast food restaurant has 3-minute latency and massive throughput (600 meals/hour).

Neither is better — it depends what you're optimizing for.

The Latency-Throughput Trade-off
Here's the non-obvious insight: latency and throughput often trade off against each other.

Scenario: Database writes

Option A: Write synchronously to all 3 replicas before responding

→ Latency: High (wait for all 3)

→ Throughput: Low (can only process next request after all 3 ack)

Option B: Write to 1 replica, acknowledge immediately, replicate async

→ Latency: Low (respond immediately)

→ Throughput: High (pipeline many requests)

→ Trade-off: Brief inconsistency window
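A tiny model of the two options, assuming the three replicas are written in parallel and using made-up acknowledgement times, might look like this. The function names and the `replica_ack_ms` values are hypothetical.

```python
def sync_write_latency_ms(replica_ack_ms):
    """Option A: respond only after every replica has acknowledged.
    With parallel writes, the slowest replica sets the latency."""
    return max(replica_ack_ms)

def async_write_latency_ms(replica_ack_ms, primary=0):
    """Option B: respond as soon as the primary replica has acknowledged;
    the remaining replicas catch up in the background."""
    return replica_ack_ms[primary]

# Illustrative numbers (assumed): one replica is in a distant region.
replica_ack_ms = [4, 6, 30]
print(sync_write_latency_ms(replica_ack_ms))   # waits for the 30 ms replica
print(async_write_latency_ms(replica_ack_ms))  # waits only for the primary
```

The gap between the two numbers is roughly the inconsistency window that Option B accepts in exchange for lower latency.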

_Batching increases throughput at the cost of latency:_

Without batching:

Request 1 → DB write (5ms) → respond

Request 2 → DB write (5ms) → respond

Throughput: 200 req/sec

With batching (wait 10ms, write 50 requests together):

50 requests → 1 DB write (8ms) → respond to all 50

Throughput: ~6,000 req/sec

Latency: up to +10ms wait time, plus the longer 8ms write (worse for individual requests)
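The batching numbers can be reproduced with a short calculation. This sketch assumes the database write is the bottleneck and that the next batch fills while the previous write is in flight. The function names are illustrative.

```python
def unbatched_throughput(write_ms):
    """Requests per second when each request is its own write."""
    return 1000 / write_ms

def batched_throughput(batch_size, write_ms):
    """Requests per second when one write carries a whole batch.
    Assumes batches are collected while the previous write runs."""
    return batch_size * 1000 / write_ms

def worst_case_batched_latency_ms(wait_ms, write_ms):
    """A request arriving at the start of the window waits the full
    window and then the batch write."""
    return wait_ms + write_ms

print(unbatched_throughput(5))               # one 5 ms write per request
print(batched_throughput(50, 8))             # 50 requests per 8 ms write
print(worst_case_batched_latency_ms(10, 8))  # versus 5 ms without batching
```

If batches are not pipelined, each cycle costs the wait plus the write, and throughput is correspondingly lower than this estimate.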

Kafka uses batching aggressively. Producers batch messages for a few milliseconds before sending. This is why Kafka's throughput is extraordinary — but individual message latency is slightly higher than a system that writes one message at a time.

Amdahl's Law: The Ceiling on Parallelism
Here's a fundamental truth about performance optimization that most engineers learn too late:

Amdahl's Law: The maximum speedup from parallelizing a program is limited by its sequential portions.

Speedup = 1 / (S + (1 - S) / N)

Where:

S = fraction of the program that must run sequentially (can't be parallelized)

N = number of parallel processors

Example:

Your task: 80% can be parallelized, 20% must be sequential (S = 0.20)

With 2 processors: Speedup = 1 / (0.2 + 0.8/2) = 1 / 0.6 = 1.67×

With 4 processors: Speedup = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5×

With 8 processors: Speedup = 1 / (0.2 + 0.8/8) = 1 / 0.3 = 3.33×

With ∞ processors: Speedup = 1 / 0.2 = 5× (the ceiling!)

Even with infinite servers, you can only ever get 5× speedup if 20% of your work is sequential.
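The formula is easy to tabulate. Here is a minimal sketch with hypothetical helper names, using the same S and N symbols as above.

```python
def amdahl_speedup(sequential_fraction, processors):
    """Speedup = 1 / (S + (1 - S) / N)."""
    if not 0 <= sequential_fraction <= 1:
        raise ValueError("sequential_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1 - sequential_fraction
    return 1 / (sequential_fraction + parallel_fraction / processors)

def amdahl_ceiling(sequential_fraction):
    """Limit of the speedup as N grows without bound: 1 / S."""
    if sequential_fraction == 0:
        return float("inf")
    return 1 / sequential_fraction

for n in (2, 4, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.20, n), 2))
print("ceiling:", amdahl_ceiling(0.20))
```

Running the loop shows the curve flattening: going from 64 to 1,024 processors adds very little once the sequential 20% dominates.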

What this means for system design:

- Adding more servers hits diminishing returns quickly
- The bottleneck is always your sequential component: usually the database, a single queue consumer, or a global lock
- Optimize the sequential bottleneck first before adding horizontal capacity

Real example: Uber's geospatial matching — finding nearby drivers — must happen in a single, ordered process (otherwise two requests might match the same driver). This sequential requirement is why Uber invests so heavily in optimizing that single component rather than just adding more servers.

How Google Search Achieves Sub-100ms Latency
Google Search processes your query in under 100ms. Globally. Against petabytes of indexed data. This is arguably the most impressive latency achievement in software history.

How? A combination of several techniques:

1. Inverted index pre-computation: Google doesn't search the web when you search. It searches a pre-built index that maps every word to the pages containing it. The real work happens before your query.

2. Massive parallelism with result merging: Your query fans out to thousands of servers simultaneously, each searching a slice of the index. Results are merged. The total latency is the max of all parallel calls plus merge time, not the sum.

3. Proximity and caching: Your request goes to a data center close to you. Popular queries are cached. Cache hits return in single-digit milliseconds.

4. Tail latency management: Slow-responding servers are detected, and results are assembled without them. Better to return a slightly incomplete result fast than a complete result slowly.
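To illustrate the first technique, here is a toy inverted index in Python. It is a deliberately simplified sketch: the `build_index` and `search` helpers and the sample pages are made up for this example, and a real search engine adds ranking, tokenization, compression, and sharding on top.

```python
from collections import defaultdict

def build_index(pages):
    """Pre-compute word -> set of page ids. This runs before any query."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return pages containing every query word, using only set lookups."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

# Hypothetical documents for illustration.
pages = {
    "page-1": "latency and throughput",
    "page-2": "tail latency",
    "page-3": "throughput limits",
}
index = build_index(pages)
print(sorted(search(index, "latency")))
print(sorted(search(index, "latency throughput")))
```

The query path never scans page text. It only intersects pre-built sets, which is what moves the expensive work out of the request.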

The lesson for your system designs: Latency at scale requires pre-computation (don't compute at query time), parallelism (fan out, then merge), proximity (serve from nearby), and graceful degradation (handle slow components without blocking).
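The "fan out, then merge" and "graceful degradation" ideas can be combined in a small `asyncio` sketch. Everything here is an assumption for illustration: `query_shard` is a stand-in that sleeps instead of doing real work, and `search_all` and `deadline_s` are hypothetical names. Shards that miss the deadline are simply left out of the merged result.

```python
import asyncio

async def query_shard(shard):
    """Stand-in for a network call to one index shard."""
    await asyncio.sleep(shard["delay_s"])
    return shard["hits"]

async def search_all(shards, deadline_s):
    """Query every shard in parallel, wait at most deadline_s,
    and merge whatever came back in time."""
    tasks = [asyncio.create_task(query_shard(shard)) for shard in shards]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # give up on stragglers instead of blocking
    merged = sorted(hit for task in done for hit in task.result())
    return merged, len(pending)

shards = [
    {"delay_s": 0.01, "hits": ["b", "a"]},
    {"delay_s": 0.01, "hits": ["c"]},
    {"delay_s": 1.0, "hits": ["z"]},  # a straggler, e.g. mid GC pause
]
merged, skipped = asyncio.run(search_all(shards, deadline_s=0.2))
print(merged, "skipped shards:", skipped)
```

The response time is capped by the deadline rather than by the slowest shard, at the price of occasionally returning an incomplete result.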

AWS Lambda: High Throughput, Variable Latency
AWS Lambda is the canonical example of a high-throughput, latency-variable system.

Cold start problem: When a Lambda function hasn't been invoked recently, AWS needs to provision a container, load your code, and initialize your runtime. This cold start adds 100ms–2 seconds of latency.

For a function that normally runs in 50ms, adding 100ms to 2 seconds means roughly 3x to 40x higher latency for that first request.

How teams handle it:
- _Provisioned concurrency_: keep a pool of warm containers always ready (eliminates cold starts, increases cost)
- _Ping/keep-alive_: schedule a heartbeat invocation every few minutes to keep containers warm
- _Minimize package size_: smaller packages mean faster cold starts
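The keep-alive approach usually needs a small amount of cooperation from the function itself. Below is a minimal sketch of a Python Lambda handler that short-circuits on heartbeat events. The `"warmup"` key is a hypothetical marker you would put in the scheduled event's payload, not something AWS defines, and the response shape is only an example.

```python
# Module-level code runs once per container, during the cold start.
# Expensive setup (clients, config, connections) would go here so that
# warm invocations can reuse it.
_invocations = 0

def handler(event, context):
    """Entry point with the standard Python Lambda signature."""
    global _invocations
    _invocations += 1

    # Heartbeat from the scheduler: do no real work, just keep the
    # container warm. "warmup" is an assumed, user-chosen key.
    if event.get("warmup"):
        return {"warmed": True}

    # Normal request path. The flag is True only if this is the first
    # invocation this container has handled.
    return {
        "statusCode": 200,
        "body": "ok",
        "cold_start": _invocations == 1,
    }
```

Note that a periodic ping typically keeps only a small number of containers warm, so a traffic burst that needs more concurrency can still trigger cold starts.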

This is the throughput vs latency trade-off made explicit: Lambda gives you highly elastic throughput (it scales out to many concurrent executions, up to your account's concurrency limit) at the cost of variable, sometimes high latency.

Interview Scenario: "How Do You Reduce p99 Latency?"
This is a real senior engineer interview question. Here's a structured answer:

Step 1: Identify where latency is coming from

Use distributed tracing (Jaeger, Zipkin) to find which service contributes most to p99 latency. It's almost never where you expect.

Step 2: Check for the usual suspects

- Database: missing index? N+1 query problem? Lock contention?
- Network: are you making sequential calls that could be parallel?
- GC pauses: Java/Go garbage collection causing periodic freezes?
- Thread pool exhaustion: requests queuing waiting for threads?
- External API calls: are you blocking on a slow third-party service?

Step 3: Apply targeted fixes

- Add caching in front of slow operations
- Parallelize sequential calls where possible
- Use async/non-blocking I/O to avoid thread exhaustion
- Implement circuit breakers for slow external dependencies
- Consider hedged requests for the most latency-sensitive paths
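Of the fixes above, the circuit breaker is the least self-explanatory, so here is a minimal single-threaded sketch. The class, its parameters (`max_failures`, `reset_after_s`), and the injectable `clock` are assumptions for this example. Production libraries add thread safety, metrics, and richer half-open handling.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying it."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock       # injectable so tests can control time
        self.failures = 0        # consecutive failures seen so far
        self.opened_at = None    # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of waiting on a dependency known to be bad.
                raise CircuitOpenError("dependency recently failed")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a slow third-party call as `breaker.call(fetch_rates, currency)` (hypothetical names) means that once the dependency has failed repeatedly, callers get an immediate error they can handle with a fallback, rather than each request waiting out a timeout.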

Step 4: Monitor the right metric

Fix for p99, not for average.

After your fix, check p99, p95, p90 — not just average latency.

Key Takeaways
- Latency = time for one request. Throughput = requests handled per second. Both matter, but they measure different things.
- Average latency is misleading. p99 and p999 tell you what your worst-served users actually experience.
- At scale, tail latency compounds: a request touching many services is likely to hit at least one slow one.
- Latency and throughput trade off. Batching improves throughput but increases individual request latency.
- Amdahl's Law says you can't parallelize your way past the sequential bottleneck. Find it and fix it first.
- Hedged requests (send to two servers, use the faster) are a p99 reduction technique popularized by Google.
- Always trace before optimizing: latency problems are almost never where you assume.
