Your Latency Budget Is Lying: The Real Cost of a Single Extra Network Hop
That one "harmless" extra service call is quietly burning your p99. Here's the math, the failure modes, and how to fix it.
You shipped a feature. Everything looked fine in staging. The integration tests passed. The average response time in production is 120ms — well within the 200ms target your team agreed on six months ago.
Then someone checks the p99.
It's 780ms.
The dashboards look fine at a glance, users aren't screaming yet, but something is clearly wrong. You start digging. You find that three weeks ago, someone added a call to a new internal service — a feature flag resolver, a permission check, a logging sidecar flush — and nobody thought much of it. "It only adds about 5ms," they said.
And they were right, at the median. But at the tail? It quietly murdered your latency budget.
This is the story of how that happens, why it's almost always invisible until it isn't, and what you can actually do about it.
A single network hop looks trivial in isolation. In a distributed system, it's never just one hop.
First, What Even Is a Latency Budget?
A latency budget is a constraint. It's the total time you have available to fulfill a request end-to-end — from the client sending the first byte to the client receiving the last byte — before the experience degrades.
Your product team says "the page must load in under 200ms." That 200ms is the budget. Now you have to allocate it across every layer of your stack.
A typical allocation for a server-rendered web request might look like this:
| Layer | Allocated Time |
|---|---|
| DNS resolution (cached) | ~1ms |
| TCP + TLS handshake (cached) | ~5ms |
| Network transit (round trip) | ~20ms |
| Load balancer + reverse proxy | ~3ms |
| Application logic | ~80ms |
| Database query | ~40ms |
| Response serialization | ~10ms |
| Network return | ~20ms |
| Total | ~179ms |
That gives you roughly 21ms of buffer. Sounds reasonable. But notice that this model assumes one path through your system. In reality, modern distributed systems don't have one path. They have a graph of paths, and each path has its own tail behavior.
The moment you add one more synchronous network hop — another service call, a proxy that wasn't there before, a new sidecar — you don't just add the median latency of that hop. You add its entire latency distribution. Including its p99. Including its occasional 2-second timeout spike. And those distributions don't add linearly.
The Math They Don't Put in Your Architecture Diagram
Let's be precise about this, because it's the core of everything.
If you make a single call to a service with the following latency distribution:
- p50: 5ms
- p95: 20ms
- p99: 80ms
...then at the 50th percentile, your caller sees 5ms. Fine.
But now suppose you're calling five services in series. Even if every one of them has the same "5ms median" profile:
The compound tail problem:
If each service independently has a 1% chance of hitting 80ms, then the probability that at least one of them hits 80ms in a single request is:
P(at least one slow) = 1 - P(all fast)
= 1 - (0.99)^5
= 1 - 0.951
= 4.9%
So your compound p95 is now being shaped by the slowest of five services, not the average. What was a 1-in-100 event for each service individually becomes a nearly 1-in-20 event for the composite request.
Add ten services and the math gets grimmer:
P(at least one slow) = 1 - (0.99)^10 = 9.6%
Your p99 just became your p90. In production, at scale, that's thousands of requests per minute hitting the tail.
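This compounding is simple enough to script and keep next to your architecture docs. A minimal Go sketch (the function name is mine, not from any library):

```go
package main

import (
	"fmt"
	"math"
)

// compoundSlowProb returns the probability that at least one of n
// independent calls lands in its slow tail, given a per-call tail
// probability p (0.01 models "the p99 region").
func compoundSlowProb(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	for _, n := range []int{1, 5, 10, 20} {
		fmt.Printf("services: %2d -> P(at least one slow) = %.1f%%\n",
			n, 100*compoundSlowProb(0.01, n))
	}
}
```

This prints 1.0%, 4.9%, 9.6%, and 18.2% — the 5- and 10-service rows matching the hand calculation above.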
This is the phenomenon described in the classic Google paper "The Tail at Scale" — and it's been reproduced in real systems countless times since.
What Actually Happens Inside a Single Extra Hop
When you add a synchronous call to another service, here's what actually happens on the wire — most of which is invisible in your flame graphs if you're not looking:
1. TCP Connection Overhead
If the connection isn't kept alive (common in naive HTTP/1.1 setups or misconfigured HTTP/2), every call involves a TCP handshake: ~1 RTT. At a typical intra-datacenter round trip of 1–5ms, that's 1–5ms before you've sent a single byte of your request.
Connection pooling eliminates most of this, but only if you've set it up correctly and your pool isn't exhausted under load.
2. TLS Negotiation
If the service-to-service call is over HTTPS (as it should be in a zero-trust setup), TLS adds latency. A TLS 1.3 handshake with session resumption costs roughly 0.5–2ms. Without session resumption, it's a full handshake costing 1–2 RTTs.
In a service mesh like Istio with mutual TLS (mTLS), every single pod-to-pod call goes through TLS — it's automatic and transparent, which is great for security and brutal for people who thought "service mesh is free."
Benchmarks of Istio with Envoy sidecars have shown consistent per-hop overhead of 1–5ms added latency at the median, with p99 overheads stretching into tens of milliseconds under load, depending on payload size and connection concurrency.
3. Serialization and Deserialization
Your service sends a request body. JSON, Protobuf, MessagePack — doesn't matter, it costs something. JSON serialization of a medium-complexity object (10–20 fields, some nested) in Node.js or Go costs roughly 0.05–0.5ms. Across many hops at high concurrency, this adds up. More importantly, large payloads increase memory allocation, which can trigger GC pauses — and GC pauses are essentially uncapped.
4. Queueing at the Receiving End
Even if the downstream service is fast on average, under real traffic it's doing other things. Goroutines are scheduled. Thread pools have limits. Connection queues fill up. The incoming request waits.
This is the queueing component of latency — often the largest and most volatile contributor to tail latency — and it's completely invisible to the caller. Your request could sit in a queue for 0ms at 10 RPS and 200ms at 1000 RPS, and your p50 will look fine the whole time while your p99 is on fire.
5. The Return Trip
All of the above applies symmetrically on the way back: serialization of the response, TCP acknowledgment, return network latency. A "fast" synchronous RPC call to an internal service that "only" takes 3ms median has already consumed 3ms of your budget before your code has done anything with the result.
Visualizing the Compounding Effect
Let's walk through a concrete example.
The scenario: an e-commerce checkout endpoint
Your /checkout endpoint has a 200ms latency budget. Here's the architecture three months ago vs. today.
Before:
Measured latency breakdown:
- Network + gateway: 5ms
- Checkout service logic: 30ms
- DB query (indexed): 25ms
- Response serialization + return: 10ms
- Total p50: ~70ms. p99: ~130ms. Budget remaining: ~70ms.
After (four new hops added over three months):
Now let's reconstruct the budget:
| Hop | p50 | p99 |
|---|---|---|
| Network + gateway | 5ms | 10ms |
| Auth service call | 8ms | 60ms |
| Feature flag service | 4ms | 40ms |
| Checkout logic | 30ms | 55ms |
| DB query | 25ms | 70ms |
| Inventory service call | 10ms | 90ms |
| Pricing service call | 12ms | 85ms |
| Return + serialization | 10ms | 20ms |
| Total | ~104ms | ~430ms |
The p50 looks fine. Still well under 200ms. But the tail is wrecked: naively summing per-hop p99s overstates the true compound p99, yet even the realistic compound figure blows far past the budget — and the team didn't notice because their alerting was on average response time.
This is an extremely common pattern. It's how systems that "feel fast" break under scrutiny.
Every hop through your data center carries overhead that compounds across the request chain.
Tail Latency: The Number That Actually Matters for Users
Most teams instrument p50. Some instrument p95. Very few actually act on p99. This is a mistake.
The p99 is the latency that 1 in 100 requests experiences. At 100 requests per second, that's one degraded request every second. At 10,000 requests per second, it's 100 per second.
More critically: the p99 of your composite service is almost always dominated by the worst single component in your call chain. If you have ten services and one of them has an occasionally misbehaving garbage collector, that GC pause becomes your p99 — even if the other nine services are perfectly tuned.
Here's a simulation in Go that demonstrates the compound distribution:
```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

// simulateHopLatency returns a latency in ms for a single service hop.
// Models a bimodal distribution: usually fast, occasionally slow.
func simulateHopLatency(rng *rand.Rand) float64 {
	if rng.Float64() < 0.99 {
		// Fast path: normally distributed around 5ms
		return 5.0 + rng.NormFloat64()*1.5
	}
	// Slow path: GC pause, queue buildup, etc.
	return 5.0 + 60.0 + rng.NormFloat64()*10.0
}

func percentile(sorted []float64, p float64) float64 {
	idx := int(p / 100.0 * float64(len(sorted)))
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	samples := 100_000
	for numHops := 1; numHops <= 5; numHops++ {
		results := make([]float64, samples)
		for i := 0; i < samples; i++ {
			total := 0.0
			for h := 0; h < numHops; h++ {
				total += simulateHopLatency(rng)
			}
			results[i] = total
		}
		sort.Float64s(results)
		fmt.Printf("Hops: %d | p50: %.1fms | p95: %.1fms | p99: %.1fms\n",
			numHops,
			percentile(results, 50),
			percentile(results, 95),
			percentile(results, 99),
		)
	}
}
```
Running this produces output like the following (exact tail values fluctuate run to run, especially the 1-hop p99, which sits right on the 1% boundary):

```
Hops: 1 | p50: 5.0ms  | p95: 7.5ms  | p99: 36.4ms
Hops: 2 | p50: 10.0ms | p95: 13.9ms | p99: 75.1ms
Hops: 3 | p50: 15.0ms | p95: 20.3ms | p99: 84.4ms
Hops: 4 | p50: 20.0ms | p95: 26.9ms | p99: 92.3ms
Hops: 5 | p50: 25.0ms | p95: 35.6ms | p99: 99.5ms
```

Notice what happened: at one hop, the slow path is a 1-in-100 event, so it barely registers at p99. Add a second hop and the chance of hitting at least one slow response roughly doubles to ~2% — the p99 jumps from the edge of the fast cluster straight into the slow one. By five hops the slow probability is ~4.9% and even the p95 is being dragged toward the tail. Not because the services got slower — because the probability of hitting at least one slow response keeps compounding. This is tail amplification, and it's the reason p50 monitoring is effectively useless for latency budget tracking.
The Invisible Hops You Forget to Count
Here's the thing: engineers are usually aware of the obvious hops — the service calls they wrote. What they miss are the silent ones:
Service mesh sidecars. In Istio or Linkerd, every outbound and inbound request passes through an Envoy/Linkerd proxy. That's two extra network hops per RPC call. The proxy has its own CPU overhead, memory allocation, and queue. At high RPS, this isn't free. Benchmarks show Istio adding 1–5ms to median latency, with meaningfully worse tail behavior under load.
Feature flag SDKs calling home. Some feature flag systems are backed by an SDK that does a remote HTTP call to resolve flags per request. If your flag SDK is calling out to a remote service on every checkout request, that's a hop you probably forgot to count. It's especially painful because flag evaluation feels like it should be pure local logic.
Auth middleware calling an external service. JWT validation is local and fast. But if your auth middleware is calling a user service or an OAuth introspection endpoint to validate tokens per request, you've added a hop that's invisible in your app code but very visible in your latency.
Centralized rate limiters. Redis-backed rate limiters are common and reasonable. But a call to Redis over the network on every request adds 0.5–3ms depending on co-location, even when it's just an `INCR`. At high traffic, Redis also becomes a hot node, and its tail latency degrades.
Distributed tracing agents. Most tracing SDKs are async and non-blocking. Some aren't, or have internal queues that fill up under load and start blocking.
Load balancers in front of load balancers. Cloud-managed load balancers in front of ingress controllers in front of service mesh proxies in front of your app. That's three layers before your code runs.
None of these hops appear in your architecture diagram. All of them show up in your flame graphs.
Queueing Theory, Very Briefly
You don't need a PhD in queueing theory to understand why adding hops is dangerous. You just need one intuition from Little's Law:
L = λW
Where:
- L = average number of requests in the system
- λ = arrival rate
- W = average time a request spends in the system
As W (the latency per request) increases due to extra hops, L (backlog) grows proportionally. When backlog grows, queueing delays increase, which makes W larger, which makes L larger. This feedback loop is what turns a "5ms extra hop" into "500ms occasional spikes" — the system tips past its natural equilibrium.
The practical implication: every hop you add reduces your headroom before the system becomes queue-bound under load.
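Little's Law is easy to make concrete with the kind of numbers this article has been using (the values below are illustrative, not measurements):

```go
package main

import "fmt"

// inFlight applies Little's Law, L = lambda * W: the average number
// of requests concurrently in the system equals the arrival rate
// times the average time each request spends inside.
func inFlight(lambda, w float64) float64 {
	return lambda * w
}

func main() {
	// A service at 1000 RPS whose mean latency grows from 50ms to
	// 80ms after a new hop plus the queueing it induces.
	fmt.Println(inFlight(1000, 0.050)) // 50 requests in flight
	fmt.Println(inFlight(1000, 0.080)) // 80 requests in flight
	// Those extra 30 in-flight requests hold connections, memory,
	// and worker threads -- which is exactly how queueing feeds back
	// into W and pushes the system toward saturation.
}
```

If your connection pool or worker pool is sized for the first number, the second number is already queue-bound.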
How to Actually Measure Your Latency Budget
Knowing the theory is one thing. Measuring it in production is where most teams fail. Here's how to do it properly.
1. Trace every request end-to-end with OpenTelemetry
Distributed tracing is the single most important tool for latency budget tracking. If you're not already using OpenTelemetry, this is the baseline.
A basic setup in Node.js:
```javascript
// tracing.js — initialize before anything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

sdk.start();
```
Once you have traces flowing into Jaeger, Tempo, or Honeycomb, you can:
- See the waterfall diagram for every request
- Identify which span is consuming the most time
- Filter for p99-class requests specifically (e.g. `duration > 400ms`) and see what's different about them
- Compare span durations across percentiles
The key metric to extract from your traces: span duration by percentile, per service. Not aggregate. Per service. That's how you find the outlier.
2. Calculate your budget utilization per span
Most teams look at total response time. What you want is a budget utilization view — a percentage of the budget consumed at each hop.
This is trivially expressible as a Prometheus query if you're using span metrics:
```
# Fraction of the total budget consumed by one service at p99
histogram_quantile(0.99,
  sum(rate(http_server_duration_bucket{service_name="inventory-service"}[5m])) by (le)
)
/ 0.200  # divided by your 200ms budget (assumes buckets are in seconds)
```
If this query returns 0.45 for inventory-service, that single service is consuming 45% of your budget at p99. You now have a number to act on.
3. Measure, don't estimate, the overhead of infrastructure layers
Before profiling your application code, measure the bare overhead of your infrastructure:
- Add a `/health` endpoint to your service that does nothing except return 200
- Measure its latency from another pod in the same cluster
- That number is your infrastructure floor: it includes DNS, proxy overhead, TLS, and serialization
- The floor is the fixed cost of every hop; anything your application spends above it needs a reason
In a well-tuned Kubernetes cluster without a service mesh, this baseline is typically 0.5–2ms. With Istio mTLS, it's typically 2–8ms, sometimes higher.
Mistakes Teams Make (That Kill Their Latency Budget)
These are the patterns I see repeatedly in real systems.
Alerting on p50 instead of p99
Average and median latency look good right up until your on-call engineer gets paged by an angry stakeholder. Alert on p95 and p99. The p50 is almost useless for user-facing latency SLOs.
Adding hops without counting them
Every architectural decision that adds a synchronous network call should be an explicit tradeoff discussion: "This adds approximately Xms to our median latency and introduces Y% tail risk." That conversation almost never happens because teams think about correctness, not latency topology.
Treating timeouts as a safety net, not a budget item
A timeout of 500ms on a downstream call is not "safe." If that downstream service is called on every request and occasionally runs all the way to that 500ms limit, your caller blocks for 500ms before it can even return a degraded response. A generous timeout doesn't protect your latency; it only caps the damage. Tune timeouts aggressively.
The right mental model: your timeout is the maximum you're willing to spend on that hop. It should be a fraction of your total budget, not a failsafe.
Ignoring retry amplification
Retries with no budget awareness are latency multipliers. If service A times out calling service B and retries twice, a single user request has now made three calls to service B. Under load, this turns transient slowness into a cascading failure. Always budget for retries:
effective_timeout = (retry_count + 1) * per_attempt_timeout + (retry_count * retry_delay)
If you have 3 retries, a 100ms per-attempt timeout, and a 50ms retry delay, a single user request can block for up to 550ms on that one hop. That's well over your entire budget, gone, on error handling.
Not accounting for fan-out in parallel calls
Parallel service calls look free on a timeline diagram. They're not. The total latency of N parallel calls is max(L1, L2, ..., LN) — the slowest one. And as N grows, the probability that at least one of them lands in its tail climbs fast: a "parallel" checkout that fans out to 8 services, each with a 1% chance of a slow response, hits at least one slow response on roughly 1 request in 13.
Trusting that the service mesh is zero-cost
Istio and Linkerd are excellent tools. They are not zero-cost. Benchmark them. Measure the overhead in your specific workload. The overhead depends heavily on payload size, connection concurrency, and CPU availability on the sidecar. At high RPS with large payloads, the overhead is significant.
Latency compounds invisibly across layers. Observability is the only way to see the full picture.
How to Reduce Latency and Reclaim Your Budget
Once you've measured the problem, here's how to actually fix it.
1. Eliminate unnecessary synchronous hops entirely
This is the most impactful change and the hardest to get approved. Ask for every synchronous service call in your hot path: "Does this need to happen before I return a response?"
Feature flag resolution: cache flags locally and refresh asynchronously. Don't call a remote service on every request.
Auth token validation: validate JWTs locally with a public key. Don't introspect them via HTTP.
Audit logging: write to a local queue and flush asynchronously. The audit log doesn't need to be consistent before the user gets their response.
Each hop you remove doesn't just save its own latency. It removes its entire tail distribution from your compound calculation.
2. Move from HTTP to faster transports where it matters
HTTP/1.1 → HTTP/2 for multiplexing. HTTP/2 → gRPC with connection reuse for internal service calls. gRPC with Protobuf serialization typically cuts serialization overhead by 3–10x compared to JSON, and connection multiplexing eliminates most connection-establishment overhead. This won't save you from architectural problems, but in a path where every millisecond counts, it's worth it.
3. Parallelize what can be parallelized, but set a real fan-out budget
If your request genuinely needs data from multiple services, call them in parallel. But bound the fan-out. If you're calling 8 services in parallel, don't "wait for all of them" — wait up to a fixed slice of your budget and serve degraded data for whatever hasn't answered. This is the partial-response pattern; a related technique, request hedging, sends a duplicate request to another replica when the first is slow and takes whichever answers first. Both are powerful patterns for high-availability systems.
```go
// Example: parallel fetch with a shared timeout and partial-result
// tolerance. ServiceClient and Result are your application's types.
import (
	"context"
	"sync"
	"time"
)

func fetchWithTimeout(ctx context.Context, services []ServiceClient, budgetMs int) []Result {
	ctx, cancel := context.WithTimeout(ctx, time.Duration(budgetMs)*time.Millisecond)
	defer cancel()

	results := make([]Result, len(services))
	var wg sync.WaitGroup
	for i, svc := range services {
		wg.Add(1)
		go func(idx int, client ServiceClient) {
			defer wg.Done()
			res, err := client.Fetch(ctx)
			if err != nil {
				results[idx] = Result{Degraded: true} // use fallback data
				return
			}
			results[idx] = res
		}(i, svc)
	}
	wg.Wait() // every call returns by the shared deadline at the latest
	return results
}
```
4. Cache aggressively and correctly
Not "add Redis in front of everything," but cache at the right layer:
- In-process cache for data that rarely changes: feature flags, configuration, rate limit thresholds. This eliminates the hop entirely.
- Distributed cache (Redis, Memcached) for data that changes moderately and is expensive to recompute. But remember: a Redis call is still a network hop. Measure it.
- CDN or edge caching for responses that are fully cacheable. The fastest hop is the one that never reaches your origin.
5. Tune your connection pools aggressively
Connection pool exhaustion is one of the most common causes of sudden latency spikes in production. When a pool is exhausted, new requests queue waiting for a connection — and that queueing can spike your p99 into seconds even when the underlying service is healthy.
For every downstream HTTP client in your system, explicitly configure:
- Maximum connections
- Connection timeout (how long to wait for a connection from the pool)
- Request timeout (how long to wait for a response)
- Idle timeout (how long to keep an unused connection alive)
Most HTTP client libraries default to conservative settings that are badly mismatched for high-throughput internal service calls.
6. Profile your serialization
Particularly in JVM-based and Node.js services, JSON serialization of large objects is surprisingly expensive. If you're serializing the same data structure on every request, consider:
- Pre-computing and caching the serialized form
- Switching to Protobuf or MessagePack for internal APIs
- Trimming your response payloads — only send what the caller actually uses
The Architectural Checklist
Before you ship any change that adds a new service call to a latency-sensitive path, run this checklist:
- [ ] Measured the baseline latency of the new dependency (p50, p95, p99) in production or under realistic load
- [ ] Calculated the new compound p99 for the full request chain after adding this hop
- [ ] Verified the new p99 is within the latency budget with margin for growth
- [ ] Considered async alternatives: can this happen outside the request path?
- [ ] Set an explicit timeout on the call — not a default, a deliberate number based on the budget allocation for this hop
- [ ] Defined a fallback for when this call fails or times out — degraded response, cached result, default value
- [ ] Added tracing instrumentation so this hop appears in distributed traces
- [ ] Added latency alerting on this specific service-to-service call at p99
- [ ] Reviewed retry policy — retries are multiplied against the timeout; have you budgeted for them?
- [ ] Checked connection pool settings — are they tuned for the expected concurrency of this call?
- [ ] Reviewed if TLS/mTLS overhead has been measured and accounted for in the budget
If any of these items can't be answered confidently, the PR should not merge into a latency-sensitive path without an explicit team discussion.
A Real Latency Budget Calculation
Let's close the loop with a worked example you can adapt.
System: A mobile app backend. The product requirement is 150ms end-to-end response at p95 for the home feed endpoint.
Budget allocation:
| Component | Budget (p95) | Owner |
|---|---|---|
| DNS + TCP + TLS (mobile to CDN edge) | 10ms | Infrastructure |
| CDN to origin gateway | 5ms | Infrastructure |
| Gateway + auth (JWT local validation) | 5ms | Platform |
| Feature flag resolution (local cache) | 1ms | Platform |
| Feed service business logic | 30ms | App team |
| Primary DB query (indexed read) | 25ms | App team |
| Recommendations service call | 35ms | ML team |
| Response serialization + compression | 8ms | App team |
| Return path network | 10ms | Infrastructure |
| Total allocated | 129ms | |
| Remaining headroom | 21ms | |
This leaves 21ms of headroom before hitting the 150ms SLO. Now someone proposes adding a "personalization boost" service call. Its measured p95 is 18ms.
If you add it synchronously to the hot path, your headroom drops to 3ms. Any slight increase in traffic, any GC event in any service, any network hiccup — and you're over budget. The right conversation is: "Can this call happen asynchronously? Can we pre-compute and cache the result? Does it need to be in the hot path?" Often the answer is no, it doesn't.
This is how you defend your latency budget: with numbers, not intuition.
More Engineering Reads
If this kind of production-depth engineering writing is useful to you, more of it lives at kallis.in — a growing collection of engineering content covering system design, architecture, observability, and real-world development patterns.
The Takeaway
The problem with latency budgets isn't that engineers don't care about them. It's that the damage is cumulative, invisible at the median, and always attributed to "the system getting more complex" rather than the specific architectural decisions that caused it.
One extra hop is never just 5ms. It's 5ms at p50, and it's the entire tail distribution of that service — including its worst-day behavior — injected into every request that goes through it. Multiply that across five services added over six months, and you've turned a snappy product into something that users feel is "kinda slow sometimes."
The tools to fight this aren't exotic. Distributed tracing, explicit budget allocation, p99 alerting, aggressive timeout tuning, and a cultural habit of treating every new synchronous hop as a cost that needs justification. That's it.
Your architecture diagram shows boxes and arrows. Your users experience latency distributions. Make sure someone on your team is closing the gap between those two views — before your p99 starts closing it for you.
#performance #distributedsystems #systemdesign #backend #architecture #microservices