The Illusion of Scale, Part 4: Latency Is a Design Decision, Not a Measurement

#distributedsystems #performance #architecture #backend

I need to tell you about the time I confidently presented a latency budget to a stakeholder and then watched it disintegrate in production like wet tissue paper.

We had a system with a 200ms latency budget. We'd measured every component. Auth service: 15ms. Business logic: 30ms. Database query: 40ms. Total: well inside budget. I remember feeling good about this. We'd done the work. We had numbers.

We shipped it. In production, the auth call that took 15ms in testing was regularly hitting 200ms at peak. So I panicked. I ran profiling tools. I even onboarded new profiling tools, because LLM agents did not exist back then and I was desperate. I was sure some block of code was causing it: a write to the DB, a read from the DB, an API call, something that would explain this. But no. Here's the bitter truth.

The auth service was slow. Because it was shared with four other services, all of which peaked at the same time, and nobody -- I mean nobody, including me -- had reserved any capacity for ours. We had 15ms allocated. We were spending 200ms. The rest of the budget was irrelevant at that point.

That was the moment I stopped treating latency as a measurement exercise and started treating it as a design problem. Measure-and-optimize sounds like engineering rigor. In practice, it's usually "discover your architectural constraints too late to change them cheaply."

This is Part 4 of a series on the assumptions that quietly wreck systems at scale.

You can't optimize your way out of a bad structure

The instinct is totally reasonable. Build the thing, run load tests, measure latency, optimize what's slow. Feels like good engineering discipline. I've given this exact advice to junior engineers.

The problem is: by the time you're measuring in production, the decisions that created the latency are three layers deep in the architecture. Changing them means rewriting things that other things depend on, under load, with users waiting. That's not optimization. That's reconstruction. And it happens at the worst possible time -- when you're already under pressure to deliver and three days away from piloting the product.

Load tests seldom catch the real issue either. They model the traffic you imagined. Production brings shared dependencies, concurrent spikes, and usage patterns that your test suite never considered because honestly, why would it? You test what you know. Production teaches you what you didn't (at the most inconvenient time possible).

Where latency actually lives (hint: not where you think)

The obvious suspects -- slow queries, unoptimized loops, API calls with bad timeouts -- those are worth fixing. Sure. But they're usually not the interesting problem.

The interesting latency problems are structural. They're baked into how the system is organized before anyone writes a line of code.

Chattiness. A user-facing request that requires eight internal service calls, made one after another, has a latency floor equal to the sum of those calls. Even if you can run some of them in parallel, the floor is still the slowest chain of dependent calls. You cannot optimize below that floor. No amount of caching or connection pooling or index tuning changes the fundamental math. You have to redesign the call structure. Which is a very different conversation than "let's optimize the hot path."
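
Here's a minimal sketch of that math. The eight per-call latencies are made-up numbers, not measurements from our system, and the two helper functions are mine for illustration. It assumes the calls are either strictly one after another or fully independent; real call graphs usually sit somewhere in between.

```python
# Hypothetical latencies (ms) for eight internal calls behind one user request.
call_latencies_ms = [12, 8, 15, 10, 9, 14, 11, 13]


def sequential_floor_ms(latencies):
    """Calls made one after another: the floor is the sum of all of them."""
    return sum(latencies)


def parallel_floor_ms(latencies):
    """Fully independent calls issued at once: the floor is the slowest call."""
    return max(latencies)


print(sequential_floor_ms(call_latencies_ms))  # 92
print(parallel_floor_ms(call_latencies_ms))    # 15
```

Shaving a millisecond off each call moves the first number by eight. Changing the call structure is what moves you from the first number toward the second.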

Unbounded fanout. A query that touches N records where N is controlled by user input is fine in development, where every test dataset is small and tidy. In production, one legitimate power user has an N that's ten thousand times your assumption, and the query that runs in 20ms for everyone else runs in three minutes for them. And -- I love this part -- they're usually your most important customer. So the conversation about "we need to add limits" becomes a very political discussion very quickly.
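
One way to put a bound on N, sketched under assumptions: the cap of 500 and the names `clamp_limit` and `paginate` are hypothetical, and a real system would push the limit into the query itself rather than slice a list in memory.

```python
MAX_PAGE_SIZE = 500  # assumed cap; derive yours from the latency budget


def clamp_limit(requested, max_page_size=MAX_PAGE_SIZE):
    """Bound a user-controlled N before it ever reaches the query."""
    if requested is None or requested < 1:
        return max_page_size
    return min(requested, max_page_size)


def paginate(record_ids, requested_limit, offset=0):
    """Return one bounded page and the offset of the next page (or None)."""
    limit = clamp_limit(requested_limit)
    page = record_ids[offset:offset + limit]
    has_more = offset + limit < len(record_ids)
    return page, (offset + limit if has_more else None)


# The power user asks for everything and still gets a bounded page.
page, next_offset = paginate(list(range(10_000)), requested_limit=10_000)
print(len(page), next_offset)  # 500 500
```

The power user still gets all their data, just in pages, and the worst-case cost of a single request stops depending on who is asking.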

Synchronous waits on async work. This is the quietest killer. If your system waits synchronously for something that's fundamentally asynchronous -- a write to propagate, a downstream service to confirm, a cache to warm -- you've put a hard floor under your response time. No optimization gets you below that floor. You have to change the boundary between sync and async, which is one of those decisions I mentioned in Part 1 that's genuinely hard to reverse.
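
To make the boundary concrete, here's a toy sketch using only the Python standard library. Everything in it is an assumption for illustration: `slow_propagation` stands in for whatever downstream confirmation you're waiting on, the 50ms sleep is invented, and an in-process queue is a stand-in for a real message broker.

```python
import queue
import threading
import time

work_queue = queue.Queue()
results = {}


def slow_propagation(job_id, payload):
    """Stand-in for a write propagating or a downstream service confirming."""
    time.sleep(0.05)
    results[job_id] = f"done:{payload}"


def worker():
    """Background consumer: does the slow work off the request path."""
    while True:
        job_id, payload = work_queue.get()
        slow_propagation(job_id, payload)
        work_queue.task_done()


def handle_request_sync(job_id, payload):
    """The caller waits out the slow work on every single request."""
    slow_propagation(job_id, payload)
    return "confirmed"


def handle_request_async(job_id, payload):
    """The caller gets an acknowledgement; confirmation arrives later."""
    work_queue.put((job_id, payload))
    return "accepted"


threading.Thread(target=worker, daemon=True).start()
print(handle_request_async("job-1", "profile-update"))  # accepted
work_queue.join()  # only the demo waits here, not the request handler
print(results["job-1"])  # done:profile-update
```

Notice what changed: the response now means "accepted," not "confirmed." Every caller has to be okay with that, which is exactly why this boundary is hard to move after the fact.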

Latency budgets: think before you build, not after

Here's what actually works: decide your latency budget before you build, not after, and give yourself some buffer, because trust me, you are going to need it.

Take your target response time. Allocate it across each component in the critical path. Every component has its own number. Write it down. Put it somewhere people will see it.
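
Written down, the budget from my story could look something like this. It's a sketch, not a tool we actually ran. The 200ms target, the three allocations, and the 200ms auth peak are from the story above; the function names are hypothetical, and I'm assuming business logic and the database held steady at peak, which I never verified.

```python
TARGET_MS = 200

# Per-component allocations on the critical path, in milliseconds.
budget_ms = {"auth": 15, "business_logic": 30, "database": 40}


def headroom_ms(target_ms, budget):
    """Unallocated buffer left once every component has its number."""
    return target_ms - sum(budget.values())


def overruns_ms(budget, observed):
    """Components that blew their allocation, and by how much."""
    return {
        name: observed[name] - allowed
        for name, allowed in budget.items()
        if observed.get(name, 0) > allowed
    }


# Auth at peak is from the story; the other two are assumed unchanged.
observed_peak_ms = {"auth": 200, "business_logic": 30, "database": 40}

print(headroom_ms(TARGET_MS, budget_ms))            # 115
print(overruns_ms(budget_ms, observed_peak_ms))     # {'auth': 185}
print(sum(observed_peak_ms.values()) <= TARGET_MS)  # False
```

115ms of headroom looked generous on paper. One component overrunning by 185ms ate all of it and then some.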

What this surfaces immediately: shared dependencies. When two components share a downstream resource, their budgets aren't independent. The budget math that looks fine for each component in isolation falls apart when they both spike at the same time. That's exactly what happened with our auth service. If we'd done this exercise before building, we would have caught that the auth service was shared and had no capacity isolation. We would have had a conversation about it. Maybe we would have made the same choice, but at least it would have been a choice and not a surprise.
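
The check itself can be tiny. Here's a sketch that flags any downstream resource with more than one consumer. The service and database names are placeholders (the story only tells you auth was shared with four other services), and `shared_dependencies` is a hypothetical helper, not something from a library.

```python
from collections import defaultdict

# Hypothetical map: component -> downstream resources on its critical path.
depends_on = {
    "our_service": ["auth", "orders_db"],
    "service_a": ["auth"],
    "service_b": ["auth", "reports_db"],
    "service_c": ["auth"],
    "service_d": ["auth"],
}


def shared_dependencies(depends_on):
    """Return each downstream resource that has more than one consumer."""
    consumers = defaultdict(list)
    for component, downstreams in depends_on.items():
        for downstream in downstreams:
            consumers[downstream].append(component)
    return {
        name: sorted(users)
        for name, users in consumers.items()
        if len(users) > 1
    }


print(shared_dependencies(depends_on))
# {'auth': ['our_service', 'service_a', 'service_b', 'service_c', 'service_d']}
```

Anything that shows up in that output is a place where per-component budgets are not independent, and where somebody needs to ask who gets capacity when everyone peaks together.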

Writing the budget down also forces tradeoffs into the open before anyone's committed code. Maybe something expensive moves off the critical path and gets computed asynchronously. Maybe you denormalize data you'd rather not. Those are real conversations worth having before the code exists.

I know this sounds like process for process's sake. It's not. It's the difference between "we chose to accept this tradeoff" and "we discovered this tradeoff during an incident at 2am."

The number that should scare you

10ms of unnecessary latency at 100,000 requests per second is 1,000 seconds of user wait time per second of operation.

Let that sink in for a second. One thousand seconds of wasted human time, every second your system is running.

That's not a performance problem. That's a customer problem. It's why teams at real volume spend weeks on single-digit millisecond improvements and can justify every hour of it. When someone asks "is 10ms really worth optimizing?" the answer depends entirely on your volume. At low traffic, no. At high traffic, it's one of the highest-leverage things you can do.
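
The arithmetic, if you want to plug in your own numbers. The function name is mine, and the 50 requests per second figure is just an assumed example of "low traffic."

```python
def wait_seconds_per_second(extra_latency_ms, requests_per_second):
    """Total user wait time added during each second of operation."""
    return extra_latency_ms * requests_per_second / 1000


print(wait_seconds_per_second(10, 100_000))  # 1000.0
print(wait_seconds_per_second(10, 50))       # 0.5
```

Same 10ms, same code path. At one volume it's a rounding error, at the other it's a thousand seconds of people waiting, every second.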

The conversation I couldn't answer

There was a point where a stakeholder asked us why the system was "sometimes fast and sometimes slow with no obvious pattern." We couldn't answer cleanly. Not because we didn't understand the code -- we did. We just hadn't modeled what the components did to each other under concurrent load.

The answer turned out to be resource contention between two services that looked completely independent on the architecture diagram. They shared a database. Nobody had documented that as a latency dependency. It had just been built that way, probably seemed fine at the time, and nobody had flagged it.

I spent an embarrassing amount of time looking at application code when the problem was infrastructure topology. Once I found it, the fix was straightforward. But the finding took days because I was looking in the wrong places.

After that experience, every shared dependency in a critical path gets an explicit owner and an explicit budget in any system I work on. Not because it's an elegant process. Because the alternative is standing in front of a stakeholder at 2pm on a Tuesday unable to explain why the system is slow in ways you can't describe.
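
In practice this can be as boring as a small registry plus a check that fails when an entry is incomplete. This is a sketch of the idea, not the exact thing I use: the team name, the dependency names, and the `unaccounted` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedDependency:
    name: str
    owner: Optional[str]      # team accountable for its latency
    budget_ms: Optional[int]  # what the critical path is allowed to spend on it


def unaccounted(dependencies):
    """Names of shared dependencies missing an owner or a budget."""
    return [
        dep.name
        for dep in dependencies
        if not dep.owner or dep.budget_ms is None
    ]


registry = [
    SharedDependency("auth", owner="identity-team", budget_ms=15),
    SharedDependency("shared_db", owner=None, budget_ms=None),
]

print(unaccounted(registry))  # ['shared_db']
```

An empty list doesn't mean the system is fast. It means that when it's slow, somebody knows whose number was missed.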

Final post next week: the systems that outlive the teams that built them, and what the ones that survive actually have in common. (Spoiler: it's not architectural cleverness.)

Where did latency surprise you? What was the shared dependency nobody had mapped? I've started keeping a list and it's getting disturbingly long.