The Illusion of Scale, Part 1: When Your "Scalable" System Isn't

#ai #softwareengineering #distributedsystems #architecture

I want to talk about something that's been bugging me for a while.

There's this moment -- and if you've been in this industry long enough you know exactly what I mean -- where a system that looked rock solid just... stops working. Not dramatically. Not with a big crash and a SEV page at 3am (though sometimes that too). It's more like a slow suffocation. Latencies creep up. Queues get deeper. Someone opens a ticket that says "it feels slow" and you roll your eyes because everything feels slow to users, but then you look at the graphs and oh. Oh no.

I've been on both sides of this. I spent years working on public-sector infrastructure -- criminal justice workflows that had to work across 87 counties in a state, which sounds boring until you realize that "87 counties" means 87 different usage patterns, 87 different peak hours, and at least 12 counties that will absolutely hammer your API in ways you never anticipated. More recently I've been in enterprise AI infrastructure, where the fun game is "this API call costs $0.003 and we make it 40 million times a month, do the math." (It's $120,000 a month.)

Both times, the system didn't fail because we forgot to add servers. It failed because of something dumber.

This is the first in a series I'm writing about scale assumptions. I don't have a clever acronym for it. It's basically: the decisions that seem fine when you're small and make you want to quit your job when you're big.

Linear thinking will absolutely wreck you

Here's the thing nobody tells you early in your career: scaling is not a linear problem, and your intuition about it is almost certainly wrong.

A system handles 1,000 req/s. So 100,000 is just... more machines, right? Tune some indexes, maybe bump the connection pool, call it a day?

Sometimes, honestly, yes. I've had that experience and it's great. You feel like a genius. "We just horizontally scaled it." High fives all around.

But more often -- and this is the part that took me embarrassingly long to internalize -- the bottleneck isn't compute. It's a design choice someone made in week 2 of the project that seemed totally reasonable at the time.

I'll give you a specific example because I think abstractions are useless here.

We had a system running in pilot with one county agency. Worked beautifully. Fast, stable, everyone happy. Then we expanded to three agencies. Same code. Literally the same code, no changes. The system slowed down noticeably.

I remember staring at the metrics, genuinely confused. Nothing changed! So what was it?

What changed was width. Three agencies meant three times the concurrent load on shared workflow components. Database access patterns that were totally fine with one agency's usage started colliding. Integration points that had been sized for one agency's volume were now contested. It wasn't a bug. It was an assumption -- that the system would scale linearly with tenants -- that nobody had written down because nobody had thought to question it.
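
To put rough numbers on that kind of non-linearity, here is a minimal sketch using the textbook M/M/1 queueing formula, where mean time in the system is 1 / (service rate - arrival rate). This is a stand-in model, not a measurement of the system described above: the single shared queue, the 100 req/s capacity, and the 30 req/s per agency are all made-up assumptions for illustration.

```python
def mean_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean seconds a request spends in an M/M/1 queue (waiting plus service).

    Textbook formula: W = 1 / (mu - lambda), valid only when lambda < mu.
    """
    if arrival_rate >= service_rate:
        raise ValueError("arrival rate must stay below service rate")
    return 1.0 / (service_rate - arrival_rate)


# Hypothetical numbers: a shared component that can serve 100 req/s,
# and agencies that each send 30 req/s.
SERVICE_RATE = 100.0
PER_AGENCY_RATE = 30.0

for agencies in (1, 2, 3):
    seconds = mean_time_in_system(agencies * PER_AGENCY_RATE, SERVICE_RATE)
    print(f"{agencies} agencies: {seconds * 1000:.1f} ms mean time in system")
```

With these made-up numbers, going from one agency to three triples the load but takes mean latency from about 14 ms to about 100 ms, roughly 7x. Same code, very different system.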

That was the week I started losing sleep about the statewide rollout. Not because the architecture was bad -- it was actually pretty solid for what it was designed for -- but because "what it was designed for" and "what it was about to face" were diverging fast.

The synchronous call in the hot path (a.k.a. my nemesis)

Okay, pet peeve time.

A 50ms synchronous call to a downstream service. Totally fine at low traffic. You barely notice it. It's in the critical path but hey, 50ms, who cares.

Then traffic goes 10x and suddenly that 50ms dependency is your ceiling. Every request is waiting on it. When it has a bad day, you have a bad day. When it times out, you time out. And the really fun part: by the time you realize this is the problem, it's woven into everything. You can't just "make it async" without rearchitecting half the request flow.
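
The ceiling can be sketched with Little's Law (requests in flight = throughput x time per request). If every request holds a worker for the full duration of the downstream call, throughput tops out at workers divided by call time. The pool size of 200 and the degraded 500ms latency below are hypothetical; only the 50ms figure comes from the example above.

```python
def throughput_ceiling(workers: int, blocking_seconds: float) -> float:
    """Upper bound on req/s when each request blocks one worker for blocking_seconds."""
    if workers <= 0 or blocking_seconds <= 0:
        raise ValueError("workers and blocking_seconds must be positive")
    return workers / blocking_seconds


WORKERS = 200  # hypothetical worker/connection pool size

scenarios = (
    ("healthy downstream (50ms)", 0.050),
    ("downstream having a bad day (500ms)", 0.500),  # hypothetical slowdown
)
for label, latency in scenarios:
    print(f"{label}: at most {throughput_ceiling(WORKERS, latency):,.0f} req/s")
```

Under these assumptions the ceiling is about 4,000 req/s when the dependency is healthy and about 400 req/s when it slows to 500ms, no matter how much CPU is sitting idle.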

I don't have a clean solution here. I just have scar tissue.

Data models: where optimism goes to die

I need to rant about schemas for a second.

Every bad scaling story I have eventually comes back to the data model. Not because anyone designed a bad schema -- usually the schema was perfectly sensible for the requirements as understood at the time. The problem is that schemas encode beliefs about the future, and we are terrible at predicting the future.

Beliefs like:
● "We'll only have a handful of roles" (we now have 47)
● "This workflow has 4 states" (it has 11, plus 3 that are technically illegal but exist in prod)
● "This lookup will always be fast" (it was, until someone added a tenant with 2M records)

These aren't mistakes. They're reasonable bets that didn't pan out. But the wreckage is the same either way.

The logging bill

This one still makes me laugh in a pained way.

You start a project. Good engineering culture. "Let's log everything, we'll need it for debugging." Absolutely correct instinct! Gold star.

Fast forward 14 months. Someone pulls up the infrastructure bill and goes "uh, why is our logging pipeline costing more than our actual application?" And everyone looks at each other. Nobody planned for this. Nobody put "the audit trail will eventually need its own architecture team" on any roadmap. It just... happened. Slowly, and then all at once.

p99 is not a rounding error

I used to think about p99 the way most people do: as an edge case. The unlucky 1%.

Then I did the math on a system doing 100k req/s and realized that 1% is a thousand requests every second getting a bad experience. Those aren't theoretical users. They're filing support tickets. They're hitting retry. Their retries are making other requests slower. The p99 tail is generating its own secondary workload that feeds back into the system.
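
The arithmetic, as a quick sketch. The 100k req/s and the 1% come from the paragraph above; the assumption that each slow request triggers exactly one retry is invented here just to show the feedback, and real retry behavior will differ.

```python
SECONDS_PER_DAY = 86_400


def tail_load(requests_per_second: float, tail_fraction: float,
              retries_per_slow_request: float) -> tuple:
    """Return (slow requests per second, extra retry requests per second)."""
    slow = requests_per_second * tail_fraction
    return slow, slow * retries_per_slow_request


# retries_per_slow_request=1.0 is a hypothetical, deliberately simple assumption.
slow_rps, retry_rps = tail_load(100_000, 0.01, retries_per_slow_request=1.0)

print(f"slow requests: {slow_rps:,.0f}/s, about {slow_rps * SECONDS_PER_DAY:,.0f} per day")
print(f"extra load from retries: {retry_rps:,.0f} req/s")
```

That is 1,000 slow requests per second, or roughly 86 million a day, and under the one-retry assumption another 1,000 req/s of traffic the system generated for itself.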

Your unhappy path, at scale, is a system unto itself. That realization changed how I think about optimization priorities pretty fundamentally.

What actually breaks (spoiler: it's never what you tested)

Look. I have never -- not once in my career -- seen a system fail in production the same way it failed in load testing. The tests always pass because test traffic is polite. Real traffic is feral.

Real traffic is: retries stacking on retries. One tenant with 10x everyone else's data volume. A permissions edge case that only fires for one specific role combination that nobody on the QA team had. Duplicate events from an upstream that swore they'd deduplicate on their end. Events arriving out of order because someone's clock is wrong.

The thing I got most wrong, personally: I assumed a decision-making component would maintain consistent latency as we onboarded more systems. In isolation, it was fast. Really fast. What I didn't think about was what happens when multiple systems are doing concurrent writes to the shared database underneath it. The component was fine. The contention was the problem. And you can't see contention in a single-system test. By definition.

I think the broader lesson -- and sorry if this sounds hand-wavy but I genuinely believe it -- is that at scale, failures happen in the interactions between components. Not in the components. A retry policy that's totally safe in isolation starts amplifying failures when combined with another service's retry policy. Cache invalidation creates cascading churn nobody modeled. A permission check that's microseconds alone shows up on flame graphs when it's called 50,000 times per second.
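
The retry interaction is easy to sketch. If each layer in a call chain independently makes up to N attempts, the worst-case number of calls hitting the bottom dependency is the product of the per-layer attempts, not the sum. The chain and the attempt counts below are hypothetical.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Worst-case calls reaching the deepest dependency for one user request.

    Each entry is the maximum number of attempts (first try plus retries)
    that one layer makes against the layer below it.
    """
    if any(a < 1 for a in attempts_per_layer):
        raise ValueError("every layer makes at least one attempt")
    return prod(attempts_per_layer)


# Hypothetical chain: client -> gateway -> service -> database,
# with each of the three hops allowed up to 3 attempts.
print(worst_case_calls([3]))        # a single layer retrying on its own: 3
print(worst_case_calls([3, 3, 3]))  # three stacked layers: 27
```

Three attempts looks harmless in any single service's config. Stacked three deep, one failing request can turn into 27 calls against a dependency that is already struggling.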

There's one debugging session that broke my brain a little. Access control issue. Could not figure out where to even look. Turned out we had multiple sources of truth for permissions and they'd drifted apart. The system was just... checking whichever source it hit first. There was no canonical answer to "does this user have access." I had to reconstruct the state of three different systems at a specific timestamp to understand one decision the system had made.

That was when I realized: past a certain scale, you stop debugging code and start debugging emergent behavior. And that's a fundamentally different skill.

So what do you do about it?

I'm not going to tell you to design for massive scale on day one. That's almost always wrong. YAGNI is real. Premature optimization makes systems worse, not better.

But.

Some decisions are genuinely hard to reverse. And you should at least know which ones they are:
● Your data model (migration under load is hell)
● Sync vs. async boundaries (you can't easily untangle these later)
● Consistency vs. availability tradeoffs (distributed systems don't let you change your mind cheaply)
● Authorization architecture (this one always comes back to haunt you)
● Audit and retention strategy (see: logging bill, above)

Get any of these wrong and the rewrite happens under pressure, in production, while users are affected, with half the team arguing about the approach and the other half on PTO. It's never the calm six-month project you pitch to leadership.

Next time I'll write about the one that's cost me the most career stress: data modeling decisions that look totally fine on day one and become load-bearing walls by year three. I have stories.

Genuinely curious -- what's the scaling assumption that burned you worst? The one where you looked at the system and went "oh no, this was baked in from the start"? Drop it in the comments, I collect these like trading cards at this point.