DEV Community

Anusha Mukka


The Illusion of Scale, Part 2: When Your Data Model Becomes Your Bottleneck

I want to talk about the cruelest kind of technical debt. Not the kind where someone wrote bad code, and you can see it. The kind where the code is clean, the tests pass, the results are correct, and you're still screwed.
 
Data model debt.
 
It hides. For months, sometimes years. It doesn't announce itself. It just sits inside perfectly functional code, returning correct results, passing every test. And then one day, you realize everything else is built on top of it, and you cannot move it without moving everything.
 
This is Part 2 of a series about assumptions that quietly break systems at scale.

The customer who broke our schema

A few years into working on a multi-tenant system, we onboarded a large enterprise customer. The system had been running great for over a year at that point. Hundreds of tenants, smooth operations, no major incidents. We were feeling pretty good about ourselves.
 
This customer had fifty million records in a table where our typical tenant had maybe fifty thousand.
 
Same schema. Same queries. Same everything. But queries that ran in 200ms for every other tenant were running in 45 seconds for them.
 
Nobody had designed a bad system. The schema had just quietly encoded a belief: that tenants would be roughly similar in size. That belief had never been written down anywhere. Never tested. Never questioned. It was just... assumed, the way you assume things that have always been true until they suddenly aren't.
 
The fix was conceptually simple -- partition the data, route large tenants differently. The implementation took months, because everything else had been built around that original schema. Every query, every index, every join had opinions about how the data was structured. We ended up running two schemas simultaneously for six weeks to migrate without downtime.
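The routing part of that fix can be sketched in a few lines. This is a hypothetical illustration, not our actual implementation: the threshold, the shard names, and the `tenant_row_counts` job are all made up, but the shape is the point -- size-based routing so one huge tenant's scans stop competing with everyone else's 200ms queries.

```python
# Sketch of size-based tenant routing: small tenants share a default shard,
# large tenants get their own partition. All names and numbers here are
# illustrative.

LARGE_TENANT_THRESHOLD = 1_000_000  # rows; pick this from your own measurements

# In practice this would be populated by a periodic job that counts rows
# per tenant, not hardcoded.
tenant_row_counts = {
    "acme-corp": 50_000_000,
    "typical-co": 50_000,
}

def route_tenant(tenant_id: str) -> str:
    """Return the logical shard this tenant's queries should go to."""
    rows = tenant_row_counts.get(tenant_id, 0)
    if rows >= LARGE_TENANT_THRESHOLD:
        # Large tenants live on a dedicated partition.
        return f"shard-large-{tenant_id}"
    return "shard-shared"

print(route_tenant("acme-corp"))   # shard-large-acme-corp
print(route_tenant("typical-co"))  # shard-shared
```

The hard part, as the story above shows, isn't this function. It's everything already built around the assumption that this function didn't need to exist.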
 
It was the most expensive technical debt I've ever watched get paid off. And I'm including the time someone accidentally dropped a production table (different story, different company, different bottle of wine).

Why "good design" has an expiration date

Here's the thing about data models: they're designed for the use cases the team can see right now. That's almost always the wrong frame, because the use cases that matter at scale are the ones nobody anticipated when the schema was first written.
 
The pattern is incredibly consistent. System starts with a well-normalized schema. Foreign keys everywhere. Third normal form. At moderate load, it's fine. Correct, even. Textbook stuff.
 
Then volume grows. Queries that touched thousands of rows now touch millions. Joins that were fast become table scans. The query planner starts making choices that surprise you, and suddenly you're reading execution plans at midnight -- midnight -- trying to understand why a query that used to take 80ms now takes 12 seconds.
 
Normalization optimizes for write correctness and storage efficiency. Not read performance at volume. When your read load is enormous relative to your write load -- which it is in basically every user-facing system -- those goals pull in opposite directions. You find out which one your schema actually prioritized the hard way. Usually on a Friday.
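You don't have to wait until midnight to see the planner's choices. Here's a minimal way to watch them change, using stdlib `sqlite3` as a stand-in for a real database -- the table and column names are illustrative:

```python
# Watch the query planner's access strategy change when an index appears.
# sqlite3 is a stand-in here; the same exercise works with EXPLAIN on
# Postgres or MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant_id INT, total REAL)")

query = "SELECT SUM(total) FROM orders WHERE tenant_id = ?"

def plan(sql: str) -> str:
    """Return the planner's detail text for a parameterized query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,))
    return " | ".join(row[3] for row in rows)

before = plan(query)
print(before)  # expect a full-table SCAN: fine at 1k rows, fatal at 1B

conn.execute("CREATE INDEX idx_orders_tenant ON orders (tenant_id)")

after = plan(query)
print(after)  # expect a SEARCH using idx_orders_tenant instead
```

Reading plans like this *before* the table is huge is how you find out what your schema actually prioritized, on your own schedule instead of the pager's.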

The cardinality time bomb

Okay, this one's personal because I've made this exact mistake.
 
A permissions table with one row per user-resource pair. Fine when users have tens of permissions. Completely reasonable design. Then fine-grained access becomes a product requirement and users can have thousands of them. Table gets enormous fast.
 
Every permission check is now a large query. Every access decision slows down. And because authorization sits in the critical path of almost everything, a slow permissions table makes the whole system feel sluggish in ways that are incredibly hard to diagnose. You end up chasing phantom performance issues across half the codebase before someone finally traces it all the way back to a table that's just too big to query efficiently anymore.
 
The schema wasn't badly designed. It was designed for a world where users had 10-20 permissions. Then the product team said "actually, we need thousands" and the schema didn't get the memo.
 
When you design a schema, there are two questions: "what cardinality do I expect?" and "what cardinality could this legitimately reach?" They're not the same question. The first one is optimistic. The second one is the one that saves you.
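Making both questions concrete is just arithmetic, which is exactly why it's worth doing before you ship. The numbers below are illustrative, not from the original system:

```python
# The two cardinality questions, made concrete. All numbers are
# illustrative.
users = 100_000

expected_perms_per_user = 20   # "what cardinality do I expect?"
max_perms_per_user = 5_000     # "what could this legitimately reach?"

expected_rows = users * expected_perms_per_user   # a comfortable table
worst_case_rows = users * max_perms_per_user      # a different design problem

print(f"expected:   {expected_rows:,} rows")
print(f"worst case: {worst_case_rows:,} rows")
```

Two million rows and five hundred million rows are not the same table, even if they share a `CREATE TABLE` statement. If the worst case is plausible, design for it now.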

When being correct gets too expensive

 
If producing the accurate answer requires joining five tables and aggregating across millions of rows... correctness has a real cost. A cost you pay on every single request.
 
Your options at that point are denormalization, pre-computation, materialized views, or derived tables. They all work. They all introduce consistency challenges that the normalized schema never had. That's the actual tradeoff, and it's worth naming clearly: not "normalization vs. performance" but "easy to get right" vs. "fast under real load."
 
Choosing consciously is very different from discovering the tradeoff at 3am during an incident. Trust me on this.
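Here's what the pre-computation option looks like in miniature, with the consistency challenge visible. This is a sketch using stdlib `sqlite3`; the table names are made up, and a real system would face the same tradeoff through triggers, materialized view refreshes, or async pipelines:

```python
# A derived counts table kept in step with its source table. The derived
# row must change in the same transaction as the source row -- that
# coupling is the consistency burden the normalized schema never had.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT);
    CREATE TABLE order_counts (customer_id INT PRIMARY KEY, n INT);
""")

def add_order(customer_id: int) -> None:
    with conn:  # both writes commit together, or neither does
        conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
        conn.execute(
            "INSERT INTO order_counts (customer_id, n) VALUES (?, 1) "
            "ON CONFLICT(customer_id) DO UPDATE SET n = n + 1",
            (customer_id,),
        )

for _ in range(3):
    add_order(7)

# Reads are now a single-row lookup instead of a COUNT over millions of rows.
print(conn.execute("SELECT n FROM order_counts WHERE customer_id = 7").fetchone()[0])
```

Every code path that writes `orders` now has to know about `order_counts`. That's the price of the fast read, and it's much cheaper to pay it deliberately than to bolt it on during an incident.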
 

Migrations: where you pay the real price

 
A migration that takes 30 seconds in development can take three weeks in production. Not because the operation changed. Because the table grew from thousands of rows to billions, and suddenly every part of the process has consequences you never thought about.
 
Locking is the first problem. DDL operations on large tables can block reads or writes, even if only briefly. And "briefly" on a hot table cascades into timeouts across the entire system within seconds.
 
Backfill is the second. Writing a new column's default value to a billion rows is a lot of I/O competing directly with live traffic.
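The standard way to tame a backfill is to chunk it: never touch the whole table in one statement, update a bounded batch, commit, repeat, so each lock is short and live traffic can interleave. A minimal sketch with `sqlite3` (batch size and table names are illustrative):

```python
# Batched backfill: one short transaction per chunk instead of one
# billion-row UPDATE. Sizes here are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, region TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("e",)] * 10_000)
conn.commit()

BATCH = 1_000

def backfill_region(default: str) -> int:
    """Fill NULL regions in bounded chunks; return how many batches ran."""
    batches = 0
    while True:
        with conn:  # one short transaction per chunk
            cur = conn.execute(
                "UPDATE events SET region = ? "
                "WHERE id IN (SELECT id FROM events WHERE region IS NULL LIMIT ?)",
                (default, BATCH),
            )
        if cur.rowcount == 0:
            break
        batches += 1
        # In production you'd sleep or rate-limit here to yield I/O
        # back to live traffic.
    return batches

print(backfill_region("us-east"))  # 10 batches of 1,000
```

The loop is boring on purpose. Boring, resumable, and interruptible is exactly what you want running next to live traffic for three weeks.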
 
And then there's the dual-write period -- running old and new schemas simultaneously so you can migrate without downtime. This is the right approach. It's also the approach that reveals every single implicit assumption in your application code. Things you didn't know your code believed about the schema. Fun.
 
It almost always happens under pressure too. Nobody says "let's do a major schema migration" when things are going well. They say it when things are on fire. Plan for it before you're in that situation. You won't, but you should.
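The dual-write period itself often reduces to a thin wrapper: every write goes to both schemas, reads come from the old one until the new one is backfilled and verified, then you flip. This is a toy sketch with dicts standing in for the two schemas -- every name in it is hypothetical:

```python
# The dual-write pattern in miniature. Dicts stand in for the old and new
# schemas; a real version wraps two database connections.

class DualWriter:
    def __init__(self, old_store: dict, new_store: dict):
        self.old = old_store
        self.new = new_store
        self.read_from_new = False  # flip only after backfill + verification

    def write(self, key, value):
        self.old[key] = value
        self.new[key] = value  # divergence here is a bug you want to see NOW

    def read(self, key):
        return (self.new if self.read_from_new else self.old).get(key)

old, new = {}, {}
store = DualWriter(old, new)
store.write("user:1", {"plan": "enterprise"})
print(store.read("user:1"))  # served from the old schema

store.read_from_new = True   # cutover
print(store.read("user:1"))  # same answer, new schema
```

The wrapper is ten lines. Finding every code path that writes *around* the wrapper -- that's the part that reveals what your application secretly believed about the schema.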

What I'd tell my past self

 
Design for your read patterns, not just your write patterns. Know which queries are on your critical path and whether your schema serves them cheaply or with heroics.
 
Write down your cardinality assumptions explicitly before you ship. Explicitly. In a document. "This table is expected to have X rows per tenant. At Y rows, query Z will degrade." If you can't fill in those numbers, then nobody actually knows whether this will hold at scale.
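One way to make "write it down" literal is to keep those numbers as data next to the schema and check them in a scheduled job. A sketch -- the table names and thresholds are illustrative:

```python
# Cardinality assumptions as checkable data instead of tribal knowledge.
# Names and thresholds are illustrative.

CARDINALITY_ASSUMPTIONS = {
    # table: (expected rows, degradation threshold)
    "permissions": (20_000, 500_000),
    "orders": (50_000, 5_000_000),
}

def check(table: str, observed_rows: int) -> str:
    """Compare an observed row count against the documented assumptions."""
    expected, threshold = CARDINALITY_ASSUMPTIONS[table]
    if observed_rows >= threshold:
        return f"{table}: {observed_rows:,} rows -- past degradation threshold, act now"
    if observed_rows > expected * 10:
        return f"{table}: {observed_rows:,} rows -- assumption drifting, revisit the design"
    return f"{table}: {observed_rows:,} rows -- within assumptions"

print(check("permissions", 18_000))
print(check("permissions", 450_000))
print(check("permissions", 900_000))
```

The point isn't the alerting mechanics. It's that the assumption now exists somewhere a machine can compare it against reality, instead of only in the head of whoever wrote the schema.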
 
Separate your operational and analytical models early. The schema optimized for transactional correctness is rarely the schema optimized for reporting. Trying to serve both from one schema is a compromise that satisfies neither at volume.
 
And treat major schema changes as an operational project, not a technical task. They need a plan, a rollback strategy, a communication plan, and ideally someone who has done it before and can warn you about the part you haven't thought of. There's always a part you haven't thought of.

 

 
Next up: why access control is one of the most quietly expensive places for schema assumptions to go wrong at scale. Spoiler: 15 roles became 340.
 
What data model decision have you had to undo the hard way? I want the painful stories. The "we ran two schemas for six weeks" stories. The more awful, the more I want to hear it.
