The Illusion of Scale, Part 5: The System That Outlives the Team

#sustainability #government #systemsthatlast #security

A few years ago I built an electronic search warrant system for a state law enforcement agency. Paper process, courthouse logistics, hours of waiting -- we turned it into two minutes. Designed it, built it, deployed it, handed it off.

That system is still running. Eight years later. Extended to handle warrant types that didn't exist when we built it. The original team moved on years ago(most of them retired). The requirements document is probably in a folder nobody opens anymore.

Still works.

I've thought about this a lot. Why did that system survive when so many others didn't? I've built things that were arguably more technically interesting that got rewritten within two years(in move fast and break things environment). What was different?

This is the final post in a series about assumptions that quietly break systems at scale. But this one's about something different: what makes a system last.

The second write (or: the time I broke production by being too smart)

Before I talk about what we did right, I want to talk about something I got wrong somewhere else. Because I think the mistake is more instructive.

I inherited a codebase that had a pattern that looked like a bug. A service was writing to two tables in a way that appeared redundant. The second write looked like a copy of data that already existed elsewhere. Dead code, clearly. Technical debt from someone who didn't clean up after themselves.

I removed it. Tests passed. I moved on feeling like I'd done something useful. Feeling productive.

A week later we had a data consistency incident in exactly the scenario the second write had been protecting against. No documentation explaining it. No comment in the code. No ADR anywhere. The engineer who'd written it had left the company.

The second write was load-bearing and completely invisible. And I'd ripped it out because I was clever enough to see it was "redundant" but not wise enough to wonder why it was there.

The code told me what the system did. It said nothing about why it did it that way. At scale, the gap between those two things is where SEVs live.

The question that changes how you build

Most engineering teams design systems for themselves. The decisions about structure, naming, and abstraction are implicitly optimized for the team that currently exists, because that's the team building it and living with it right now.

The problem is: that team won't be there forever. People leave. Priorities shift. The team that inherits your system doesn't have your context, can't ask you questions, and has to reconstruct your intent from whatever you left behind.

Usually that's not much.

Systems that survive long enough to matter ask a different question during design: not "does this make sense to us?" but "will this make sense to someone who wasn't here?"

That one shift changes specific things. How you name things. How much you externalize versus bake in. Whether your runbooks are written for the team that built the system or for an engineer encountering it cold at 2am with an incident open and nobody to call.

Configurable beats clever (and it's not even close)

The search warrant system was designed to be configurable almost to a fault. New document type? Configuration, not code. New approval chain? Configuration. New workflow state? Configuration. We went kind of overboard with it, honestly.

The argument wasn't technical sophistication. It was a simple belief: requirements will keep changing and we won't always be there to change the code.

Eight years later and no rewrite. That belief held.

Here's the thing about clever code: it achieves things concisely in ways that are satisfying to write. I enjoy writing clever code. It makes me feel smart. But it's also hard to understand six months later when you've forgotten the context, and nearly impossible for someone who was never there at all.

Configurable code is boring to write. It's verbose. It's repetitive sometimes. It's what you find in systems that are still running years after the team that built them moved on.

The team that inherits a clever system has to reverse-engineer intent. The team that inherits a configurable system can mostly just read it. I know which one I'd rather inherit at 2am.

The documentation that actually survives

Engineers intend to write documentation. The road to technical hell is paved with good documentation intentions. It either doesn't happen, or it gets written once and reflects the system as designed rather than the system as deployed. Six months later it's wrong and nobody updates it because they're not sure what the current accurate version should say.

The documentation that actually survives and stays useful isn't a comprehensive spec. It's the decision log.

What problems did the original team hit? What did they consider and reject? Why did they make the choices they made? What tradeoffs did they accept on purpose?

This information lives in people's heads. When those people leave, it leaves with them. The inheriting team doesn't know why the system works the way it does. They only know that it does. And they will "fix" things that were intentional and reintroduce bugs that were already solved, usually at the worst possible time.

An architecture decision record doesn't have to be formal. A short document: the context, the options considered, the decision made, the tradeoffs accepted. Written at the time the decision was made, when context is fresh. Stored somewhere the next team will actually find it.

One ADR prevented two separate teams from making the same mistake on one of my systems over a span of three years. That one document was worth more than any other single artifact in the codebase. And if I'd written one for the system with the second write, a week of incidents could have been a five-minute read.

Observability is a gift to your future strangers

When something goes wrong in a system you built, you know where to look. You know which metrics matter, which log lines carry signal, which alerts mean something real versus which ones you can ignore.

When something goes wrong in a system you inherited, you start from zero. Every metric might matter. Every log line might be the one. You have no instinct for it yet.

Systems that outlive their teams are systems that explain themselves. Meaningful metrics with names that tell you what they mean. Structured logs with enough context to reconstruct what happened. Trace IDs that follow a request through every component it touches.

These aren't features. They're the difference between a system that can be operated by someone new and one that can only be operated by the person who built it. Which is fine until that person is no longer there. And that person will always, eventually, no longer be there.

What this series has really been about

Each post in this series has been about a different version of the same problem: an assumption that was harmless when the system was small and expensive when it grew. The data model that encoded a belief about cardinality. The permission model that assumed roles would stay simple. The latency budget that was validated in isolation and wrong under real load.

The systems that survive are not usually the most architecturally impressive ones. They're the ones where someone spent time on the things that don't show up in demos. Configuration over hardcoding. Decision logs over implicit knowledge. Observability over optimism. Simplicity over cleverness, everywhere it was possible to choose.

The search warrant system is still running because the team that built it made boring decisions on purpose. Nothing in it is clever. Everything in it is as simple as we could make it while still meeting the requirements. We spent complexity only where we genuinely had to.

That's the thing hyperscale teaches you, eventually: the goal is not to build impressive systems. The goal is to build systems that keep working when you're not there anymore.

That's a wrap on "The Illusion of Scale" five posts, five assumptions, one intent. Thank you for following along!

Happy building!