Three Questions I Ask Every System. Most Design Reviews Skip All Three.

The design doc is fourteen pages. Clean service boundaries, thoughtful API contracts, a deployment story that handles rollback without incident. Six months of work. The team is proud of it, and the work is genuinely good.

Three questions will tell you more about this system than all fourteen pages.

Not questions about implementation details or technology choices. Questions about how the system was designed to age, to fail, and to be understood by someone who didn’t build it. Most architecture conversations never reach these questions, which is part of why the gap between a good system and a great one is often invisible until it is suddenly very visible.

What does this make hard?

Good architecture conversations focus on what a design enables: faster deployments, independent scaling, and clearer ownership. The question that separates architectural thinking from implementation thinking is the inverse. What does this make hard?

Every decision forecloses options. The service boundary that gives teams autonomy makes cross-service transactions expensive. The data model that reads cleanly under expected load makes certain write patterns awkward. The abstraction that simplifies onboarding makes some categories of refactoring hard to even see as options. These are not arguments against the decisions. They are the other side of every decision, and that other side exists whether or not anyone names it.

The design doc that holds up over time is not the one where nothing is difficult. It is the one where the difficult things are named. “This approach makes distributed transactions impossible, and here is why we have decided to accept that constraint.” That sentence, or something close to it, belongs in every significant architecture document. Its absence is not a sign that the constraint wasn’t considered. Often it was. But without the name, the constraint becomes invisible to everyone who wasn’t in the room, which eventually includes the original team.
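
One way to keep a named constraint from getting lost is to store it as structured data alongside the design doc. What follows is a minimal sketch, not a standard format: the `DecisionRecord` type, its field names, the `unnamed_costs` helper, and the example values are all hypothetical. It shows the idea of recording what a decision makes hard next to what it enables, and flagging records where only the benefits were written down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical record of one architectural decision."""
    title: str
    enables: tuple[str, ...]
    makes_hard: tuple[str, ...]  # the options this decision forecloses
    accepted_because: str        # why the team accepts that cost


def unnamed_costs(records):
    """Return titles of decisions that list benefits but no foreclosed options."""
    return [r.title for r in records if r.enables and not r.makes_hard]


# Made-up example values, for illustration only.
records = [
    DecisionRecord(
        title="Split orders and billing into separate services",
        enables=("team autonomy", "independent scaling"),
        makes_hard=("cross-service transactions",),
        accepted_because="Consistency is handled by reconciliation instead.",
    ),
    DecisionRecord(
        title="Adopt a shared onboarding abstraction",
        enables=("simpler onboarding",),
        makes_hard=(),  # nothing named yet: this is the gap to notice
        accepted_because="",
    ),
]

print(unnamed_costs(records))
```

An empty `makes_hard` does not mean the decision is free. It means the other side of the decision has not been named yet.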

When joining a system someone else built, this question becomes the fastest diagnostic available. The parts that resist change are usually the parts that were never named as decisions. They were treated as implementation details, evaluated only against current constraints, never against the options they were foreclosing. Now someone is paying the cost of those closed options, with no record of the deliberation that led there.

The gap between naming the hard things and discovering them later is roughly the gap between a team that makes decisions and a team that finds out, eventually, what decisions were made for them.

The Architecture of Decisions series builds the thinking behind this question across six parts, from reversibility to what happens when context shifts around a decision that was once clearly right.

What happened the last time this changed?

In a production incident, the most useful question in the first five minutes is rarely “what broke?” The more useful question is: what changed? Deployments, configuration drift, upstream behavioral shifts, schema migrations. The answer is almost always that something changed, and the investigation proceeds from there.

Systems that can answer this question quickly have a structural advantage. Not only operationally but architecturally. The ability to reconstruct the sequence of significant state changes, to query a coherent record of what the system decided to do and when, is a design property, not a monitoring afterthought. Teams that keep events as first-class artifacts are building systems that cooperate with investigation rather than resisting it. This is not the same as logging. Logging records what happened inside a single service. A well-designed event record tells you what the system, as a whole, decided to remember about itself.
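
As a rough illustration of the difference, here is a minimal sketch of a system-wide change record that can answer "what changed?" for a given incident time. Everything in it is an assumption made for the example: the `ChangeEvent` and `ChangeLog` names, the fields, the in-memory list standing in for durable storage, and the sample events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record of one significant change to the system."""
    at: datetime
    kind: str      # e.g. "deploy", "config", "schema", "upstream"
    service: str
    summary: str


class ChangeLog:
    """Append-only record of changes across all services (in-memory sketch)."""

    def __init__(self):
        self._events = []

    def record(self, event):
        self._events.append(event)

    def changed_before(self, incident_at, window):
        """Events within `window` before `incident_at`, newest first."""
        start = incident_at - window
        hits = [e for e in self._events if start <= e.at <= incident_at]
        return sorted(hits, key=lambda e: e.at, reverse=True)


# Made-up sample events, for illustration only.
log = ChangeLog()
log.record(ChangeEvent(datetime(2024, 5, 1, 9, 0), "schema", "orders", "add nullable column"))
log.record(ChangeEvent(datetime(2024, 5, 1, 13, 40), "config", "gateway", "lower timeout"))
log.record(ChangeEvent(datetime(2024, 5, 1, 13, 55), "deploy", "billing", "release build"))

incident = datetime(2024, 5, 1, 14, 0)
recent = log.changed_before(incident, timedelta(hours=1))
for e in recent:
    print(e.at.isoformat(), e.kind, e.service, e.summary)
```

The point of the sketch is the shape of the query, not the storage: one place that spans services and can be asked about a time window, rather than per-service logs that have to be stitched together during the incident.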

The same question applies to inherited systems, though the answer looks different. An inherited codebase is a fossil record. The odd dependency bump happened for a reason. The data model shaped strangely around one field has a story. The abstraction that seems overcomplicated for the current load was probably right at the time it was written. Whether the story is recoverable depends almost entirely on whether anyone bothered to record it.

Engineers who inherit systems well are not faster code readers. They are better at reading history: tracing what changed, what the artifacts reveal about the optimization pressures that were present at the time, and which decisions are still actively load-bearing. The answer to “what happened the last time this changed?” is, in most systems, either immediately available or effectively lost. There is not much middle ground.

When a system cannot answer this question under pressure, that is a design choice too. The history exists. It is just in old Slack threads, half-remembered postmortems, and the tacit knowledge of the two people who haven’t left yet.

The Systems You Inherit series covers how to read inherited codebases as decision archaeology.

What problem aren’t we asking about?

Every design review is a focused conversation. The team is solving a defined problem, and the discussion stays inside that definition. This is appropriate. Scope discipline is real and valuable.

The third question lives outside the frame of the defined problem, which is what makes it uncomfortable to raise in a productive design review, and what makes it the one most likely to matter later.

The load pattern dismissed as an edge case, because the current scale makes it irrelevant. The regulatory consideration no one flagged as applicable. The adjacent team building something that will need to integrate with this service in eight months, whose requirements are nowhere in the current design brief. The failure mode that lives not inside any single service but in the handoff between services, which means it belongs to nobody and therefore to everybody eventually.

This question is not a checklist item. A checklist version exists, and it is better than nothing, but it misses the point. What asking this question regularly builds is a habit, not a procedure: a tendency to notice what is outside the current frame before committing to something inside it. The engineers who ask it are not performing thoroughness. They have learned, usually from enough incidents where the problem turned out to be the one nobody mentioned, to distrust the first read of a situation.

The systems designed by teams with this habit tend to show it in their documentation. Explicit non-goals. Named failure modes. Records of what the architects knew they didn’t know at the time. Not a guarantee against surprises. A record that surprises were anticipated, even when the specific ones weren’t.

This is also the hardest question to ask well, and I want to be honest about that. There is a version of it that becomes paralytic generalization, “but what if everything is wrong,” which is not useful. The productive version is more specific: what is present in the environment around this system that is not in the current design brief, and does it have a claim on the design? That question has a workable scope. Most of the time the answer is nothing we haven’t considered. When it isn’t, that is the conversation worth having before shipping.

The Senior Engineer’s Toolkit series covers the shift from “what should I do here?” to “what am I not seeing?”, including why the second question is harder than it looks.

Three questions. Not a rubric, not a certification for great systems. More like a minimum viable diagnostic: a quick orientation when joining an unfamiliar system, walking into an architecture review cold, or trying to understand why a system that looks solid keeps surprising its operators.

The systems that age well do not tend to have been built by smarter or more experienced teams. They tend to have been built by teams that asked these questions before they shipped, and designed accordingly: making the hard things visible, keeping the record of what the system knows about itself, and leaving room for the problems they couldn’t yet see.

Good systems solve the problem they were given. Great systems were designed by people who kept asking the questions the problem statement didn’t include.

The distance between those two tends not to show in the architecture diagrams. It usually shows up in the first postmortem. Though the postmortems still happen. Just with better documentation.