Christie Cosky

Posted on Jun 9 • Originally published at christiecosky.com

Understandable Systems Generate Evidence: How structure helps developers change code with justified confidence

#architecture #readability #maintainability #softwareengineering

(The following example is fictionalized.)

A notification template feature shipped six months ago. It let each tenant customize the messages sent to their own customers without requiring a back-end change every time the wording changed.

The code reviewer could tell the design was hard to follow, especially the path from template to rendered value. But "this is hard to follow" is difficult to turn into a concrete objection when the feature works, the tests pass, and nothing is obviously unsafe or wrong. The design risk was real, but there wasn't an obvious bug to point to.

QA signed off, and the feature went into production.

Then a bug report came in: one customer had received a notification containing another customer's information. Somewhere in the notification pipeline, the system was leaking PII.

At first, the fix sounded small: make sure notifications only render data belonging to the intended recipient.

Then the assigned developer, who wasn't the original author, started looking for the place to make the fix.

The templates were stored in the database. There were six template types, and each one resolved its placeholders to real values in a different part of the codebase. Some values came from customer-facing records, some from internal workflow state, and some from template-specific logic. The placeholder-to-value mapping lived somewhere else. Email and SMS shared part of the rendering path, but not all of it.

Before the developer could decide where to fix the leak, they had to answer a more specific set of questions:

Which placeholder rendered the wrong value?
Where did that value come from?
Which template types could use that placeholder?
Did email and SMS resolve it the same way?
What evidence would show that the leak was fully contained?

The system was hard to change because it made the behavior hard to understand.

What the developer needed was not just "clean code." They needed trustworthy signals they could use as evidence to answer harder questions: where the behavior lived, which paths shared it, what data was allowed to flow through it, and when their search was complete enough to make a safe change.

In a situation like this, designing code that runs is not enough. The system also has to preserve enough evidence for the next developer to understand and change it safely.

Good Design Generates Evidence

The rendering bug exposed a deeper problem: the system did not make the notification path traceable enough to change safely. The developer needed to understand how a template became a rendered message before they could decide where a safe fix belonged.

That happens a lot in software development. Often the hard part isn't the edit itself. It's understanding enough of the surrounding behavior to know where the change belongs and whether it is safe.

To build that understanding, the developer looked for signals in the codebase that they could trust.

A package name can give the developer a credible place to start. A class name can signal what kind of logic belongs inside. A shared enum can show which concepts are related. An explicit dependency can make side effects easier to see. A test can name behavior the system promises to preserve.

But those signals are not automatically evidence. A package name can be a catch-all. A class can drift from its original purpose. A test name can describe one case while actually testing another.

A signal becomes evidence when the code keeps the promise the signal makes. That's when the developer can reason from it.

Good design doesn't just make code cleaner. It makes the system's signals trustworthy enough for the developer to reason about where behavior lives, what belongs together, which paths are unlikely to matter, and how far a change might travel.

The signals a developer relies on change over the course of an investigation. Early on, they need help locating behavior and parsing logic. Later, they need help tracing consequences and deciding whether the remaining risk is small enough to stop searching.

Understandability Has Layers

Understandability has five layers: perception, local reasoning, navigation, propagation, and stoppability. Developers do not move through them in a neat sequence. They may start by searching, open a file, realize it is the wrong place, trace a dependency, and search again. But each layer describes a different kind of signal the system provides as the developer builds enough understanding to decide what can be trusted.

Perception: Can I Parse What I'm Looking At?

Before the developer can reason about the notification leak, they have to parse the code in front of them. Shape, grouping, indentation, spacing, and hierarchy show them what belongs together and where one idea ends.

If the rendering code is a dense block of conditionals, placeholder substitutions, and channel-specific branches, the developer has to reconstruct its shape before they can investigate the leak.

When this layer fails, the developer spends attention on parsing before they can spend it on analysis.

Local Reasoning: Can I Hold the Relevant Ideas in Mind?

Local reasoning is what happens after the developer has found a piece of code and starts asking whether they can understand what that unit is responsible for.

A method named renderTemplate sounds narrow. It makes a claim about what kind of reasoning belongs there.

But if the method implements template loading, recipient lookup, tenant rules, placeholder resolution, channel formatting, missing-value behavior, and skip conditions inline, the body does not keep that promise. Each step may be necessary somewhere in the rendering workflow, but they don't belong at the same level of detail.

Good boundaries reduce cognitive load because they tell the reader what kind of reasoning belongs inside them. Rendering, recipient lookup, tenant rules, channel formatting, and delivery decisions require different mental frames. When one method mixes those frames at the implementation level, the reader has to hold too many mental models at once.
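As a minimal sketch of that idea, the Python below keeps render_template (a snake_case counterpart of renderTemplate) as a short sequence of named steps. Everything in it is hypothetical: the helper names, the in-memory TEMPLATES and RECIPIENTS dictionaries standing in for the database, and the 160-character SMS cap. Tenant rules, missing-value behavior, and skip conditions are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipient:
    tenant_id: str
    name: str


# Hypothetical in-memory stand-ins for the template table and customer records.
TEMPLATES = {"order_shipped": "Hi {customer_name}, your order has shipped."}
RECIPIENTS = {"r-1": Recipient(tenant_id="tenant-a", name="Dana")}


def load_template(template_type: str) -> str:
    """Template loading only: no recipient or channel knowledge."""
    return TEMPLATES[template_type]


def find_recipient(recipient_id: str) -> Recipient:
    """Recipient lookup only."""
    return RECIPIENTS[recipient_id]


def resolve_placeholders(recipient: Recipient) -> dict[str, str]:
    """Placeholder resolution only: every value comes from the recipient."""
    return {"customer_name": recipient.name}


def format_for_channel(text: str, channel: str) -> str:
    """Channel formatting only. The 160-character SMS cap is an assumption."""
    return text[:160] if channel == "sms" else text


def render_template(template_type: str, recipient_id: str, channel: str) -> str:
    """Reads as a sequence of named steps; the details live one level down."""
    template = load_template(template_type)
    recipient = find_recipient(recipient_id)
    values = resolve_placeholders(recipient)
    return format_for_channel(template.format(**values), channel)
```

A reader who opens render_template sees the workflow first and can choose which step to descend into. Each helper then needs only one mental frame at a time.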

When this layer fails, reading turns into mental juggling.

Navigation: Can I Find the Right Place to Start?

In the notification bug, the developer could not start from "the rendering workflow" because template storage, placeholder mapping, data lookup, and channel-specific behavior were scattered across the system.

A navigable system makes the first plausible place easier to find, and it makes unrelated places easier to rule out. A notifications/templates package is a signal. It doesn't prove the leak is there, but it gives the developer a more credible starting point than TenantController, CustomerService, or MessageUtils.

Good design doesn't just help developers find the right places to change; it also helps them rule out the wrong ones sooner.

When this layer fails, too many places stay plausible for too long.

Propagation: Can I Trace Where This Change Can Travel?

Once the developer finds one place where the wrong data enters a template, the next question is where that value can travel. Does the same placeholder appear in both notification channels? Does a fix in the placeholder renderer affect both channels, or only one of them?

Good design makes propagation visible and bounded. If email and SMS both pass through the same placeholder renderer, the developer can see that a fix in the shared path may affect both channels. If they use separate channel-specific formatters, the developer can see that each path may need to be checked separately. The structure does not decide the fix for them, but it shows where the change can travel.
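The shared-path case might look like the following sketch, where both channels call one placeholder function, so a guard added there is visibly in the path of both. The Customer type, the function names, and the tenant-based ownership check are assumptions made for illustration, not the fictional system's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    tenant_id: str
    name: str


class CrossTenantValueError(Exception):
    """Raised when a value does not belong to the recipient's tenant."""


def resolve_customer_name(recipient: Customer, source: Customer) -> str:
    """Shared placeholder path: both channel renderers call this function."""
    if source.tenant_id != recipient.tenant_id:
        raise CrossTenantValueError("customer_name crosses a tenant boundary")
    return source.name


def render_email(recipient: Customer, source: Customer) -> str:
    # Email-specific formatting wraps the shared resolver.
    return f"<p>Hello {resolve_customer_name(recipient, source)},</p>"


def render_sms(recipient: Customer, source: Customer) -> str:
    # SMS-specific formatting wraps the same shared resolver.
    return f"Hello {resolve_customer_name(recipient, source)}"
```

Because the call to resolve_customer_name is explicit in both renderers, a search for its callers shows where the guard applies. If each channel resolved the value on its own, the same search would show two places to check instead of one.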

Hidden listeners, framework hooks, database triggers, and distant side effects may be necessary, but they make the change path harder to see.

When this layer fails, every edit carries an unknown blast radius.

Stoppability: Can I Know When I've Seen Enough?

Even after the developer understands where the data can travel, one question remains: have they checked enough to make the change safely?

Stoppability is the point at which enough trustworthy signals have accumulated to justify stopping the search. Clear names, bounded responsibilities, visible dependencies, consistent patterns, and trustworthy tests combine into a confidence threshold. When those signals point in the same direction, stopping becomes justified. When they conflict, continued searching is rational.

When this layer fails, small tasks become open-ended investigations.

Confidence Comes When Signals Agree

One trustworthy signal rarely carries the whole investigation.

A notifications/templates package may suggest where template behavior belongs, but it is not enough by itself. The developer still needs to know whether placeholder names are defined consistently, whether template types share the same rendering path, whether email and SMS share the risky behavior, and whether the tests cover the cross-tenant rule.
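One way to make the first of those questions answerable is a single registry of placeholder names that templates can be checked against. The sketch below is an assumption-laden illustration: the Placeholder enum, its two members, and the {name} placeholder syntax are all hypothetical.

```python
import re
from enum import Enum


class Placeholder(Enum):
    """Hypothetical single source of truth for placeholder names."""

    CUSTOMER_NAME = "customer_name"
    ORDER_NUMBER = "order_number"


# Assumes placeholders are written as {name} in the stored templates.
PLACEHOLDER_PATTERN = re.compile(r"\{(\w+)\}")


def unknown_placeholders(template: str) -> set[str]:
    """Return names used in the template that the registry does not define."""
    known = {p.value for p in Placeholder}
    return set(PLACEHOLDER_PATTERN.findall(template)) - known
```

With a registry like this, a name that only exists as a magic string shows up as a concrete finding, for example unknown_placeholders("Hi {custmer_name}") reports the misspelled name.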

Confidence comes when those answers point in the same direction. When they do, the developer can start in the most plausible place, dismiss unrelated paths sooner, trace likely side effects, and stop when the remaining risk is small enough to act.

When the signals conflict, confidence becomes harder to justify. The package suggests one home for the behavior, but a relevant-looking class sits somewhere else, and several template types only appear as magic strings outside the package. The test names claim to cover cross-tenancy, but the assertions prove something else. The system may still run, but the developer doesn't have a coherent interpretation they can trust.

That is the difference between local readability and system understandability. Readable code is easier to parse. Understandable systems produce enough trustworthy evidence for developers to make justified decisions under uncertainty.

Understandability Depends on Trustworthy Signals

Time is the stress test for all of these layers.

The notification feature shipped six months earlier. The original author has moved on. The design tradeoffs and assumptions behind the implementation are no longer fresh in anyone's mind. The next developer has to reconstruct the behavior from the signals the system still provides.

Some signals are weak from the start. Entity-based structures like TenantController and CustomerService may look organized, but they often give poor evidence about where feature behavior lives.

Other signals lose trust over time. Formatting drift makes structure harder to recognize. Naming drift makes search less trustworthy. Boundaries weaken when classes absorb behavior their names no longer explain. Hidden dependencies make change paths harder to predict. One-off exceptions make it harder to know whether the behavior you found represents the rule or just another special case.

The drift usually happens through changes that look harmless in isolation. One duplicated placeholder mapping. One special case. One class with a name broad enough to hold anything. Each one makes the system a little less trustworthy as a guide for the next developer.

That is how "hard to follow" becomes operational risk. Sometimes the structure was misleading from the beginning. Sometimes it eroded slowly. Either way, the system doesn't provide reliable signals about its own behavior, and the next developer making a change is more likely to miss a path, misunderstand the logic, or introduce a bug.

Understandability has to be preserved over time. When it's not, confidence gets harder to justify.

AI Raises the Stakes for Trustworthy Evidence

AI-generated code can accelerate that drift. It can generate plausible signals very quickly, but plausible signals are not the same as trustworthy evidence. A generated class name can make a precise-sounding claim without preserving a real boundary. A generated test can name an important behavior without proving it.

Reviewing AI-generated code means checking more than correctness. It means checking whether the names, boundaries, and tests are telling the truth.

If AI adds a TenantScopedTemplateRenderer and a test named whenTenantDataCrossesBoundary_rejectsValue, those names make claims about where behavior lives and what behavior is protected. The reviewer has to decide whether the code actually keeps those promises. The class boundary may keep tenant-scoped rendering separate from recipient lookup and channel formatting, or it may only rename a workflow whose responsibilities are still mixed together. The test may prove the cross-tenant rendering rule, or it may be checking something else entirely.
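For contrast, here is a sketch of what keeping those promises could look like. The class and test names come from the example above, with the test name adapted to Python's snake_case. The bodies, the Value type, and the CrossTenantValueError exception are hypothetical.

```python
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    tenant_id: str
    text: str


class CrossTenantValueError(Exception):
    """Raised when a value belongs to a different tenant than the renderer."""


class TenantScopedTemplateRenderer:
    """Hypothetical body: renders only values owned by its tenant."""

    def __init__(self, tenant_id: str) -> None:
        self.tenant_id = tenant_id

    def render(self, template: str, values: dict[str, Value]) -> str:
        # Reject the whole render if any value crosses the tenant boundary.
        for name, value in values.items():
            if value.tenant_id != self.tenant_id:
                raise CrossTenantValueError(name)
        return template.format(**{k: v.text for k, v in values.items()})


class TenantScopedTemplateRendererTest(unittest.TestCase):
    def test_when_tenant_data_crosses_boundary_rejects_value(self) -> None:
        renderer = TenantScopedTemplateRenderer("tenant-a")
        foreign = {"customer_name": Value("tenant-b", "Lee")}
        # The assertion exercises the exact rule the test name claims.
        with self.assertRaises(CrossTenantValueError):
            renderer.render("Hi {customer_name}", foreign)

    def test_renders_values_from_own_tenant(self) -> None:
        renderer = TenantScopedTemplateRenderer("tenant-a")
        own = {"customer_name": Value("tenant-a", "Dana")}
        self.assertEqual(renderer.render("Hi {customer_name}", own), "Hi Dana")
```

The review question is whether the first test would fail if the tenant check were removed. If it would still pass, the name is a signal the code does not back up.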

Reviewers of AI-generated code should not ask only whether the new structure looks cleaner. They should ask whether it gives the next developer better evidence to reason from.

If the names, boundaries, and tests preserve trustworthy evidence, AI can support understandability. If they only create the appearance of structure, humans still have more code to verify and reinterpret before they can trust it.

How This Changes Code Review

Understanding what makes a system understandable gives code reviewers better language. Instead of stopping at "This is hard to follow," they can point out which kind of evidence the design is failing to provide:

Perception: Is the code shaped clearly enough to parse?
Local reasoning: Does each boundary contain one kind of reasoning?
Navigation: Can a future developer find where this behavior lives?
Propagation: Are the paths and side effects visible enough to trace?
Stoppability: Can a developer tell when they have checked enough?
Signal agreement: Do the names, boundaries, dependencies, and tests point to the same interpretation?

These questions turn a vague discomfort into something concrete. They do not guarantee a perfect design, but they give reviewers better ways to talk about risk before the risk turns into a bug.

Systems That Remain Understandable

The notification bug is one example of the larger design problem underneath this series.

The path from template to placeholder to data source to recipient was not visible enough for the next developer to reason about quickly under pressure. When something went wrong, the system did not preserve enough evidence for the next person to trace what had happened.

Formatting, naming, boundaries, cohesion, navigability, and stoppability can look like separate concerns, but they all contribute to the same larger goal. Each one either strengthens or weakens the system as a source of evidence for the next developer.

Code has to execute correctly. Long-lived systems also have to preserve evidence for the people who will change them later, after memory has faded, authors have moved on, and the original context is gone.

A system is sustainable when it can keep growing without erasing the evidence future developers need to change it safely.

AI was my editor, but these ideas are my own.

This article is part of a broader series exploring how code structure, navigability, and cohesion align with cognitive limits.

If you're interested in the deeper dive, the full series is here:
Designing Code for Human Brains

Top comments (4)

Adam Lewis • Jun 11

Really enjoyed this one, Christie. The point about a signal only becoming evidence once the structure around it lets you trust where it came from is the one that stayed with me. It's also the argument for the folders saying what the system does rather than how it's layered, the screaming-architecture idea, so the place a reader orients themselves is the same place the evidence lives. I've found that matters even more with an agent writing some of the code: it reads the structure cold every session and takes whatever orientation the folders give it. (Restoring this comment; I deleted it by accident.)
