DEV Community: Ian Johnson

Custom behavior without custom code

Ian Johnson — Mon, 18 May 2026 16:36:06 +0000

Every successful SaaS product eventually meets the same question: a customer asks for something specific to them, you build it, and now you have a feature in your codebase that's only meant to run for one tenant. A year later, you have a dozen of these. The codebase has if-statements checking tenant IDs, the test suite mocks out customer-specific paths, and the senior engineer who knows which branch belongs to which customer is the only person who can refactor anything.

There's a better shape, and it doesn't require giving up the per-customer customization. It does require separating, cleanly and firmly, the code that defines what behaviors are possible from the data that selects and parameterizes them. This article is about how to do that, where to store the data, and the security cliff you'll fall off if you let the data become code.

What not to do

A handful of approaches show up over and over, and each has a fatal flaw:

Separate deployed instances per customer. This solves customization by forking the operational surface. Now you have N versions of the database, N sets of background jobs, N deploy pipelines, N versions of every bug fix to roll out. It works for two or three customers and collapses by ten.
Conditional code in the backend — if tenant_id == "acme": .... Cheap on day one, untenable by month six. Every developer has to know the customer landscape to make changes safely. Every refactor is risky in proportion to how many tenants have branches. Customer-specific logic spreads across the codebase by capillary action.
Code injected at build time. A configuration that produces a different binary per tenant. Has the same operational cost as separate instances, plus the added joy of debugging behavior that depends on what compile-time flag was set. Don't.

The pattern that scales is to keep one codebase, one running cluster, one deploy pipeline — and to let per-tenant behavior live in data that the code consults. Basically, I am describing multi-tenancy.

Code defines the possibilities; data selects among them

Identify the points in your system where behavior can vary per tenant. These are extension points: the discount engine, the approval workflow, the export format, the notification rules. At each one, your code defines a small set of behaviors it knows how to perform. Per-tenant data picks which behaviors to use and supplies the parameters.

Concretely: a class hierarchy. A common shape is a CustomRule base class with a contract (say, applies(context) and apply(context)) and a set of concrete implementations:

class CustomRule:
    def applies(self, context) -> bool: ...
    def apply(self, context) -> None: ...

class PercentageDiscountRule(CustomRule):
    def __init__(self, percent, min_order):
        self.percent = percent
        self.min_order = min_order

    def applies(self, context):
        return context.order_total >= self.min_order

    def apply(self, context):
        context.discount += context.order_total * (self.percent / 100)

class FirstPurchaseDiscountRule(CustomRule):
    def __init__(self, amount):
        self.amount = amount

    def applies(self, context):
        return context.customer.order_count == 0

    def apply(self, context):
        context.discount += self.amount

A tenant's configuration is then a small declarative description — which rules they have, with what parameters:

{
  "discount_rules": [
    {"type": "percentage", "percent": 10, "min_order": 100},
    {"type": "first_purchase", "amount": 5}
  ]
}

At runtime, you load the tenant's config, hydrate it into instances of the right rule classes, and run them. The code knows how to perform every behavior; the data says which behaviors to apply, in what order, with what parameters. To add a new kind of rule, you add a new class. To add a new tenant configuration, you change data — no deploy, no migration, no engineering.
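The hydration step can be sketched as a small registry that maps each "type" string to a rule class. This is a minimal sketch under assumptions, not a reference implementation: the RULE_TYPES registry, the hydrate_rules and run_rules helpers, and the Context dataclass are hypothetical names introduced here for illustration.

```python
import json
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Context:
    # Hypothetical context; field names follow the rule classes above.
    order_total: float
    customer: SimpleNamespace
    discount: float = 0.0


class PercentageDiscountRule:
    def __init__(self, percent, min_order):
        self.percent = percent
        self.min_order = min_order

    def applies(self, context):
        return context.order_total >= self.min_order

    def apply(self, context):
        context.discount += context.order_total * (self.percent / 100)


class FirstPurchaseDiscountRule:
    def __init__(self, amount):
        self.amount = amount

    def applies(self, context):
        return context.customer.order_count == 0

    def apply(self, context):
        context.discount += self.amount


# Hypothetical registry: the only place a "type" string becomes a class.
RULE_TYPES = {
    "percentage": PercentageDiscountRule,
    "first_purchase": FirstPurchaseDiscountRule,
}


def hydrate_rules(config):
    """Turn the declarative config into rule instances, in config order."""
    rules = []
    for entry in config.get("discount_rules", []):
        params = dict(entry)  # copy so the loaded config is not mutated
        rule_type = params.pop("type")
        rule_class = RULE_TYPES.get(rule_type)
        if rule_class is None:
            raise ValueError(f"unknown rule type: {rule_type!r}")
        rules.append(rule_class(**params))
    return rules


def run_rules(rules, context):
    """Apply every rule whose condition holds for this context."""
    for rule in rules:
        if rule.applies(context):
            rule.apply(context)
    return context


raw = """{"discount_rules": [
  {"type": "percentage", "percent": 10, "min_order": 100},
  {"type": "first_purchase", "amount": 5}
]}"""
config = json.loads(raw)
context = Context(order_total=200.0, customer=SimpleNamespace(order_count=0))
run_rules(hydrate_rules(config), context)
print(context.discount)  # 25.0 (20.0 from the percentage rule plus 5)
```

An unknown "type" fails loudly here rather than being skipped. That is a design choice, and a misconfigured tenant could just as reasonably be handled by logging and ignoring the entry.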

Notice that the apply methods mutate the incoming context. If you prefer not to mutate, have each rule return its contribution and let the caller combine the returned values; a reasonable name for that method is result. Which one you choose comes down to whether you prefer mutable or immutable data. In the context of a web app, you usually do want mutability (for example, encoding and decoding a value from the database to a particular meaning for a tenant). If there is more complexity, you can put it behind a port to unit test it separately.
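A non-mutating variant could look like the sketch below. The method name result comes from the paragraph above; the frozen OrderContext dataclass and the total_discount helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderContext:
    # Frozen: rules can read the context but cannot modify it.
    order_total: float
    order_count: int


class PercentageDiscountRule:
    def __init__(self, percent, min_order):
        self.percent = percent
        self.min_order = min_order

    def applies(self, context):
        return context.order_total >= self.min_order

    def result(self, context):
        # Return the discount instead of writing it onto the context.
        return context.order_total * (self.percent / 100)


class FirstPurchaseDiscountRule:
    def __init__(self, amount):
        self.amount = amount

    def applies(self, context):
        return context.order_count == 0

    def result(self, context):
        return self.amount


def total_discount(rules, context):
    # The caller decides how results combine; here they are summed.
    return sum(rule.result(context) for rule in rules if rule.applies(context))


rules = [
    PercentageDiscountRule(percent=10, min_order=100),
    FirstPurchaseDiscountRule(amount=5),
]
returning_customer = OrderContext(order_total=200.0, order_count=3)
print(total_discount(rules, returning_customer))  # 20.0, only the percentage rule applies
```

Because the rules no longer touch shared state, each one can be unit tested with nothing but a context value and an expected number.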

The shape generalizes: any extension point in your system can have its own base class, its own family of implementations, and its own data schema describing how it's configured per tenant.

Where the data lives

The configuration has to be persisted somewhere. The options aren't equivalent:

In-memory cache. Tempting because it's fast, but caches get invalidated, evicted, and reset on deploy. If the cache is the source of truth, you've lost the data the moment something restarts. Caches belong in front of the source of truth, not in place of it.
Files on disk. Workable for very small, very stable configurations, but file I/O is slow at scale, file deployment is operational overhead, and "edit a file and redeploy" doesn't fit the case where customer success needs to toggle something for a tenant at 4pm on a Friday.
Static configuration baked into the app. Fine for values that genuinely never change between deploys. But if the values are tenant-specific, you're back to the "code per customer" problem.
A database. If you're already running one — and you almost certainly are — this is the clear winner. Reads are fast (especially with a thin cache in front), updates are transactional, the data sits next to the tenant records it's associated with, and you get backups, replication, and access control for free.

Use the database you already have. Don't introduce a new piece of infrastructure for this.
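The "cache in front of the source of truth" arrangement can be sketched as a read-through cache with a short time-to-live. Everything named here is hypothetical: ConfigCache, the load_from_db callable, and the 30-second TTL are illustrative choices, and a plain dictionary stands in for the real database.

```python
import time


class ConfigCache:
    """Read-through cache: the loader (the database) stays the source of truth."""

    def __init__(self, load_from_db, ttl_seconds=30.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # tenant_id -> (time stored, config)

    def get(self, tenant_id):
        now = self._clock()
        cached = self._entries.get(tenant_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Miss or expired entry: fall back to the source of truth.
        config = self._load(tenant_id)
        self._entries[tenant_id] = (now, config)
        return config

    def invalidate(self, tenant_id):
        # Call this after writing a tenant's configuration.
        self._entries.pop(tenant_id, None)


# Stand-in for the database, plus a log of how often it is actually read.
fake_db = {42: {"discount_rules": []}}
loads = []


def load_from_db(tenant_id):
    loads.append(tenant_id)
    return fake_db[tenant_id]


cache = ConfigCache(load_from_db)
cache.get(42)
cache.get(42)
print(len(loads))  # 1, the second read was served from the cache
cache.invalidate(42)
cache.get(42)
print(len(loads))  # 2, the entry was reloaded after invalidation
```

If the process restarts, the cache starts empty and refills from the database on the next read, so nothing is lost.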

A note on schema

Whichever shape you pick, the configuration has to be retrievable by tenant. That means a tenant_id foreign key: typically a dedicated tenant_configurations table whose tenant_id references tenants and is indexed for fast lookup. The runtime question is always the same: "given the tenant for this request, what's their configuration?" Get that relationship in place first; everything else flows from being able to find the right rules for the right tenant.

If you're using a relational database, the principled approach beyond that is to model the configuration with normalized tables — a tenant_discount_rules table with tenant_id, typed columns for rule type, percent, min_order, and so on, or a polymorphic schema with a separate table per rule type. This is fine, and you may end up there. But I'd push back on starting there.

For an initial proof of concept, a single table is enough:

CREATE TABLE tenant_configurations (
  tenant_id   BIGINT PRIMARY KEY REFERENCES tenants(id),
  config      JSONB  NOT NULL DEFAULT '{}'::jsonb,
  updated_at  TIMESTAMP NOT NULL DEFAULT NOW()
);

One row per tenant, the primary key handles the lookup index, no migrations needed when you add a new kind of rule. You fetch the row by tenant_id, parse the config JSON, hydrate it into your rule classes, run them. When the configuration stabilizes, when querying into the configuration becomes important, or when validation needs to live at the database level, that's the moment to normalize. Until then, JSON in a column is the shortest path from idea to working code, and you can refactor toward structure once you know what the structure should be.
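The fetch-and-parse step is short. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in so it runs anywhere; the TEXT column replaces JSONB, and load_tenant_config is a hypothetical helper name. With a real Postgres driver the placeholder syntax differs, and the driver may hand back the JSONB value already parsed.

```python
import json
import sqlite3


def load_tenant_config(conn, tenant_id):
    """Fetch one tenant's configuration row and parse it into a dict."""
    row = conn.execute(
        "SELECT config FROM tenant_configurations WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchone()
    # A tenant with no row gets an empty configuration: no custom behavior.
    return json.loads(row[0]) if row else {}


# Stand-in schema: SQLite has no JSONB type, so the config is stored as TEXT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenant_configurations ("
    " tenant_id INTEGER PRIMARY KEY,"
    " config TEXT NOT NULL DEFAULT '{}')"
)
conn.execute(
    "INSERT INTO tenant_configurations (tenant_id, config) VALUES (?, ?)",
    (1, json.dumps({"discount_rules": [{"type": "first_purchase", "amount": 5}]})),
)

print(load_tenant_config(conn, 1))  # the stored configuration for tenant 1
print(load_tenant_config(conn, 2))  # {} because tenant 2 has no row
```

The tenant_id is passed as a bound parameter, never formatted into the SQL string, which keeps this lookup on the safe side of the security line described in the next section.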

The security cliff

There is one thing you must not do, no matter how convenient it looks: do not store executable code in the configuration, and do not let configuration values be interpreted and run.

That means no eval, no exec, no embedded JavaScript or Python or Ruby expressions, no SQL fragments concatenated into queries, no template engines that allow arbitrary function calls. It is tempting (really tempting) to support a configuration that looks like:

{
  "discount_amount": "order.total * 0.1 if customer.tier == 'gold' else 0"
}

…and eval that string at runtime. Do not. The moment you do, anyone who can write to that configuration row can execute arbitrary code on your servers, with the privileges of your application. That's not a feature; that's a remote code execution vulnerability you built on purpose. It doesn't matter that the configuration is "only" editable by admins, or "only" through your UI — the surface area expands the moment another bug exposes that table, the moment a credential leaks, the moment an internal account is phished. The configuration becomes the attacker's payload delivery mechanism, and you handed them the loaded gun.

The correct discipline is strict: configuration is data. It selects between behaviors the code already knows how to perform and supplies typed parameters to them. It never describes a new behavior. If a customer needs a behavior the code doesn't have, the answer is to add a new rule class, not to let them write logic into a JSON blob.
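The tempting expression above can usually be restated as a rule type with typed parameters. Here is a minimal sketch, assuming a hypothetical tier_percentage rule type, a made-up set of tier names, and a parse_tier_rule validator, none of which come from a real library:

```python
from dataclasses import dataclass

ALLOWED_TIERS = {"bronze", "silver", "gold"}  # hypothetical tier names


@dataclass(frozen=True)
class TierPercentageRule:
    tier: str
    percent: float

    def applies(self, customer_tier):
        return customer_tier == self.tier

    def result(self, order_total):
        return order_total * self.percent / 100


def parse_tier_rule(entry):
    """Accept only the exact expected shape; nothing in it is ever executed."""
    if set(entry) != {"type", "tier", "percent"}:
        raise ValueError("unexpected keys in rule")
    if entry["type"] != "tier_percentage":
        raise ValueError("wrong rule type")
    tier = entry["tier"]
    percent = entry["percent"]
    if not isinstance(tier, str) or tier not in ALLOWED_TIERS:
        raise ValueError("unknown tier")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(percent, bool) or not isinstance(percent, (int, float)):
        raise ValueError("percent must be a number")
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    return TierPercentageRule(tier=tier, percent=float(percent))


rule = parse_tier_rule({"type": "tier_percentage", "tier": "gold", "percent": 10})
print(rule.applies("gold"), rule.result(250.0))  # True 25.0

# A string that looks like code is just an invalid value and gets rejected.
try:
    parse_tier_rule({"type": "tier_percentage", "tier": "gold",
                     "percent": "order.total * 0.1"})
except ValueError as error:
    print("rejected:", error)
```

The worst a hostile or mistaken value can do here is fail validation. The expression the customer wanted still exists, but as a class the code already knows how to run.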

This is also what makes the system safe to expose to customer-success people, support engineers, and eventually self-service customers. The blast radius of a misconfigured rule is "the rule doesn't apply" or "the rule applies wrong". Never "the server runs whatever I told it to."

The shape, summarized

Identify per-tenant extension points and write a small base class for each.
Implement the concrete behaviors as subclasses of that base.
Store tenant configurations as data; start with a JSON column in a table keyed by tenant_id, normalize later if it earns it.
Hydrate the data into classes at runtime; let the classes do the work.
Never, ever let the data become code.

The principle underneath all of this is that code is the menu (the list of things your system is capable of doing) and data is the order. Customers can pick from the menu, in any combination, with any parameters. They cannot rewrite the menu. The chef writes the menu. That's how you keep the kitchen safe.

Why I prefer docker + make

Ian Johnson — Mon, 18 May 2026 14:06:43 +0000

A development environment is one of those things you don't notice until it goes wrong, and then you can't think about anything else. A new hire spends two days getting the project to run. A test passes locally and fails in CI. The Postgres version on your laptop is 13 and production is 15, and a query that's fast for you takes forever in staging. Someone updates a dependency and half the team can't start the app until they delete and reinstall their entire toolchain. Every one of these is a small disaster that shouldn't have happened.

My stack for avoiding this is unfashionable and boring: Docker for the environment, Make for the workflow on top of it. It isn't the only way to do this and it isn't always the best way. But it's the combination I keep coming back to, and the reasons are worth writing down.

What Docker gives you

The pitch for Docker, after all the hype died down, is simple: the development environment is pinned, versioned, and identical to (or very close to) the production one. The image specifies the OS, the language runtime, the system libraries, the binaries. Two developers running the same image are running the same environment. The CI server running the same image is running the same environment. The container you deploy to production is running the same environment. "Works on my machine" stops being a meaningful thing to say, because the machine isn't yours — it's the image.

The same property extends to dependent services. With docker compose, adding a Postgres or a Redis or a RabbitMQ to your local stack is a few lines of YAML, and the version is pinned the same way. Nobody on the team has to install Postgres on their laptop. Nobody has to remember which Postgres version the project needs. The configuration is in the repo, and it stays in sync with the application code that depends on it.

Once you have this, you stop being surprised by environment differences. That alone is worth a lot.

Why Docker alone isn't enough

The catch is that Docker commands are verbose, unmemorable, and easy to invoke wrong. Running tests inside the container looks something like:

docker compose run --rm app bundle exec rspec spec/

That's four pieces of information — the orchestrator, the subcommand, the service name, the actual command — and you have to remember all of them in the right order. The first time you type it, you'll fight with it. The tenth time, you'll have it muscle-memorized. The hundredth time, you'll mistype it under pressure and wonder why the test runner can't find your file.

It gets worse. Most developers will have bundle or npm or composer installed locally too, because their editor wants it for IntelliSense or because of some other tool that needs it. The temptation to run bundle exec rspec outside the container (because it's shorter and faster to type) is constant. And the moment some people on the team are running tests inside the container and others are running them outside, you're back to "works on my machine," just with extra steps. Discipline isn't a feature of the stack at that point; it's something each developer has to provide individually, which means at least one of them will provide less of it on a tired afternoon.

Make as the single front door

Make is what I use to fix this. It is older than most of the things on my computer, ubiquitous on Unix systems, and almost embarrassingly simple. A Makefile is a list of named targets, each of which expands to a command (or several). You type make test and the Makefile knows what that means.

A small Makefile might look like:

.PHONY: up down test lint shell migrate

# Recipe lines are indented with a tab character; Make expects a tab here, not spaces.
up:
	docker compose up -d

down:
	docker compose down

test:
	docker compose run --rm app bundle exec rspec

lint:
	docker compose run --rm app bundle exec rubocop

shell:
	docker compose run --rm app bash

migrate:
	docker compose run --rm app bundle exec rails db:migrate

That's the whole game. The verbose Docker commands are now invoked by their semantic names. make test is shorter, more memorable, and harder to get wrong than the underlying command. It also doesn't matter whether you have Ruby or Node installed on your laptop, because make test always goes through the container, every time, for everyone. The discipline isn't something each developer carries; it's built into the workflow.

There's a secondary benefit I value a lot: the Makefile is documentation. A new developer reading the Makefile learns what the project's workflow is — these are the operations the team performs, with these names. The README can say "run make test" and the implementation details of how that actually happens are one click away, but invisible to anyone who just wants to do the thing.

The agent angle

This is where the stack pays an extra dividend. When a coding agent works on your project, it has to figure out how to do basic operations — run the tests, format the code, start the database, apply migrations. Without a clear convention, it'll improvise, and it'll improvise differently every time. Sometimes it'll find your Docker commands. Sometimes it'll try to run things on the host and fail because the language runtime isn't installed there. Sometimes it'll invent a new approach that almost works.

With a well-documented Makefile and an instruction in the agent's harness that says "always use make targets; never invoke Docker or language tooling directly," you've given the agent a stable, narrow interface to the project. It doesn't have to figure out the environment. It just has to read the Makefile.

This is the same property that helps human developers, applied to a contributor that doesn't get tired or bored of typing make test for the thousandth time. The agent gets the same single front door. Your CI uses the same targets. Your local development uses the same targets. The whole team, humans and agents both, converges on one way to do each thing, and that one way runs inside the same environment as production.

The honest caveats

Make isn't fashionable. The syntax is finicky (tabs, not spaces, and the rules around variable expansion will bite you eventually). For very complex workflows, you'll outgrow it, and you'll either learn to live with its quirks or move to another task runner such as grunt, rake, or just (the last of which is essentially Make with the rough edges sanded off). The reason I prefer Make over many other task runners is that it does not depend on a language runtime: grunt needs Node and rake needs Ruby, while Make is already present on most Unix systems.

Docker has its own costs. The image build cycle adds friction. File watching across the container boundary can be slow. On macOS, the virtualization layer has historically been a source of mysterious performance issues, though it's gotten much better. None of this is free.

But neither is the alternative. The alternative is the new hire on day one, the version skew on Postgres, the test that passes for one developer and fails for another, the agent that can't figure out how to run anything. Docker plus Make is a way of paying these costs once, up front, in the form of a Dockerfile and a Makefile, instead of paying them again and again as small ongoing surprises. That trade has consistently worked out in my favor.

"It works" has two jobs in software. Sometimes it describes a state. Sometimes it ends a conversation. A short piece on the difference, and what it costs you later.

Ian Johnson — Fri, 15 May 2026 16:31:22 +0000

It werks!

Ian Johnson — Fri, 15 May 2026 16:26:13 +0000

Someone on your team says "it works" and there's a moment of relief, maybe even satisfaction. The deploy went out. The bug is gone. The new feature is up. Whatever they were wrestling with, it works.

It's worth pausing on that phrase, because "it works" is doing a lot of hidden work itself. Working software is genuinely valuable — much better to have something that does the thing than something that doesn't. But "works" is a weak property when you don't have the others next to it. Reliable. Robust. Predictable. Dependable. Tested. Each of those is something stronger than "I ran it just now and it didn't break." Each of them is a claim about what the software will keep doing, under variation, under load, under change, when nobody is watching.

When you have the observation without those properties, what you have is software that werks. It sounds the same as "works" when you say it out loud. It'll pass casual inspection, it'll satisfy the demo, it'll close the ticket. But it isn't correct, in the same way "werks" isn't correctly spelled. It happens to produce the right output for the inputs it has seen, by some combination of luck, coincidence, and undocumented assumption. Push on it a little, and it stops.

What incidentally-working code looks like

The clearest examples are the ones where two bugs cancel each other out: a function that computes the wrong answer, fed into another function that, by coincidence, expects exactly that wrong answer. Fix either one in isolation and the system breaks. Nobody knows this, because no test ever exercised the boundary; the only thing keeping the lights on is that nobody has touched either function in a while.

There's a softer version that's much more common. A function processes the inputs you happen to feed it today and produces correct outputs. The inputs you haven't fed it (slightly different formats, edge cases, sizes outside what you've seen) would produce silently wrong outputs. Currency math that's right for USD and broken for JPY. Date handling that's right in your timezone and wrong everywhere else. A regex that matches the strings you tested and matches half the URLs in production by accident. A query that returns the right rows in the right order because the database happens to have a particular index, and one day someone drops the index for an unrelated reason.
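The currency case is easy to reproduce. In this sketch the two conversion helpers are hypothetical; the exponents (2 for USD, 0 for JPY) are the ISO 4217 minor-unit values for those currencies.

```python
from decimal import Decimal

# Number of decimal places in each currency's minor unit (ISO 4217).
MINOR_UNIT_EXPONENT = {"USD": 2, "JPY": 0}


def to_minor_units_naive(amount, currency):
    # Werks for USD: silently assumes every currency has hundredths.
    return int(amount * 100)


def to_minor_units(amount, currency):
    # Intentional version: the exponent depends on the currency.
    return int(amount * (10 ** MINOR_UNIT_EXPONENT[currency]))


print(to_minor_units_naive(Decimal("19.99"), "USD"))  # 1999, correct
print(to_minor_units_naive(Decimal("1500"), "JPY"))   # 150000, 100x too large
print(to_minor_units(Decimal("1500"), "JPY"))         # 1500, correct
```

A test suite that only ever used USD fixtures would pass against the naive version indefinitely, which is exactly the kind of observation that gets mistaken for a property.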

And then there are the timing cases. A race condition that almost always loses, until it doesn't. An eventual-consistency window that's almost always shorter than the next read, until traffic spikes and it isn't. A retry that almost always succeeds within three attempts, until the downstream service has a bad day.

All of this is "it works." None of it is reliable.

The danger is what you build on top

A piece of incidentally-working code is a small problem. The bigger problem is what happens next, which is that someone builds on top of it. They don't know it's incidental. From their perspective it looks like a normal function returning a normal value. So they call it from another function, which calls it from another function, and now the assumption that the original code happened to satisfy is load-bearing for half the system.

The longer this goes on, the more expensive the eventual correction gets. By the time you discover that the foundation has a property nobody intended, you have to fix the foundation and update everything that came to depend on the accidental property. If your discovery happened because of a production incident, you're doing that work while customers are watching.

This is how codebases acquire that distinctive quality where nobody wants to touch certain modules. It isn't that the modules are complicated...well, they often are, but that's a symptom. It's that nobody is sure which parts are doing what they look like they're doing and which parts are doing something subtler that happens to come out right. Every change risks knocking over one of the invisible struts.

The fix is refactoring, and refactoring needs tests

The way out is the same way you'd handle any code you don't fully trust: get tests around it, then change it. Pin down what it currently does (characterization tests, if the current behavior is unverified) and then refactor toward something where the behavior is intentional rather than accidental. Replace the regex that happens to work with one that says what it means. Replace the timing assumption with an explicit synchronization or an idempotency check. Replace the implicit dependency on a database index with an ORDER BY clause. Replace the JPY-breaking currency math with a money type that respects precision.
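A characterization test records what the code does today, whether or not that behavior was intended. The sketch below uses a hypothetical legacy_slug function as the untrusted code; the test functions use plain assert statements so a runner such as pytest could collect them, and they are also called directly so the file runs on its own.

```python
def legacy_slug(title):
    # Hypothetical legacy function whose exact behavior nobody specified.
    return title.strip().lower().replace(" ", "-")


def test_simple_title():
    assert legacy_slug("Hello World") == "hello-world"


def test_double_space_yields_double_hyphen():
    # Probably unintended, but callers may depend on it, so pin it first.
    assert legacy_slug("Hello  World") == "hello--world"


def test_punctuation_is_kept():
    # Also suspicious. Recorded as-is, to be changed deliberately later.
    assert legacy_slug("Hello, World!") == "hello,-world!"


for test in (
    test_simple_title,
    test_double_space_yields_double_hyphen,
    test_punctuation_is_kept,
):
    test()
print("current behavior pinned")
```

Once these pass, a refactor that changes the double-hyphen behavior will fail a named test, and you can decide on purpose whether to keep the quirk or update the callers.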

The tests are what make this safe. Without them, you can't tell whether your refactor preserved the accidental property that some downstream code is quietly depending on. With them, you can change the code with confidence. Even better, when you discover that something downstream was depending on the accident, the failing test tells you exactly where, instead of a customer telling you on Twitter.

The compound result, over time, is a codebase whose behavior is intentional. Things work because someone made them work, in a specific way, on purpose. Things continue to work because the tests catch you when you slip. That's the difference between a system you can confidently change and a system you tiptoe around.

When "it works" becomes the argument against fixing it

There's a flip side to all of this, which is when "it works" stops being an observation and starts being a defense. You raise a concern about a piece of code (the timing is fragile, the regex is doing something a regex shouldn't be relied on for, the currency math is going to break the day someone adds a non-USD customer) and the response comes back: it works. We have other priorities. If it ain't broke, don't fix it.

The phrase sounds like a cost-benefit analysis, but it isn't one. A real cost-benefit analysis would name what you're getting and what you're giving up. "It works" skips straight to the conclusion by treating "works" as a binary — either it does or it doesn't, and since it does, we're done. Everything in the preceding sections is the case for why that binary is the wrong frame.

What you're actually accepting when you deploy "it works" as a defense is a list of things, and they're worth saying out loud. You're accepting that the accidental property will continue to hold under conditions you can't enumerate, because you haven't enumerated them. You're accepting that when it does break, it will break at a time you didn't choose. Usually the worst time, because the conditions that break it correlate with unusual load, unusual data, unusual everything. You're accepting that the fix will be more expensive later, because more code will have come to depend on the current behavior in the meantime. And you're accepting that the people who understand the system well enough to fix it cheaply today may not be on the team by the time the bill comes due.

None of that is automatically wrong as a tradeoff. Sometimes you genuinely don't have the cycles, and the expected cost of the eventual incident really is lower than the cost of fixing it now. That's a real call to make. But it's a call you can only make honestly if you've named the thing you're trading away. "We know this is fragile in these specific ways, and we're choosing to leave it because X" is an engineering decision. "It works" is the version of that sentence where everything after "works" has been quietly deleted — and what's left sounds like a reason but is actually a refusal to look.

So the next time you hear "it works"

Ask one more question. Does it work, or does it werk? Is the behavior a property you can rely on, or is it an observation you got lucky with? Are there tests that say it will keep working, or just a person who ran it once and didn't see it break?

And ask hardest when "it works" is being used to end the conversation rather than describe a state. The defensive "it works" is the one most likely to be covering something its speaker hasn't actually looked at.

Working software is good. It is genuinely better than software that doesn't work. But "it works" is the floor, not the ceiling, and a system built entirely out of code that satisfies the floor is a system that will surprise you. Push for the rest — for code whose correctness is intentional, whose behavior is pinned down, whose dependencies are explicit. That's when "it works" becomes the same word in writing as it is when you say it out loud.

Stop nesting deeply

Ian Johnson — Fri, 15 May 2026 14:20:00 +0000

Open a function and let your eyes drift to the right edge of the screen. If the code is leaning over halfway to that edge by line ten, the function is in trouble. Maybe not in a way that breaks tests (deeply nested code can be perfectly correct) but in a way that breaks comprehension. Every level of indentation is another condition the reader has to hold in their head to understand what the innermost line means. Five levels in, the reader is tracking five separate predicates, and the actual work is squeezed against the wall.

This isn't a new observation. The JavaScript community has a name for the worst case: callback hell. It's the staircase of function(err, result) { followed by another, and another, each indented further than the last, until the actual business logic (the reason the code exists) is buried so far inside that you have to scroll right to read it. The escape was promises, then async/await, but the underlying problem isn't specific to callbacks. It shows up wherever code nests deeper than it needs to: nested loops, nested conditions, nested try/catch, nested methods that themselves contain nested blocks. The shape is always the same: an arrow drifting toward the right margin. The cost is always the same too: a loss of readability.

Early returns flatten the function

The single most useful technique is the guard clause: when a precondition fails, return (or raise) immediately. Don't wrap the rest of the function in an if and indent everything inside it. Send the bad cases out the front door so the happy path can run flat.

Here's a Python example. The deeply nested version:

def charge_customer(customer, amount):
    if customer is not None:
        if customer.is_active:
            if customer.has_payment_method():
                if amount > 0:
                    return process_charge(customer, amount)
                else:
                    raise ValueError("amount must be positive")
            else:
                raise ValueError("no payment method")
        else:
            raise ValueError("customer is inactive")
    else:
        raise ValueError("customer is required")

The flat version says the same thing:

def charge_customer(customer, amount):
    if customer is None:
        raise ValueError("customer is required")
    if not customer.is_active:
        raise ValueError("customer is inactive")
    if not customer.has_payment_method():
        raise ValueError("no payment method")
    if amount <= 0:
        raise ValueError("amount must be positive")
    return process_charge(customer, amount)

Same logic, same checks, same outcomes...but the second version reads top to bottom like a list of preconditions followed by the actual work. There's no rightward drift, no else clauses to track, and the line that does the real thing is at the same indentation level as the function itself. You can see at a glance what the function does: charge the customer, assuming a handful of conditions are met.

The other thing happening here, quietly, is that the function is now using exceptions for the error cases rather than nesting around them. That's the move from the previous post, applied: when something prevents the function from doing its job, raise; let the caller decide what to do about it. Exceptions are the natural partner of guard clauses. They're how the bad cases leave the function without forcing the good cases to indent around them.

Let collections do the filtering

A lot of nesting hides inside loops. The classic shape is "iterate, check, skip": a for loop with an if that excludes the items you don't care about, and a continue or a nested block for the rest. Whenever you see that pattern, there's almost always a filter you haven't named yet.

Ruby gives you a clean way to skip the nesting entirely. Instead of:

def total_active_balances(accounts)
  total = 0
  accounts.each do |account|
    if account.active?
      if account.balance > 0
        total += account.balance
      end
    end
  end
  total
end

…filter first, then sum the result:

def total_active_balances(accounts)
  accounts
    .select { |a| a.active? && a.balance.positive? }
    .sum(&:balance)
end

The second version has no nesting, no accumulator variable, and reads almost like the spec: from accounts, select the active ones with a positive balance, then sum their balances. The collection operations are the filtering and the aggregation; you don't need a control-flow scaffolding around them.

This leans into functional programming a bit, which is fine in OOP: it's not that you can't use the techniques, it's about which unit of abstraction is primary. Notice that we are replacing an imperative loop, which forces the reader to track an accumulator and two conditions, with a declarative description that is much easier to understand. It has been estimated that 80% of the code on IBM's mainframes could have been replaced with filter, map, and reduce. Higher-order functions are powerful. They allow you to focus on the domain, not on managing state in loops.

The same pattern works in Python with comprehensions or generator expressions, and in any modern language with collection pipelines. continue and break are useful when you really need them, but most of the time they're a sign that the loop body is doing two jobs (selecting which items to process AND processing them) and one of those jobs belongs to the collection, not to the loop.
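For comparison, here is a Python version of the same function using a generator expression. The Account dataclass is a hypothetical stand-in for whatever account object the real code uses, with balances held as integers.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical account shape, just enough for the example.
    active: bool
    balance: int


def total_active_balances(accounts):
    # The condition selects; sum() aggregates. No accumulator, no nesting.
    return sum(a.balance for a in accounts if a.active and a.balance > 0)


accounts = [
    Account(active=True, balance=100),
    Account(active=False, balance=50),
    Account(active=True, balance=-20),
    Account(active=True, balance=30),
]
print(total_active_balances(accounts))  # 130
```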

Validate at the edge, trust the middle

Defensive programming gets a bad reputation when it's applied uniformly. Checking every argument in every function for null, type, and range produces a codebase that's mostly assertions and barely any logic. But applied at the edges of a module or a system, it cuts nesting deep inside.

The idea is: validate inputs once, at the boundary where untrusted data enters your code. After that, the rest of the code is allowed to assume the inputs are valid. Inside the trusted region, you don't write if $user !== null around every operation, because the boundary already established that $user is a real user.

Here's a small PHP example. Without an edge check, every method has to defend itself:

class OrderService {
    public function place(?Customer $customer, ?Cart $cart): Order {
        if ($customer !== null) {
            if ($cart !== null) {
                if (!$cart->isEmpty()) {
                    // ... actual logic, three levels deep
                }
            }
        }
    }
}

With validation pushed to the entry point — the controller, the request handler, wherever untrusted data arrives — the service can assume its inputs:

class OrderService {
    public function place(Customer $customer, Cart $cart): Order {
        if ($cart->isEmpty()) {
            throw new EmptyCartException();
        }
        // ... actual logic, no nesting
    }
}

The types now say "non-null"; the one precondition that's specifically the service's job to check is handled with a guard clause; the actual work is flat. The defensive checking still exists, but it lives where it makes sense (at the boundary) instead of being smeared across every function in the system.

Why this matters

Flat code isn't a stylistic preference. It's a property that makes code readable, which makes it changeable, which makes it reliable over time. A reader scanning a function should be able to see, at a glance, what it does: take these inputs, check these conditions, perform this work, return this result. Every level of indentation is a hedge the reader has to keep tracking — "we're inside the case where X is true and Y is false and Z is non-null" — and humans run out of stack space for that quickly. So do agents, by the way.

The techniques are all small. Invert a condition and return early. Replace a nested if/continue with a filter. Push validation to the boundary. Let exceptions carry error paths up the stack instead of nesting around them. None of these are clever. They're just the discipline of letting the function's shape match what the function actually does — preconditions first, work in the middle, result at the end, errors out the side. When the shape matches the meaning, the code stops fighting the reader. That's the goal.

Credentials in web applications: how to store them properly

Ian Johnson — Thu, 14 May 2026 15:45:06 +0000

Almost every breach you read about in the news involves credentials. Sometimes it's passwords pulled out of a database that hashed them badly. Sometimes it's an API key committed to a public GitHub repo. Sometimes it's a session token stolen from a JavaScript variable because somebody stored it in localStorage. More recently, it's an API key left public in a vibe-coded app. The technical details vary; the underlying problem is usually the same. Someone treated a secret like ordinary data and stored it the way they would store anything else.

This guide covers what counts as a credential, the small number of things you actually need to do to handle each kind correctly, and the mistakes that show up over and over in real applications.

The three kinds of credentials

The first thing to internalize is that "credential" isn't one thing. There are at least three categories, and they need to be handled differently:

User credentials. What your users give you to prove who they are — passwords, primarily. You don't actually want to store these; you want to store something that lets you verify them later without being able to recover the original.
Session credentials. Tokens your app issues after a user logs in, so they don't have to log in again on every request. Cookies are the most common form.
Service credentials. The secrets your app needs to function: database passwords, API keys for third-party services, signing keys, encryption keys. Your application holds these; users never see them.

Storage strategy for each is genuinely different. Mixing up the categories — "encrypting" user passwords because you encrypt API keys, or stuffing service credentials in client-side code because you ship session tokens to the browser — is where a lot of trouble starts.

User passwords: hash, don't encrypt

The most important sentence in this article: you do not store user passwords. You store password hashes. A password is something a user gives you at login. A hash is a one-way function of that password. When the user comes back, you hash what they typed and compare it to what you stored. If anyone steals your database, they get hashes, not passwords. They still have to crack the hashes to learn anything useful...and with a modern hashing algorithm, cracking is intentionally slow.

The correct algorithms today are bcrypt, scrypt, and argon2 (specifically argon2id). All three are designed to be slow, and scrypt and argon2 are also memory-hard, which makes brute-forcing them expensive even on GPUs. The standard libraries for them also handle salting for you automatically. Every password gets a unique random salt, mixed into the hash, so two users with the same password get different stored values, and an attacker can't precompute a rainbow table once and reuse it across accounts.

What you must not use: MD5, SHA-1, or any single application of a fast hash like SHA-256. These were designed to be fast, which is exactly the wrong property for password hashing. Modern GPUs can compute billions of fast-hash operations per second. A database hashed with an unsalted fast hash gets cracked in hours, sometimes minutes.

You also don't need to write any of this yourself. Every mainstream language has a vetted library. In Python you use bcrypt or argon2-cffi. In Ruby, bcrypt is built into Rails via has_secure_password. In PHP, password_hash() and password_verify() are in the standard library and use bcrypt by default. The library generates the salt, picks a cost factor, and produces a single string you store in a column. You give it the user's input and the stored value; it tells you yes or no.
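To make the "input plus stored value in, yes or no out" shape concrete, here is a rough sketch using scrypt from Python's standard library, chosen only so the example runs without installing anything. The function names, the storage format, and the cost parameters are assumptions for illustration; a dedicated library such as the ones named above picks the salt, the parameters, and the encoding for you, and is the better choice in a real application.

```python
import hashlib
import hmac
import os

# Illustrative cost parameters; tune for your hardware and check current guidance.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1


def hash_password(password: str) -> str:
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.scrypt(
        password.encode(), salt=salt, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P
    )
    # Keep salt and digest together in one string that fits in a single column.
    return f"{salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    salt_hex, digest_hex = stored.split("$")
    candidate = hashlib.scrypt(
        password.encode(),
        salt=bytes.fromhex(salt_hex),
        n=SCRYPT_N,
        r=SCRYPT_R,
        p=SCRYPT_P,
    )
    # Constant-time comparison, so timing doesn't reveal how close a guess was.
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))
```

Nothing here can turn the stored string back into the password; verification works only by hashing the new input with the same salt and comparing.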

A short list of things people get wrong here:

Using md5(password + salt) because someone read about salting on a blog from 2008. Salting helps, but it does not fix the speed problem. Fast hashes are still fast.
"Encrypting" the password so they can email it back to the user if they forget it. If you can recover a password, so can an attacker who steals your database and the encryption key, which usually lives nearby. Implement password reset via a time-limited token sent to the user's email instead.
Logging the password. It happens constantly; a debug log on the login endpoint that dumps the request body. Scrub sensitive fields out of your logs before they leave the application.
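The log-scrubbing point can be sketched in a few lines. This is one possible shape, not a complete solution: the key names in SENSITIVE_KEYS and the scrub helper are assumptions, and a real application would apply something like it wherever request bodies get logged.

```python
# Assumed field names; extend this set to match your own request payloads.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}


def scrub(payload: dict) -> dict:
    """Return a copy of payload that is safe to write to a log."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = scrub(value)  # nested objects get the same treatment
        else:
            clean[key] = value
    return clean


print(scrub({"email": "ada@example.com", "password": "hunter2"}))
# {'email': 'ada@example.com', 'password': '[REDACTED]'}
```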

Session tokens: cookies done right

Once a user logs in, you need a way to recognize them on subsequent requests without making them log in every time. That's a session credential. Almost always, it should be a random, opaque token stored on the user's machine and sent back to your server with each request. The server looks up that token to find the session.

The standard mechanism is a cookie, and the cookie needs three flags set:

HttpOnly prevents JavaScript on the page from reading the cookie. This is the single most important defense against an XSS attack stealing the session.
Secure prevents the cookie from being sent over plain HTTP. Always set this in production.
SameSite=Lax (or Strict, depending on your needs) prevents the cookie from being sent on cross-site requests, which protects against CSRF.

The token itself should be long (128 bits of entropy or more) and generated by a cryptographically secure random generator — secrets.token_urlsafe() in Python, SecureRandom.urlsafe_base64() in Ruby, random_bytes() in PHP. Not the regular random function. Not a UUID v4 (close, but its spec doesn't guarantee the entropy distribution you want for security tokens). The crypto-grade RNG (random number generator).
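Put together, token generation and the three cookie flags might look like the sketch below, using only the standard library. The cookie name "session" and the helper names are assumptions; most web frameworks expose the same flags through their own response API.

```python
import secrets
from http.cookies import SimpleCookie


def new_session_token() -> str:
    # 32 random bytes (256 bits) from the operating system's CSPRNG.
    return secrets.token_urlsafe(32)


def session_cookie_header(token: str) -> str:
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["httponly"] = True   # not readable from JavaScript
    cookie["session"]["secure"] = True     # only sent over HTTPS
    cookie["session"]["samesite"] = "Lax"  # withheld on most cross-site requests
    cookie["session"]["path"] = "/"
    # Value for a Set-Cookie response header.
    return cookie["session"].OutputString()


print(session_cookie_header(new_session_token()))
```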

A few words on JWTs. They're popular, especially in single-page-app and mobile contexts, but they're often misused. A JWT is a self-contained, signed token that proves the bearer is allowed to do something. The trade-off is that you can't easily revoke them — if someone steals a JWT, it's valid until it expires. Sessions stored server-side (in Redis, in your database) can be invalidated on the server with a single record delete. If you don't have a specific reason JWTs solve a problem for you, prefer server-side sessions. And if you do use JWTs, never put them in localStorage — that's readable by any JavaScript on the page, defeating the protection HttpOnly cookies give you. Send them as HttpOnly cookies or, at minimum, hold them in memory and not in storage that survives a page reload.
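The revocation difference is easier to see in code. Here is a toy server-side session store, with a dictionary standing in for Redis or a sessions table; the class and method names are invented for the sketch.

```python
import secrets


class SessionStore:
    """In-memory stand-in for Redis or a database table of sessions."""

    def __init__(self):
        self._sessions = {}

    def create(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._sessions[token] = user_id
        return token

    def lookup(self, token: str):
        # Returns the user id, or None if the token is unknown or revoked.
        return self._sessions.get(token)

    def revoke(self, token: str) -> None:
        # One delete and the token is dead, wherever a copy of it lives.
        self._sessions.pop(token, None)
```

A stolen opaque token stops working the moment revoke() runs. A signed, self-contained token has no equivalent step unless you add server-side state back in.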

Service credentials: outside the code, outside git

Your application has secrets it uses to talk to other systems: a database password, a Stripe API key, an SMTP credential, signing keys, OAuth client secrets. These need to be available to the running application but invisible to everyone else.

The non-negotiable rules:

Never commit secrets to source control. Not in code, not in config files checked into the repo, not in tests, not even in commit messages. Once a secret has touched a git history, treat it as compromised and rotate it — even if you rewrite history to remove it, you have to assume it leaked.
Never put secrets in client-side code. Any "secret" in your JavaScript bundle, your mobile app binary, or your HTML is public. Anyone can View Source or unzip the APK. If your frontend needs to call a third-party API that requires a secret, proxy that call through your backend.
Use environment variables or a secret manager. In development, a local .env file (listed in .gitignore) is fine. In production, use environment variables injected by your deploy system, or a dedicated secret manager like AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault. Secret managers add rotation, access control, and audit logging, which matter as your team and surface area grow.
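On the application side, reading from the environment can be as small as the sketch below. The require_secret helper and the MissingSecretError name are assumptions; the idea is to fail loudly at startup when a secret is missing, and to name the variable in the error without ever printing its value.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not configured."""


def require_secret(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        # Name the variable, never the value.
        raise MissingSecretError(f"required secret {name} is not set")
    return value
```

A typical call would be something like require_secret("DATABASE_PASSWORD") once during startup, with the result passed to whatever needs it.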

Some additional good practices: use different credentials for different environments (the staging database password is not the production one), grant each service the narrowest permissions it actually needs, and rotate credentials periodically. Rotate immediately if anyone with access leaves the team or if there's any hint that a credential might have leaked.

In CI: GitHub Actions and similar

Continuous integration is one of the most common places credentials leak from. Build logs are often visible to anyone who can see the repo, workflows run third-party actions whose code can change between releases, and a misconfigured pipeline will happily print a secret on its way to using it. A few rules cover most of the risk.

Use the platform's secret store. In GitHub Actions, that's Settings → Secrets and variables → Actions. Add the secret there, then reference it in the workflow as ${{ secrets.MY_SECRET }}. GitHub will automatically mask the value in logs if it appears verbatim, but masking is a safety net, not a strategy — don't echo secrets, don't pass them as command-line arguments (they show up in process listings on the runner), and don't write them to files that get uploaded as build artifacts.

Scope secrets to environments. GitHub Actions lets you attach secrets to a named environment (production, staging) and even require manual approval before a workflow can access them. This means a pull request from a feature branch can't accidentally (or maliciously) pull production keys. And if your CI runs on pull requests from forks, be especially careful: by default, fork PRs don't get access to repository secrets, which is the safe behavior. Don't undo that without understanding what you're enabling.

Prefer short-lived credentials over long-lived ones. For cloud providers, that means using OIDC: GitHub Actions can authenticate directly to AWS, GCP, or Azure and receive a short-lived token scoped to exactly what the workflow needs, with no long-lived access key stored in the repo at all. This is the modern best practice. Setting it up is more work than pasting an access key into a secret. But the work is worth it, because there's nothing static to steal.

Finally, audit your third-party actions. A popular action with thousands of stars is still arbitrary code running in the same environment as your secrets. Pin actions to a specific commit SHA rather than a moving tag like @v1, and review the source of anything new you bring in.

Setting credentials in staging and production

Production credentials need to be available on the running server without ever passing through a place they don't belong, such as your repo, your container image, your build artifact, a Slack channel, or an engineer's laptop.

The standard approach is to inject them at runtime, not bake them in. Most platforms have a built-in mechanism: Heroku has config vars and Fly has secrets; Kubernetes has native Secret resources, and AWS ECS can inject values from Secrets Manager or Parameter Store into a task's environment; systemd has EnvironmentFile; Vercel, Netlify, and Render expose environment variables in their dashboards. The application reads from environment variables at startup, and the platform is responsible for getting the right values into the environment of the right process. If you graduate to a dedicated secret manager (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, Doppler), the same pattern holds. The app reads the secret at startup or fetches it on demand, and never sees the secret at build time.

What this means concretely: do not bake secrets into Docker images. A Docker image is a near-public artifact even when it lives in a private registry. Anyone who pulls it sees everything inside, including historical layers that you thought you deleted. The same goes for AMIs, build artifacts uploaded to artifact stores, and anything produced at build time. Build-time inputs should never include production secrets.

Use different credentials for different environments...really! Staging and production should never share a database password, an API key, or a signing secret. If they do, a leak from the less-protected environment compromises the more-protected one. The same goes for personal development credentials: never let an engineer's local .env contain production values.

Where the platform supports it, prefer identity-based access over stored credentials. AWS IAM roles, GCP workload identity, and the equivalents on other clouds let your running application authenticate as itself, with no static key sitting anywhere. The cloud provider verifies the workload and hands it a short-lived token. This eliminates an entire class of leakage, because there's nothing long-lived to leak.

Finally, audit who has access. The number of humans who can read production secrets should be small, named, and reviewed periodically. Most secret managers log every access. Check those logs. And when a developer leaves the team, rotate the secrets they had access to, not just their personal accounts.

The client-side trap

This deserves a section of its own because it catches people constantly. The browser is not a trusted environment. Anything your JavaScript can read, the user (or an attacker who has gotten code running on the page) can read. That means:

API keys for third-party services — Stripe secret keys, AWS credentials, anything labeled "secret" — must never appear in the browser. If you need to do something privileged from a user action, the user's action calls your backend, and your backend uses the secret.
Authentication tokens stored in localStorage or sessionStorage are accessible to any script that runs on the page, including one injected via XSS. Prefer HttpOnly cookies.
"Hidden" form fields, obfuscated JavaScript, environment variables embedded at build time — none of these hide anything. They just make it slightly slower for an attacker to find what they're looking for.

The mental model: if a value is shipped to the browser, it's public. Design accordingly.

A few cross-cutting principles

A handful of habits cover most of what you need:

Treat every credential as compromised the moment it leaks. Rotate immediately. Don't argue about whether the leak was "really" a leak.
Use libraries, not your own crypto. Hashing, signing, token generation — there's a vetted library in every language. Use it.
Defaults matter more than configuration. A teammate who doesn't know the rules should fall into the pit of success: secrets loaded from environment, cookies with the right flags, passwords going through the standard hashing helper. Make the safe path the easy path.
Audit what you log. Log scrubbing is one of the cheapest, highest-value security investments you can make. Get sensitive fields out of your logs before they ever land in your log aggregator.
Assume the database will be stolen. That's the test for whether your credential storage is good. If your database leaks tonight, what does the attacker learn? Hashes? Or passwords?

The whole topic comes down to the difference between secrets and data. Data lives in databases, gets serialized into responses, shows up in logs, gets debugged in console statements. Secrets cannot do any of that. The skill is recognizing when something is a secret and storing it accordingly, every single time.

New post on why testing is no longer optional in the new world of agentic coding, how to start when you have zero tests, and what to do when the code resists being tested at all.

Ian Johnson — Thu, 14 May 2026 14:14:59 +0000

Automated tests are required now

Ian Johnson — Thu, 14 May 2026 14:11:44 +0000

Many teams still test their software by having a person click around in a staging environment and report bugs back. Or by a developer doing the same thing in a local environment. This has always been a bad practice. It's becoming an untenable one.

For a long time, the case for an automated test suite was a productivity argument. You could ship without tests (plenty of companies did, plenty still do) but you paid for it in slower iteration, scarier deploys, longer regression cycles, and the slow accretion of fear around the parts of the codebase nobody wanted to touch. Manual QA worked, in the sense that it caught some bugs some of the time. It just didn't scale: every new feature meant a longer test plan, every release meant a longer freeze, and every refactor was a gamble.

Then agents started writing meaningful amounts of the code.

What changes when code volume goes up

A coding agent will happily produce more code in an hour than a developer used to ship in a week. If you're paying attention, that should sound less like a productivity win and more like a stress test on every part of your engineering process that wasn't designed for that volume. Code review becomes a bottleneck. Manual QA becomes a much worse bottleneck; the human in the loop who has to click through forty flows after every change is now the slowest moving part of a system whose other parts have all gotten dramatically faster.

But it's not just speed. It's signal. An agent that writes a change has no way to know whether the change works. It can read the code, it can reason about it, but it cannot verify it until something runs. If the only thing that can tell the agent whether the change works is a person opening the app in staging, then either you've put a human on the critical path of every single change, defeating the point, or the agent ships the change without proof and you find out later, in production, that something subtle broke.

Tests give the agent a way to prove its work. That's what they've always done for humans, too: the value didn't change, the volume did. But when you have agents producing code at a rate that manual QA can't possibly keep up with, "we'll just test it by hand" stops being a tradeoff and starts being a non-answer.

Agents reproduce what they find

There's a second, quieter problem. Agents pattern-match on the codebase they're working in. If the codebase has tests, the agent will write tests, because that's the convention. If the codebase has no tests, the agent will not write tests, because that's also the convention. The codebase teaches the agent what "done" looks like.

This means an untested codebase doesn't just lack tests — it actively trains every contributor, human or otherwise, that tests aren't part of the work. The longer this goes on, the more entrenched it gets, and the harder it is to break out of, because the new code being written assumes the absence of tests and is shaped in ways that make testing harder.

"But we have no tests"

This is the most common reason teams stay untested: the existing codebase wasn't built for it, and the gap between zero and "tested" looks impossibly large. It isn't, and you don't have to close it all at once.

Start with characterization tests. These are tests that don't try to specify what the code should do — they pin down what it currently does. You run the existing code, you observe its outputs, you write a test that asserts those outputs. The test is now a tripwire: if you change the code's behavior, even by accident, the test will tell you. It doesn't matter if the current behavior is right or wrong; you're not making a moral claim about the code, you're making a factual one about what it does today. Once you have characterization tests around the parts that matter most, you've bought yourself the ability to change those parts safely.
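A characterization test can be very plain. In the sketch below, legacy_shipping_fee is a made-up stand-in for old code nobody fully understands; the expected values in the test were obtained by running it, not by reading a spec.

```python
def legacy_shipping_fee(weight_kg, country):
    # Imagine this is inherited code with no documentation.
    fee = 5 + weight_kg * 2
    if country != "US":
        fee = fee * 1.5
    if weight_kg > 20:
        fee = fee - 3
    return fee


def test_characterize_shipping_fee():
    # These assert what the code does today, right or wrong.
    assert legacy_shipping_fee(1, "US") == 7
    assert legacy_shipping_fee(1, "CA") == 10.5
    assert legacy_shipping_fee(25, "US") == 52
    assert legacy_shipping_fee(25, "CA") == 79.5
```

Whether the heavy-parcel discount is supposed to apply after the international surcharge is a separate question. The test only guarantees you'll notice if that behavior changes.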

From there, you keep going. The next feature gets real tests. The next bug fix gets a regression test. The most-important, most-changed, most-feared module gets enough coverage that you can finally refactor it. You don't need 100% coverage — you need enough coverage in the right places that the work you're actually doing is protected.

"But we can't test"

The deeper objection isn't that there are no tests, it's that the code resists them. Functions are 600 lines long and reach out to half the system. Database calls are sprinkled inline. Globals are mutated from anywhere. The class you'd want to test takes a configuration object in its constructor that itself takes the entire universe. You've tried, and writing a single useful unit test required mocking nine things, and you're not even sure the test is testing what you thought it was.

This is real. Some codebases genuinely are structured in a way that makes testing painful. But "untestable" is almost always "untested in this shape," and the path from one to the other is the same iterative path you took to get coverage on the parts you could already test, just with a small refactoring step folded in.

The move you make over and over is: introduce a seam. A seam is a place where you can substitute behavior without changing surrounding code. You don't fix the whole module to test one piece of it; you isolate the piece you care about by pulling it through one well-chosen seam, and leave the rest alone for now. A handful of techniques come up again and again:

Extract a function. Pull a chunk of logic out of a larger function so it can be called on its own. The extracted function takes its inputs as parameters, returns a value, and is trivial to test. Often this single move is the entire refactor.
Pass dependencies in. Instead of reaching for a database, a clock, or an HTTP client inside the function, accept them as arguments. The production caller passes the real thing; the test passes a fake.
Wrap external systems behind a thin interface. Don't test directly against a library or service you don't control. Wrap it in your own small interface that says exactly what your code needs from it, and substitute a fake implementation in tests.
Parameterize the side effect. If a function reads a file, accept the contents as a parameter. If it asks the clock for now, accept now as a parameter. The "where it comes from" question moves up one layer; the function itself becomes pure.
Separate the decision from the action. Split "compute what should happen" from "make it happen." The decision function is pure and easy to unit-test; the action function is thin and verified by a smaller number of integration tests.
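Several of these seams tend to show up together. The sketch below, with invented names throughout, passes the clock and the email sender in as arguments and splits the decision from the action.

```python
from datetime import datetime, timedelta


def is_trial_expired(started_at: datetime, now: datetime, trial_days: int = 14) -> bool:
    # Decision: a pure function. The current time is a parameter, not a lookup.
    return now - started_at > timedelta(days=trial_days)


def expire_trial_if_needed(account: dict, now: datetime, send_email) -> bool:
    # Action: a thin wrapper. The email sender is a dependency passed in.
    if is_trial_expired(account["trial_started_at"], now):
        send_email(account["email"], "Your trial has ended")
        return True
    return False


# In a test, the fake sender just records what it was asked to do.
sent = []
account = {"email": "ada@example.com", "trial_started_at": datetime(2026, 1, 1)}
expire_trial_if_needed(account, datetime(2026, 2, 1), lambda to, subject: sent.append((to, subject)))
print(sent)  # [('ada@example.com', 'Your trial has ended')]
```

The production caller would pass datetime.now() and a real mail client. The function under test never needs to know the difference.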

The first test in a section like this is the most expensive: you're paying the seam-creation cost, the fake-setup cost, and the "what does this function actually do" cost all at once. The second test is dramatically cheaper because most of that work has already been done. By the fifth or sixth, you're moving at normal speed, and the code around your seam is meaningfully cleaner than it was before. That isn't a coincidence. Code that's easy to test tends to be code with explicit inputs, narrow responsibilities, and few hidden dependencies — the same properties that make code easy to read and easy to change. Working toward testability is working toward better design; the test is what tells you you've gotten there.

When even that's too hard at first, write a slower, broader test. An end-to-end test that drives the system through a real database is worse than a fast unit test in almost every way — slower, flakier, less precise about what failed — but it's better than no test. It gives you a tripwire. With the tripwire in place, you can refactor the inside toward something easier to cover with smaller tests, and you'll know if you broke anything along the way.

The honest version of "we can't test this code" is "we can't test this code without changing it." That's true, and the answer is: change it. Not all at once. Just the part you need to test today, in the smallest way that gives you a foothold. Tomorrow you'll have a foothold and a test, which is exactly the position you need to be in to take the next step.

The cycle that opens up

Once tests are in place, things you couldn't reasonably do before become possible. Refactoring becomes a normal activity instead of a heroic one, because the tests catch you when you slip. Test-driven development becomes available. You can write the test first, watch it fail, make it pass, and trust the result, which is a fundamentally different experience from writing code and hoping. Designs improve, because code that's easy to test tends to be code with clean boundaries and explicit dependencies, and writing tests pushes you toward that shape whether you intended it or not.

The system gets healthier in a way that compounds. It gets more predictable, because behavior is pinned down. More robust, because regressions get caught. More well-defined, because the tests become an executable specification of what the code is supposed to do. The codebase starts answering questions instead of raising them.

Every bug becomes a test

The other piece of the on-ramp is what you do every time something breaks. When a bug is reported — or worse, when one slips into production — the temptation is to fix the code and move on. Don't. Write the test first: the test that reproduces the bug, fails because of it, and passes once the fix is in. Now the bug isn't just fixed, it's fenced. That exact regression cannot happen again without something explicitly noticing.
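As a small illustration, suppose the bug report was "21 items at 10 per page shows only 2 pages." The function and the bug are hypothetical, but the shape of the test is the point: it is named after the failure and encodes the exact case from the report.

```python
def page_count(total_items: int, per_page: int) -> int:
    # Fixed version. The hypothetical bug was `total_items // per_page`,
    # which dropped the last, partially filled page.
    return (total_items + per_page - 1) // per_page


def test_partial_last_page_is_counted():
    # Fails against the old integer-division version, passes with the fix.
    assert page_count(21, 10) == 3
```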

Over time, this turns the test suite into an accumulated record of every mistake the system has ever made. The bugs that have already happened are unusually likely to happen again — the same subtle interaction, the same edge case, the same off-by-one — and each one you've fenced off is a class of failures that can no longer eat your time. A team that does this consistently will find its bug reports start looking different: fewer "this used to work," more genuinely new issues.

This composes naturally with characterization tests. Both pin down what is rather than specify what should be — one captures current behavior, the other captures broken behavior that's been corrected. Together they're how a codebase that started without tests becomes one with meaningful coverage where it matters.

The strictness point

The deeper thing tests give you is strictness. They are a forcing function: the code has to actually work, in a specific way, on specific inputs, every time the suite runs. Vague intentions don't pass tests. Hand-waving doesn't pass tests. "It worked when I tried it" doesn't pass tests. The bar is concrete and the bar is enforced automatically.

In a world where more and more of your code is being written by something that doesn't share your intuition, your context, or your sense of what "obviously shouldn't break" means, strictness is the thing that keeps the system coherent. Tests are one of the best-leveraged ways to get it. Types are another. Linters and formatters are smaller versions of the same idea. All of them push the codebase toward a state where the rules are explicit and the machine, not the reviewer's memory, enforces them.

Stop making excuses

Testing used to be a discipline you adopted to make your team more productive. It's becoming a discipline you adopt to keep your codebase functional at all. The teams that have tests are going to absorb the throughput of coding agents and turn it into shipped, working software. The teams that don't are going to drown in unverified changes and spend their time chasing bugs the suite would have caught.

You don't have to write all the tests today. You do have to start. Pick the most important module. Add characterization tests. Refactor under their cover. Move to the next module. Keep going. It's not optional anymore, and pretending it is just means the codebase will keep teaching everyone, including the agents, that testing isn't part of the job.

It is.

Use exceptions for (wait for it) exceptional things

Ian Johnson — Wed, 13 May 2026 14:58:41 +0000

You know the code. A function tries to do something, something goes wrong, and you see:

print(f"Error: could not load config from {path}")
sys.exit(1)

Or this:

result = fetch_user(user_id)
if result is None:
    return None
profile = fetch_profile(result.id)
if profile is None:
    return None
...

Or, in a slightly more sophisticated codebase, the function returns a tuple of (value, error) or a dict like {"ok": False, "error": "..."} and every caller has to remember to check it.

What these patterns share is that they're working hard to avoid using the feature the language built specifically for this situation: exceptions.

The avoidance is real

I don't have data, just years of reading other people's code. But the pattern is consistent: a lot of developers will reach for almost anything before raising an exception. They'll print and continue. They'll die or sys.exit. They'll return None and propagate it up by hand through six layers of callers. They'll catch an exception just to convert it into a boolean. They'll silently swallow it with a bare except. They'll handle the error inline, awkwardly, at exactly the layer that has no idea what to do about it.

I don't fully understand why. Some of it is taste: Go made error-as-value fashionable, Rust made Result<T, E> rigorous, and some of that vibe leaked into communities where exceptions are the idiomatic choice. Some of it is fear: exceptions feel like spooky action at a distance because they unwind the stack. Some of it is just forgetting they exist. Most tutorials teach you try/catch once in the introduction and then never bring it up again.

But exceptions exist for a reason, and the reason is good.

What exceptions actually buy you

The job of an exception is to separate the happy path from the error path. In the happy path, you write what the code is supposed to do, in the order it's supposed to do it, without interrupting yourself every two lines to check whether the last step worked. In the error path, you write what to do when things go wrong — but you write it once, at the layer that's actually equipped to handle the problem, not at every intermediate layer that just happens to be on the call stack.

Compare:

def load_user_dashboard(user_id):
    user = fetch_user(user_id)
    if user is None:
        return None
    profile = fetch_profile(user.id)
    if profile is None:
        return None
    recent = fetch_recent_activity(user.id)
    if recent is None:
        return None
    return build_dashboard(user, profile, recent)

with:

def load_user_dashboard(user_id):
    user = fetch_user(user_id)
    profile = fetch_profile(user.id)
    recent = fetch_recent_activity(user.id)
    return build_dashboard(user, profile, recent)

The second version reads like what the function does. If any step fails, the exception bubbles up to wherever you decided to catch it, probably a single try/except in the request handler that knows how to translate a UserNotFound or a DatabaseUnavailable into the right response. The intermediate layers are blissfully unaware that anything can go wrong, because they have nothing useful to contribute when something does.

That's the whole point. Errors propagate themselves. You catch them where the context to handle them exists. Everywhere else, your code gets to be about the thing it's actually about.
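Here is a minimal sketch of that single catch site. The exception classes, the in-memory user table, and the (status, body) return shape are all assumptions made for illustration, not any particular framework's API.

```python
class UserNotFound(Exception):
    """Hypothetical: raised when no user exists for the given ID."""


class DatabaseUnavailable(Exception):
    """Hypothetical: raised when the database cannot be reached."""


# Stand-in data store for the sketch.
_USERS = {1: "ada"}


def fetch_user(user_id):
    if user_id not in _USERS:
        raise UserNotFound(f"no user with id {user_id}")
    return _USERS[user_id]


def load_user_dashboard(user_id):
    # Happy path only: no error checks at this layer.
    user = fetch_user(user_id)
    return {"user": user}


def handle_dashboard_request(user_id):
    # The one place that knows how to turn failures into responses.
    try:
        return 200, load_user_dashboard(user_id)
    except UserNotFound as exc:
        return 404, {"error": str(exc)}
    except DatabaseUnavailable:
        return 503, {"error": "try again shortly"}
```

The middle function never mentions errors, and the handler is the only code that knows about status codes.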

And print, die, and bare exit?

These are worse than no error handling: they're error handling pretending to be helpful. A print followed by exit(1) decides, on behalf of every possible caller, that the right response to a problem is to dump a message to the console and kill the entire process. That's fine in a one-off script. In a library, a server, a long-running job, or anything called from another piece of code, it's a small disaster. The caller wanted to catch the error and retry, or log it with structure, or surface it to a user, or fall back to a default. Instead, the process died and there's a string somewhere in the output.

Raising an exception is the polite, composable thing to do. It says: something went wrong here, in this specific way; whoever called me can decide what to do about it.
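A rough sketch of what that looks like in practice. ConfigError, parse_config, and load_config_or_default are hypothetical names; the only real library call here is json.loads.

```python
import json


class ConfigError(Exception):
    """Hypothetical error type for this sketch."""


def parse_config(text):
    # Raise instead of print-and-exit: the caller decides what happens next.
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"invalid config: {exc.msg}") from exc


def load_config_or_default(text, default):
    # One caller chooses a fallback; another could retry or log instead.
    try:
        return parse_config(text)
    except ConfigError:
        return default
```

Had parse_config printed and exited, load_config_or_default could not exist. The fallback is only possible because the error arrived as something catchable.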

The "exceptional" part of the title

The other half of the joke is that exceptions are for exceptional things: situations that genuinely prevent the function from doing its job. They are not a control flow mechanism for ordinary, expected outcomes.

A user not being found by ID inside a system that just created them: exceptional. A user not being found when you're looking them up by an email someone typed into a form: expected. That's a normal outcome of the operation, and it should probably be None, a Maybe, or a domain-specific result type. Form validation failing: expected. A network blip while the database is restarting: exceptional. The file you were promised exists not existing: depends on the contract.

The rough test is: when this happens, can the immediate caller plausibly do something sensible about it as part of its normal logic? If yes, it's an expected outcome. Model it in the return type. If no, and the function genuinely couldn't fulfill its contract, raise.
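To make the split concrete, here is a small sketch with two lookups over the same data. The function names, the UserNotFound class, and the dictionary standing in for a database are assumptions for illustration.

```python
from typing import Optional


class UserNotFound(Exception):
    """Hypothetical: the caller was promised this user exists."""


_USERS_BY_ID = {1: {"id": 1, "email": "ada@example.com"}}


def find_user_by_email(email: str) -> Optional[dict]:
    # Expected outcome: a typed-in email may match nobody, so return None.
    for user in _USERS_BY_ID.values():
        if user["email"] == email:
            return user
    return None


def get_user_by_id(user_id: int) -> dict:
    # Exceptional outcome: the ID came from our own system, so a miss
    # means the contract is broken. Raise.
    try:
        return _USERS_BY_ID[user_id]
    except KeyError:
        raise UserNotFound(f"user {user_id} should exist") from None
```

The naming carries the contract too: find may come back empty, get either returns a user or raises.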

Used this way, exceptions stay rare, which keeps them meaningful. When you see a try/except in well-written code, it's a flag: something here can really go wrong, and someone thought about what to do about it. Used for ordinary control flow, they become noise, and the signal is lost.

The short version

Don't print and pray. Don't die. Don't smuggle errors through return types out of habit when the language has a perfectly good mechanism for them. Don't catch exceptions just to convert them into something less expressive. And don't use them for things that aren't actually exceptional.

Raise when your function genuinely can't do its job. Catch where you actually know what to do. Let the happy path be a happy path. That's what exceptions are for.

What is the domain and why is it so important?

Ian Johnson — Wed, 13 May 2026 14:22:31 +0000

Open a codebase you've never seen before and try to figure out what it does. Sometimes you stare at imports, configuration, error handling, and connection pools for an hour before you find the file where something actually happens: the file where money moves, an order ships, a patient gets scheduled. That file is the domain. Everything else is plumbing.

The domain is the part of your software that talks about the problem you're solving, in the terms of the people who have the problem. A banking system's domain is accounts, transactions, balances, transfers, interest accrual, overdrafts. A clinic's domain is patients, appointments, providers, visits, prescriptions, claims. A logistics company's domain is routes, shipments, drivers, manifests, exceptions. The domain has nothing to do with whether the database is Postgres or DynamoDB, whether the API is REST or gRPC, whether the deploy target is Kubernetes or a single VM. Those are real engineering decisions and they matter, but they aren't the thing. The thing is the business.

This distinction is hard to feel until you've worked on a codebase where it's been ignored. In such a codebase, an "order" isn't an object that knows how to be placed and canceled — it's a row that someone retrieved with a SQL query, mutated with a few setters, and saved with another query, all tangled with logging, retries, and authentication checks. To understand what happens when an order is placed, you have to read everything. The business logic is everywhere and nowhere.

In a codebase where the domain has been taken seriously, you can open Order.place() and read what placing an order actually means: stock is reserved, the customer is charged, a confirmation event is emitted, the order enters a pending state. You learn this without learning anything about HTTP, SQL, or message brokers. That's not an accident. That's design.
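A sketch of what such an Order.place() might look like. The fields, the inventory and payments collaborators, and the event name are all hypothetical; the point is only that each line maps to a business step.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    sku: str
    quantity: int
    amount_cents: int
    status: str = "new"
    events: list = field(default_factory=list)

    def place(self, inventory, payments):
        # Each line is a business step; no HTTP, SQL, or broker in sight.
        inventory.reserve(self.sku, self.quantity)
        payments.charge(self.amount_cents)
        self.events.append("OrderConfirmed")
        self.status = "pending"


class InMemoryInventory:
    """Stand-in collaborator so the sketch runs on its own."""

    def __init__(self, stock):
        self.stock = dict(stock)

    def reserve(self, sku, quantity):
        if self.stock.get(sku, 0) < quantity:
            raise ValueError(f"not enough stock for {sku}")
        self.stock[sku] -= quantity


class RecordingPayments:
    """Stand-in collaborator that just remembers what was charged."""

    def __init__(self):
        self.charges = []

    def charge(self, amount_cents):
        self.charges.append(amount_cents)
```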

Build a language, then program in it

There's a beautiful pattern Abelson and Sussman teach in Structure and Interpretation of Computer Programs: you don't solve your problem directly in the base language you started with. You use that base language to build a vocabulary suited to the problem, and then you express the solution in that new vocabulary. SICP keeps doing this, building little interpreters, evaluators, and abstractions, until "writing the program" looks more like "writing down the answer."

This is the same instinct as designing a domain. You're not really writing Python or TypeScript when you write the domain. Instead, you're writing the language of accounts, orders, and shipments, using Python or TypeScript as the substrate. The names of your types and functions are the vocabulary; the relationships between them are the grammar. Done well, the result reads almost like prose written by someone who understands the business.

DSLs are this idea, made explicit

A domain-specific language takes the pattern to its conclusion: design an actual little language for the problem and write the solution in it. SQL is a DSL for querying relational data. Regular expressions are a DSL for matching text. Make is a DSL for declaring build dependencies. Each one trades generality for the ability to say what it needs to say with almost no ceremony.

You don't need to invent a parser to get the benefits of DSL thinking. An "internal DSL" (sometimes called a fluent interface or an embedded DSL) is just code in your host language that has been shaped to read like the domain. When a routing library lets you write route("/users/:id").get(handleUser), that's an internal DSL for describing HTTP routes. When a testing library lets you write expect(user).toHaveRole("admin"), that's an internal DSL for expectations. The host language is still there, but it has been bent until the surface of the code is domain vocabulary, not language vocabulary.
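As a tiny illustration, here is roughly how an expectation-style internal DSL can be built in Python. The expect function and to_have_role method are made up for this sketch and do not come from a real testing library.

```python
class Expectation:
    def __init__(self, user):
        self._user = user

    def to_have_role(self, role):
        if role not in self._user["roles"]:
            raise AssertionError(
                f"expected role {role!r}, got {self._user['roles']}"
            )
        return self  # returning self lets calls chain


def expect(user):
    # Entry point of the DSL: reads as "expect(user).to_have_role(...)".
    return Expectation(user)
```

It is ordinary classes and methods underneath. The only trick is choosing names so the call site reads like a sentence about the domain.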

Hexagonal architecture and the domain at the center

Alistair Cockburn's hexagonal architecture (also called ports and adapters) gives this idea a concrete structural form. The domain sits at the center of the application. Around it are "ports": abstract interfaces describing what the domain needs from the outside world ("save this order somewhere," "notify the customer somehow") and what the outside world can ask of the domain ("place this order"). Around the ports are "adapters": the concrete implementations that connect ports to real technology, like a Postgres adapter, an HTTP adapter, or an SQS adapter.

The crucial property is the direction of dependency. The domain does not import the database client. The domain does not know that HTTP exists. The adapters know about the domain; the domain does not know about them. This means you can swap Postgres for DynamoDB, or REST for gRPC, or your message queue for a different message queue, without touching the part of the code that describes how the business works. And because the domain has no infrastructure dependencies, you can test it with plain function calls, no test containers required.
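The direction of dependency can be sketched in a few lines. OrderStore, place_order, and InMemoryOrderStore are hypothetical names; typing.Protocol is just one way to write the port down in Python.

```python
from typing import Protocol


class OrderStore(Protocol):
    """Port: what the domain needs from storage. Owned by the domain."""

    def save(self, order_id: str, status: str) -> None: ...


def place_order(order_id: str, store: OrderStore) -> str:
    # Domain logic: depends only on the port, never on a database client.
    status = "pending"
    store.save(order_id, status)
    return status


class InMemoryOrderStore:
    """Adapter: one concrete implementation. A Postgres-backed class
    with the same save() method could replace it without domain changes."""

    def __init__(self):
        self.rows = {}

    def save(self, order_id, status):
        self.rows[order_id] = status
```

Testing the domain is then a plain function call: build an InMemoryOrderStore, call place_order, and look at what ended up in rows.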

The test: can a domain expert read it?

The practical heuristic for all of this is readability and discoverability. Someone who understands the business — not necessarily a developer, but at least someone who knows what an order is and what it means to place one — should be able to open the domain code and roughly follow what it does. They shouldn't need to know what an ORM is. They shouldn't have to mentally filter out exception handling, transaction management, and retry logic. The names should be the names they already use. The operations should be the operations they already perform.

Discoverability is the other half. When a new developer joins the team and asks "where does the actual order-placing logic live?", there should be a short, satisfying answer: it lives here, in this folder, in these files. Not "well, some of it is in the controller, some in the service, some in the database stored procedures, and a critical piece is in this cron job." If the domain is scattered, nobody will fully understand it, and changes will be terrifying.

Everything technology-specific (the SQL, the HTTP status codes, the JSON serialization, the retry policies, the cache invalidation) gets pushed behind an abstraction. Not because those things are unimportant; they're often where the bugs live. But because mixing them with domain logic makes both harder to think about. Separated, each can be reasoned about on its own terms. The domain says what the business does. The adapters say how the technology cooperates. Both are clearer for the separation.

That's why the domain matters. It's where the value lives. It's where the bugs that cost real money live. It's what you're actually being paid to get right. The framework will be replaced. The database will be migrated. The cloud provider will be swapped. But the meaning of an order, the rules around a transaction, the constraints on a schedule? Well, those are the substance of the software, and they deserve to live in code that says, plainly and centrally, what they are.

Hexagonal Architecture Should Be Your Default

Ian Johnson — Tue, 12 May 2026 22:33:15 +0000

I think hexagonal architecture should be the default for almost any project bigger than a script. Not because it's trendy or because some book said so, but because the math is wildly in your favor: the cost is tiny and the payoff is large.

Let me make that case.

What you actually pay

Hexagonal architecture asks you for two things:

A port — an interface describing what your domain needs from the outside world. "I need something that can save a User." "I need something that can send an email."
An adapter — a concrete implementation of that port. The Postgres class that saves the user. The SendGrid client that sends the email.

That's it. That's the whole tax.

In statically typed languages, the port is a literal interface (or trait, or protocol). In dynamically typed languages, such as Python, Ruby, or JavaScript, you don't technically need to declare anything; duck typing handles it. I still recommend writing the contract down somewhere, even informally as a base class or a Protocol or just a comment block. The reason isn't the compiler. It's that when something goes wrong at 2 a.m. and you're staring at a stack trace, having an explicit named seam in the system makes "where did this go off the rails" a five-second question instead of a five-minute one.

Here's the part people undersell: the adapter is the code you were going to write anyway. You were going to call the database. You were going to hit Stripe. You were going to send the email. The only difference is that this code now sits behind one layer of indirection instead of being splattered through your business logic. You aren't writing extra code. You're writing the same code in a slightly more organized place.

So the real cost is: one interface (free in some languages), one small bit of wiring, and the discipline to call through it instead of around it.
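Here is about how much code that tax amounts to in Python. EmailSender, OutboxEmailSender, and WelcomeUser are hypothetical names, and the adapter only appends to a list; a real one would wrap whatever mail client the project already uses.

```python
from typing import Protocol


class EmailSender(Protocol):
    """Port: the contract, written down so the seam has a name."""

    def send(self, to: str, subject: str, body: str) -> None: ...


class OutboxEmailSender:
    """Adapter (hypothetical): collects messages instead of sending them."""

    def __init__(self):
        self.outbox = []

    def send(self, to, subject, body):
        self.outbox.append((to, subject, body))


class WelcomeUser:
    """Domain code: its only outside dependency is visible in __init__."""

    def __init__(self, sender: EmailSender):
        self._sender = sender

    def __call__(self, email: str) -> None:
        # Call through the port instead of around it.
        self._sender.send(email, "Welcome", "Thanks for signing up.")
```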

What you get

A lot. Here's the short list.

Testability. This is the big one. Swap your real adapters for in-memory fakes and your domain logic becomes testable in milliseconds, with no database, no network, no Docker container. Tests that used to take a minute take a second. Tests that were flaky stop being flaky because the flake lived in the infrastructure, not the logic.

Swappability. You can replace Postgres with SQLite, SendGrid with Mailgun, REST with gRPC, and your domain doesn't notice. You probably won't do this often, but when you have to, it's a contained change instead of a rewrite. More importantly, you can do it gradually: run both adapters side by side, migrate slowly, roll back trivially.

Predictability. The seams are explicit. You can look at a domain class and know exactly what it depends on — it's right there in the constructor. You can look at the ports and know exactly what the outside world is allowed to do to your domain. Surprise interactions get rare because there are fewer places for them to hide.

Deferred decisions. You can build the domain before you've picked a database. You can model the business before you've picked a queue. Decisions that used to block work become decisions you can make later, with more information, when they actually matter.

Onboarding. New devs read the ports and immediately see what the system needs and what it talks to. The shape of the application is visible in one folder.

Forced clarity. Having to name a port (e.g., "what does my domain actually want from a payment processor?") makes you think about the domain on its own terms instead of in terms of whatever vendor you happen to be using this quarter. You end up with a PaymentGateway interface that says what you need, not a StripeClient reference that says what Stripe offers. Those are different things and the first one ages much better.

Add it up: for the price of an interface and a small bit of wiring, you get fast tests, replaceable infrastructure, clearer code, faster onboarding, and the option to defer decisions. That's an absurdly good trade.

It works for almost any project

People associate hexagonal with web backends, but the pattern is general. Anything that has logic that needs to talk to stuff benefits from separating the two.

CLIs. The domain doesn't care whether it was triggered by argv, a config file, or a cron entry. Inbound adapter parses the inputs, calls the domain, formats the output.
APIs and web services. Obvious fit. More on this below.
Background workers. Same as a web app, just with a queue consumer as the inbound adapter instead of an HTTP controller.
Desktop and mobile apps. Domain doesn't care if the UI is SwiftUI, Qt, or a terminal. The view is an adapter.
Data pipelines. Sources and sinks are adapters. Transformations are domain.
Even libraries. Lighter touch, but the principle of "separate the logic from the IO" still applies.

Where it isn't worth it: one-off scripts, throwaway prototypes, and pure CRUD apps where the "domain" is literally just shuffling rows in and out of a table. In those cases the indirection genuinely costs more than it saves. But that's a smaller share of real-world projects than people pretend.

How I use it for web apps

Concretely, here's the shape:

Domain in the middle. Entities, value objects, and use cases (some folks call them application services, services, interactors, whatever). The use cases are the public API of your application — they're verbs. RegisterUser. PlaceOrder. CancelSubscription. Each one orchestrates the domain to accomplish one meaningful thing.

Inbound adapters at the top. HTTP controllers, GraphQL resolvers, queue consumers, CLI commands. Their job is small and dumb: parse the input, call a use case, format the response. If a controller has business logic in it, that's a smell — push it down into a use case where it can be tested without spinning up a request.

Outbound adapters at the bottom. Repositories that wrap the database. Clients that wrap external APIs. Email senders, file storage, queue publishers, the clock. Each one implements a port the domain owns.

The framework is an adapter, not the architecture. This is the mental shift that matters most. Rails, Django, Express, Spring — these are inbound adapters. They live at the edge. They're not the center of your app; your app's domain is the center, and the framework is a thing that lets the outside world talk to it. Treating the framework this way means your domain isn't married to it, your tests don't need it, and version upgrades don't ripple through your business logic.

What you end up with, in practice:

Controllers stay thin. A few lines each. If a controller is long, work is in the wrong place.
Use cases are testable in isolation with fake adapters. Fast, deterministic, no infrastructure.
Adapters have their own integration tests that hit the real things (actual database, actual HTTP) to confirm the wiring works. Sparse, focused, not where the bulk of your test coverage lives.
End-to-end tests stay rare and exist only for the few flows you really want to smoke-test through the whole stack.
Adding a new entrypoint is cheap. Need a CLI version of something the API does? Write a new inbound adapter and call the same use case. Need a webhook? Same.
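A sketch of that shape, with one use case and two inbound adapters. Everything here is hypothetical: the RegisterUser class, the dict standing in for a repository port, and the (status, body) tuple standing in for a framework response. The only real library used is argparse.

```python
import argparse


class RegisterUser:
    """Use case: a verb the application exposes."""

    def __init__(self, users):
        self._users = users  # dict-like store standing in for a repository port

    def __call__(self, email):
        if email in self._users:
            raise ValueError(f"{email} is already registered")
        self._users[email] = {"email": email}
        return self._users[email]


def http_register(request_json, register_user):
    # Inbound HTTP adapter: parse, call the use case, format.
    try:
        user = register_user(request_json["email"])
    except ValueError as exc:
        return 409, {"error": str(exc)}
    return 201, user


def cli_register(argv, register_user):
    # Inbound CLI adapter: same use case, different edge.
    parser = argparse.ArgumentParser(prog="register")
    parser.add_argument("email")
    args = parser.parse_args(argv)
    user = register_user(args.email)
    return f"registered {user['email']}"
```

The duplicate-email rule lives in one place, so both entrypoints get it without either adapter knowing it exists.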

Wrap up

The price of hexagonal architecture is one interface and one class — and you were going to write the class anyway. In exchange you get fast tests, swappable infrastructure, clearer boundaries, deferred decisions, and a codebase that survives changes in vendors, frameworks, and team members.

I don't think it should be the default for everything. Scripts and trivial CRUD genuinely don't need it. But for almost anything that's going to live longer than a few months or be touched by more than one person, the indirection pays for itself within the first month.

Pay the small tax. Get the big benefits. Hard to think of an easier call.

Why I Prefer Chicago-Style TDD

Ian Johnson — Tue, 12 May 2026 16:04:14 +0000

There are two big schools of TDD, and most devs end up in one without ever really picking sides. Quick recap: London-style (mockist) drives design by mocking collaborators and asserting on interactions. Chicago-style (classicist) builds from the inside out using real objects and asserting on values and state.

I'm firmly in the Chicago camp. Here's why, and how it pairs really nicely with hexagonal architecture.

Mocks belong at boundaries. That's it.

I'm not anti-mock. Mocks are great when you're crossing a boundary you don't control or don't want to hit in a test: the database, an HTTP API, a message queue, the clock, the filesystem. Anything where the alternative is slow, flaky, or has side effects you can't take back.

Inside the boundary? Use real objects. If two domain classes collaborate, let them collaborate. Build them up, call the method, assert on what came out. That's the test.

The moment you start mocking your own internal types, you've stopped testing your system and started testing your assumptions about how your system should talk to itself. Those are very different things.

Mocking hides real problems

When you mock a collaborator, you're hand-writing what you expect it to return. That mock will happily return whatever you told it to, forever, even if the real thing's contract changed three refactors ago. Your test stays green. Production breaks.

Real objects don't let you get away with that. If the collaborator's behavior changes, the test that uses it will notice, because it's actually running through it. You get an early, honest signal. Mocks give you a comfortable, dishonest one.
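A contrived sketch of that drift, using unittest.mock from the standard library. TaxCalculator and total_with_tax are invented for the example, as is the story that the rate changed from a percentage to basis points during a refactor.

```python
from unittest.mock import Mock


class TaxCalculator:
    def rate_for(self, region):
        # Contract changed in a refactor: now returns basis points (2000),
        # where it used to return a percentage (20).
        return 2000


def total_with_tax(amount, region, calculator):
    # Still written against the old contract (percentage).
    return amount + amount * calculator.rate_for(region) // 100


# Mocked collaborator: scripted with the old contract, so the result
# still looks right and a test built on it stays green.
stale_mock = Mock()
stale_mock.rate_for.return_value = 20
mocked_total = total_with_tax(100, "EU", stale_mock)

# Real collaborator: the same call exposes the mismatch immediately.
real_total = total_with_tax(100, "EU", TaxCalculator())
```

With the mock, the total comes out as the 120 the test expects. With the real object it comes out as 2100, which is the number production would actually see.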

Good tests don't care about implementation. Mocks force you to care.

This is the part that bugs me most. The whole point of a test, in my view, is "given this input, I expect this output (or this state change)." That's it. I shouldn't have to know, or care, how the code gets there. That freedom is what makes refactoring safe.

Mock-heavy tests destroy that. Now your test is asserting things like "the service called repository.findById exactly once with this argument, then called mapper.toDto, then…" You've baked the implementation into the test. The minute you reorganize the internals, even if behavior is identical, your tests light up red. That's not a useful signal. That's friction.

And here's the kicker: with strict mocking, you don't even need to implement the thing correctly. As long as the calls match the expectations, the test passes. You can satisfy a contract without honoring it. I find that genuinely unsettling. The test isn't proving the code works; it's proving the code makes the right phone calls.
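Here is a small, deliberately broken example of that. apply_discount and the repository's find_discount method are hypothetical; the mock calls are from the standard library's unittest.mock.

```python
from unittest.mock import Mock


def apply_discount(price_cents, repository):
    discount = repository.find_discount("SPRING")
    # Bug: the discount is added instead of subtracted.
    return price_cents + discount


repository = Mock()
repository.find_discount.return_value = 500

result = apply_discount(2000, repository)

# Interaction-style check: passes despite the bug.
repository.find_discount.assert_called_once_with("SPRING")

# Value-style check catches it: the correct answer is 1500.
value_check_passes = result == 1500
```

The interaction assertion is satisfied because the right call was made. Only the comparison against the expected value notices that the arithmetic is wrong.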

Use real objects. Fake the rest.

My default is: real objects everywhere I can get away with it. When I can't (boundaries, again), I reach for fakes before mocks.

A fake is a real, working implementation that's just simpler. An in-memory repository that stores things in a map instead of Postgres. An email sender that appends to a list instead of calling SendGrid. The fake has actual behavior (you can put things in and get things out) so your test exercises a real interaction, not a scripted one.

For things I need to observe (was this notification sent? did we publish an event?), I use spies. A spy records what happened so I can assert on it after the fact, without dictating the shape of every internal call up front.

Then I test on values. Did the function return what it should? Is the system in the state I expect? Did the right event end up in the spy's recorded list? That's it. No "verify was called with." No call-order assertions. Just inputs and outputs and observable state.
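Roughly what that looks like, as a sketch. The repository, the publisher, and register_user are hypothetical stand-ins; nothing here needs a mocking library.

```python
class InMemoryUserRepository:
    """Fake: real behavior, simpler storage."""

    def __init__(self):
        self._users = {}

    def add(self, user_id, email):
        self._users[user_id] = email

    def get(self, user_id):
        return self._users.get(user_id)


class SpyEventPublisher:
    """Spy: records what was published so the test can inspect it later."""

    def __init__(self):
        self.published = []

    def publish(self, event):
        self.published.append(event)


def register_user(user_id, email, users, events):
    users.add(user_id, email)
    events.publish(("UserRegistered", user_id))
    return email


def test_register_user():
    users = InMemoryUserRepository()
    events = SpyEventPublisher()

    result = register_user(7, "ada@example.com", users, events)

    # Assert on values and observable state, not on call patterns.
    assert result == "ada@example.com"
    assert users.get(7) == "ada@example.com"
    assert events.published == [("UserRegistered", 7)]
```

If register_user were rewritten to publish before saving, or to go through a helper, this test would not change.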

Hexagonal architecture makes this almost free

If you've used hexagonal architecture (a.k.a. ports and adapters), most of the work for Chicago-style TDD is already done.

Quick refresher: your domain lives in the middle. It defines ports — interfaces that describe what it needs from the outside world (a UserRepository, a PaymentGateway, a Clock). Adapters are the concrete implementations that plug into those ports: a Postgres adapter, a Stripe adapter, a system clock.

The domain doesn't know or care which adapter it's running against. It just talks to the port.

In production, you wire up the real adapters. In tests, you wire up fakes: an in-memory UserRepository, a FakePaymentGateway that records charges, a FixedClock you control. Same port, different adapter. The domain has no idea anything changed, which is exactly what you want.
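A sketch of the same-port, different-adapter idea for a clock and a payment gateway. The class names follow the ones mentioned above, but their methods, the renew_subscription function, and the 999-cent price are assumptions for illustration.

```python
from datetime import datetime, timezone


class SystemClock:
    """Production adapter for a hypothetical Clock port."""

    def now(self):
        return datetime.now(timezone.utc)


class FixedClock:
    """Test adapter: returns whatever time the test chose."""

    def __init__(self, moment):
        self._moment = moment

    def now(self):
        return self._moment


class FakePaymentGateway:
    """Test adapter: records charges instead of calling a provider."""

    def __init__(self):
        self.charges = []

    def charge(self, customer_id, amount_cents):
        self.charges.append((customer_id, amount_cents))


def renew_subscription(customer_id, expires_at, clock, gateway):
    # Domain logic: charge only if the subscription has lapsed.
    if clock.now() >= expires_at:
        gateway.charge(customer_id, 999)
        return "renewed"
    return "still active"
```

In a test, you pass FixedClock and FakePaymentGateway and then read gateway.charges. In production, the same function would get SystemClock and a real gateway adapter.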

What you get:

Real domain logic actually runs in your tests. No mock-puppeteering. The classes you ship are the classes under test.
Fast tests. No DB, no network, no sleeps. In-memory fakes are essentially free.
Parallel ready. Since there is no contention for the database or network, the tests can be parallelized much more easily.
Tests that survive refactoring. Move methods around, rename internals, split a class in two — as long as the port contracts hold and the outputs match, your tests stay green.
Mocks stay at the edges, where they belong. And often you don't even need them, because a well-written fake adapter does the job better.
The fakes become a design tool. If a fake adapter is painful to write, that's usually the port telling you it's badly shaped. Listen to it.

The architecture and the testing style reinforce each other. Hexagonal pushes side effects to the edges; Chicago-style TDD wants the middle to be real and the edges to be swappable. Same idea from two angles.

Wrapping up

Test the behavior of your system through its real objects. Push side effects to the boundary. Swap those boundaries for fakes when you test. Assert on values and state, not on call patterns.

You end up with tests that tell you when something is actually broken, stay quiet when you refactor, and let you change your mind about implementation without paying a tax. That's the whole job.

Mocks are a tool. A useful one, at the edges. But if they're showing up all through your test suite, your tests have stopped describing what your software does and started describing how you currently happen to have written it. Those two things should never be the same.
