Why We Mock Everything

#testing #architecture #softwaredevelopment #freebsd

On Second Thought — Episode 09

A new test file is opened. Before a single assertion is written, a world is constructed: a fake database, a fake clock, a fake mailer, a fake queue. Forty lines of fakes, two lines of actual test. A test with no mocks at all would look, to most teams in 2026, almost negligent. This is the ninth episode of On Second Thought, a series about the daily routines we perform without ever quite deciding to. Today's routine is the one that begins every test we write: we fake the world in order to examine our own logic.

A note on what this is not. I have written before, in another series, that a test built entirely of mocks verifies your mocks rather than your software, and that the integration test you avoided would have caught the broken checkout. That remains true, and it is the easy half of the argument. This essay is about the harder half, the question the mock-versus-integration debate steps around: why does the code need to be mocked at all?

The Axiom

The reflex is universal. To test a unit, first isolate it; to isolate it, fake everything it touches. The mock is so deeply routine that a test which simply calls a function with values and checks the result looks suspiciously simple, as though the author had forgotten to do the hard part. We have arrived at a place where constructing a parallel universe of fakes is the normal cost of examining a single function, and where not doing so reads as negligence.

That feeling, that a mock-free test is somehow incomplete, is the topic.

The Origin

The unit test was born pure, and its purity is the part we forgot.

Kent Beck wrote SUnit for Smalltalk in 1994, and with Erich Gamma adapted it into JUnit in 1997. The code it was designed to test was library code: parsers, algorithms, data structures, pure functions. You call a function, you get a result, you compare the result to what you expected. There was nothing to mock because there was no world to touch. A sorting routine does not have a database. A parser does not send email. The "unit" in unit test meant a unit of logic with stable inputs and stable outputs, and the technique fitted it exactly.

Then the practice was carried, wholesale, into application code. And application code is nothing but world. An endpoint reads a request, queries a database, calls a payment API, writes to a queue, reads the clock, sends an email, and returns a response. There is no pure unit hiding in there to isolate. There is a sequence of side effects with a little logic distributed between them.

To keep calling tests of this code "unit" tests, we had to fake the world it lived in. So we mocked the database, mocked the API, mocked the clock, mocked the mailer. And because you cannot mock what you cannot replace, dependency injection arrived to give every collaborator a seam through which a fake could be inserted. We then reshaped our code, often considerably, so that the fakes would fit: interfaces extracted for single implementations, constructors widened to accept their own dependencies, factories and containers introduced to wire it all together. A great deal of the architecture of a modern application exists so that its tests can substitute fakes for its collaborators.

The mockist tradition made this explicit and respectable. Martin Fowler's essay "Mocks Aren't Stubs" (first published in 2004, revised in 2007) named the two camps: classicists, who use real objects where they can and test final state, and mockists (the London school, codified in Freeman and Pryce's Growing Object-Oriented Software, Guided by Tests, 2009), who test interactions by mocking collaborators. The London school is a coherent and disciplined practice. But its centre of gravity is the assumption that mocking collaborators is the normal way to test, and from there the industry generalised: by the late 2010s, in a great deal of writing and tooling, "unit test" had quietly come to mean "test with the dependencies mocked out". The redefinition happened without a vote.

All of this was a reasonable response to a real constraint. If logic and I/O are interleaved, mocking is genuinely the only way to exercise the logic in isolation. The origin of the routine is not foolish. The unasked question is one level down: we accepted that logic and I/O had to be interleaved, and then spent two decades building tools to test around the interleaving, rather than asking whether the interleaving was necessary.

The Cost

The cost arrives in three layers, and only the first is the one usually discussed.

The first layer is the daily friction. The mock setup runs to forty lines; the assertion, to two. The ratio is not an exaggeration; it is the ordinary shape of a test for a method with four collaborators. The setup is also fragile in a particular and maddening way: because mockist tests assert on how the unit calls its collaborators, they are coupled to the structure of the code, not to its behaviour. Rename a method, change the order of two calls, extract a helper, and fifty tests turn red although the observable behaviour has not changed by one bit. The tests have become a tax on refactoring, which is precisely the opposite of what tests are for. A test suite that punishes you for improving the code is training you not to improve it.
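The coupling to structure can be shown in miniature. What follows is a minimal sketch, not anyone's production code: register_v1, register_v2 and interaction_test are hypothetical names invented for the illustration, and it assumes the two calls are independent, so that swapping them leaves the observable outcome unchanged. It uses Python's standard unittest.mock.

```python
from unittest.mock import Mock, call


def register_v1(repo, mailer, email):
    """Original structure: save the user, then send the welcome mail."""
    repo.save(email)
    mailer.send(email, "Welcome")


def register_v2(repo, mailer, email):
    """Refactored structure: the same two effects, in the other order."""
    mailer.send(email, "Welcome")
    repo.save(email)


def interaction_test(register):
    """Mockist-style check: passes only for one exact sequence of calls."""
    world = Mock()  # a parent mock records calls made on its children, in order
    register(world.repo, world.mailer, "a@example.org")
    return world.mock_calls == [
        call.repo.save("a@example.org"),
        call.mailer.send("a@example.org", "Welcome"),
    ]
```

interaction_test(register_v1) passes and interaction_test(register_v2) fails, although both versions save the same user and send the same mail. The test has measured the order of the calls, not the result.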

The second layer is false confidence, and it is the one that reaches production. You have four hundred tests. They are green. You deploy. The checkout is broken. Not one test exercised the real flow, because every dependency was a mock, and the mocks behaved impeccably. The real database had a constraint the mock did not model; the real API returned a shape the mock did not return; the real clock crossed a daylight-saving boundary the mock never saw. The green suite certified the fidelity of your fakes to your assumptions, which is not the same thing as the correctness of your software, and was never going to be.

The third layer is the one this series exists to point at, because nobody bills it. We add interfaces, seams, injection points and indirection that exist for no reason whatsoever except to receive a mock. The design grows extra joints so the fakes have somewhere to clip on. A class acquires an interface with exactly one production implementation, purely so a test can supply a second. A function takes its dependencies as parameters it never varies in production, purely so a test can vary them. The shape of the code is being dictated by the testing strategy, and the testing strategy assumes mocks, so the code is being shaped, quietly and pervasively, by the need to be mockable. The number of mocks in a test is a fairly precise measurement of how thoroughly computation and I/O have been stirred together in the code under test. A high mock count is not a property of the test. It is a reading taken from the design.

The Question

Here is the part worth sitting with, and it starts somewhere unfashionable: the most testable software ever written has no mocks in it at all.

A Unix filter is the example. grep, awk, sort, cut, tr, the small sharp tools that fill the FreeBSD base system, are pure in the way that matters: text in, text out. You test one by handing it input and reading what comes back. printf 'b\na\n' | sort returns a then b, and that is the whole test, with no fake filesystem, no mock kernel, no injected clock, because the filter does not reach into the world. The kernel opened the files; the shell wired the pipes; redirection moved the bytes. The filter only computed. Isolation was not achieved by faking the world; it was achieved by not touching it.
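The same test can be written as a program. This is a sketch under one loud assumption: a POSIX sort is available on the PATH. The helper name run_filter is invented here; subprocess.run and its input, capture_output, text and check parameters are standard Python.

```python
import subprocess


def run_filter(argv, text):
    """Feed text to a filter's stdin and return what it writes to stdout."""
    result = subprocess.run(
        argv,
        input=text,           # becomes the filter's standard input
        capture_output=True,  # collect stdout (and stderr) instead of printing
        text=True,            # exchange str rather than bytes
        check=True,           # raise if the filter exits non-zero
    )
    return result.stdout


# Assumes a POSIX `sort` on the PATH; nothing is faked, the real tool runs.
sorted_text = run_filter(["sort"], "b\na\n")
```

There is no setup beyond the input itself. The value handed in and the value read back are the entire test.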

That arrangement is older than the vocabulary we now use for it. Doug McIlroy's pipe (1973) and the filter model it enabled are, in modern terms, a functional core with an imperative shell: the programs are the pure core, and the shell, the redirection, the kernel's file handling are the impure shell that does the talking to the world. The Unix tradition separated computation from I/O at the level of the operating system, four decades before anyone needed to give the pattern a conference title.

The conference titles, when they came, described the same shape. Gary Bernhardt's talk "Boundaries" (SCNA, 2012) named it functional core, imperative shell: put all the decisions and logic in a pure, value-in-value-out core, and confine the side effects to a thin imperative shell wrapped around it. Alistair Cockburn's hexagonal architecture, which he renamed ports and adapters in 2005, draws the same line differently: the application core talks only to abstract ports, and the adapters that touch the database, the network and the user interface live at the edges.

The consequence for testing is the entire point. A pure functional core needs no mocks, because it touches nothing: you test it the way you test sort, with values in and values out, and the tests assert on behaviour rather than on interactions, so they survive refactoring. The imperative shell does touch the world, but there is very little of it, and what there is gets tested with a small number of real integration tests against the real database and the real API, the few tests that actually catch the broken checkout. The mock-heavy world and the mock-free world test the same application. One of them spends forty lines faking a universe; the other spends none, because it separated the logic from the universe first.
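A small sketch of that shape, with every name hypothetical: a late-fee calculation stands in for the application logic, and the shell is reduced to standard input, standard output and the system clock. It is an illustration of the pattern under those assumptions, not Bernhardt's or Cockburn's own code.

```python
import sys
from datetime import datetime


def late_fee_cents(due: datetime, returned: datetime, daily_cents: int) -> int:
    """Functional core: values in, value out. It never reads the clock itself."""
    days_late = (returned.date() - due.date()).days
    return max(0, days_late) * daily_cents


def main() -> None:
    """Imperative shell: the only code that touches the clock, stdin or stdout."""
    now = datetime.now()
    for line in sys.stdin:
        if line.strip():
            due = datetime.fromisoformat(line.strip())
            print(late_fee_cents(due, now, daily_cents=50))


# Testing the core needs no fake clock: the time is just another argument.
fee = late_fee_cents(datetime(2024, 3, 1), datetime(2024, 3, 4), daily_cents=50)
```

The core is tested with plain datetime values. The shell is a few lines with no decisions in them, small enough to cover with a handful of runs against real input.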

So the honest question is not how to write better mocks, or where exactly the line between unit and integration should fall. Those are questions about the symptom. The question underneath is about the design: what if the urge to mock was never a testing problem at all, but the code quietly confessing that the computation and the world were stirred together, and could have been kept apart?

A mock, on second thought, is not a tool. It is a reading on a gauge. It tells you, with some precision, how far your logic has been allowed to wander into the world, and how much of the world you must now counterfeit to get it back.


By Vivian Voss, System Architect & Software Developer.

Top comments (1)

Mikhail Golikov • May 29

The "mock everything" reflex bit my team last year. We were testing a conversational interface with deeply mocked LLM responses, and the suite passed every run while production kept failing on multi-turn coreference. Mocks had been frozen against the "happy turn 1" response shape; turns 3-7 had drifted in production and our fakes never caught up.

The fix was contract-testing the mock generators against real production traffic samples weekly. A mock is fine until the boundary it stands in for changes shape without telling you. Curious how you handle staleness detection on your mock layer.