Ian Johnson

Posted on May 27

A Test Pyramid That Earns Its Confidence

#webdev #testing #tdd #devex

Most test pyramids are aspirational. People draw the triangle, point at "many unit tests, fewer integration tests, very few E2E tests," and then the actual suite turns out to be a pile of slow, overlapping tests that verify the same logic five different ways at five different layers. Push a branch, wait twenty minutes, come back to a flake.

The pyramid is a real concept. The way most teams implement it is not.

I want to talk about the shape of a pyramid that earns its confidence: what each layer is actually for, why duplication across layers is the failure mode, and how environment and data setup differ at each level so each tier stays in its lane.

The principle: nested layers, each validating its own concern

Think of the suite as an onion. The center holds the smallest, fastest, most numerous tests. Each layer wraps the previous one, adding a new concern. The outermost layer holds the slowest tests, and the fewest.

The mistake almost every team makes is treating the layers as redundant. "Cover this with a unit test, then again with an integration test, then again with an E2E test, just to be sure." That looks thorough on a coverage report. It is the source of half the CI pain in our industry.

The correct frame is this: each layer validates the concern that only it can validate, and trusts the inner layers to have validated theirs.

A use-case test trusts the adapters work because the adapter tests validated that. An adapter test trusts the use case calling it has correct domain logic because the use-case tests validated that. A frontend test trusts the API contract is honored because the controller tests at the use-case layer validated that. An end-to-end test trusts everything below it has been verified and only confirms that the pieces, when assembled, actually compose.

If you re-test the inner layer's concern at the outer layer, you are paying the cost of the outer layer (slower, less deterministic, more setup) for a verification you already had. Worse, you are coupling outer-layer tests to inner-layer implementation, so a refactor inside breaks both.

The onion only works if each layer respects what is below it.

The four layers, from center out

Here is the actual shape, with the technologies I use. The technologies change project to project. The shape does not.

Layer 1 — Backend "use-case" tier

This is the center of the onion. It is where almost every test lives.

What it tests. Domain logic. Use cases (Actions, in the hexagonal sense). Controllers. Jobs. Console commands. Middleware. Policies. Resources. Anything that is part of the application's reasoning about its own state.

How. Every external dependency is hidden behind a port — an interface owned by the application, named for the use case rather than the technology. In tests, the production adapter is swapped for an in-memory or recording fake that implements the same interface. The collaborator graph is real; the substrate is RAM.

Environment. None. There is no database. There is no Redis. There is no network. There is no migration step. There is no shared fixture. Each test sets up exactly the state it needs by directly seeding the in-memory fakes it depends on, runs the code under test, and asserts on either the return value or the recorded state of an outbound fake.

Data. Per test. A test that needs a user with two orders creates that user and those orders directly on the in-memory repository — three lines, no factories. A test that needs to verify an email was sent reads the array of mailables that the RecordingMailer collected. The whole setup-act-assert loop is in the test file, visible end to end.
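Here is a minimal sketch of that shape. It is written in TypeScript rather than the backend's own language so that the examples in this post share one language, and it is an illustration, not code from my suite. RecordingMailer is the idea named above; every other name (ForReadingOrders, ForSendingMail, InMemoryOrders, SendOrderSummary) and the example address are hypothetical.

```typescript
// Ports: interfaces owned by the application, named for the use case.
interface Order { id: number; userId: number; totalCents: number }

interface ForReadingOrders {
  forUser(userId: number): Order[];
}

interface ForSendingMail {
  send(to: string, subject: string, body: string): void;
}

// In-memory fake: the substrate is RAM, seeded directly by the test.
class InMemoryOrders implements ForReadingOrders {
  private rows: Order[] = [];
  add(order: Order): void {
    this.rows.push(order);
  }
  forUser(userId: number): Order[] {
    return this.rows.filter((o) => o.userId === userId);
  }
}

// Recording fake: collects outbound mail so the test can assert on it.
class RecordingMailer implements ForSendingMail {
  readonly sent: { to: string; subject: string; body: string }[] = [];
  send(to: string, subject: string, body: string): void {
    this.sent.push({ to, subject, body });
  }
}

// Hypothetical use case under test: mail a user a summary of their orders.
class SendOrderSummary {
  private readonly orders: ForReadingOrders;
  private readonly mailer: ForSendingMail;
  constructor(orders: ForReadingOrders, mailer: ForSendingMail) {
    this.orders = orders;
    this.mailer = mailer;
  }
  execute(userId: number, email: string): number {
    const orders = this.orders.forUser(userId);
    const total = orders.reduce((sum, o) => sum + o.totalCents, 0);
    this.mailer.send(email, "Your orders", `${orders.length} orders, ${total} cents`);
    return total;
  }
}

// The test body: setup, act, and the values to assert on, all in one place.
function summaryForUserWithTwoOrders(): { total: number; mailer: RecordingMailer } {
  const orders = new InMemoryOrders();
  orders.add({ id: 1, userId: 7, totalCents: 1500 });
  orders.add({ id: 2, userId: 7, totalCents: 2500 });
  orders.add({ id: 3, userId: 8, totalCents: 9999 }); // someone else's order
  const mailer = new RecordingMailer();

  const total = new SendOrderSummary(orders, mailer).execute(7, "ada@example.test");
  return { total, mailer };
}
```

The assertions then read straight off the return value and the fake: the total is 4000, and mailer.sent holds exactly one message whose body is "2 orders, 4000 cents". Nothing in the test knows which database or mail transport production uses.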

Why it dominates the suite. Because almost every interesting question about the application can be expressed in this shape: given this state, when this happens, then what is the new state and what was sent outbound. No database is required to answer that. No HTTP stack is required to answer that. The only reason this tier was historically small is that we had not yet inverted the dependencies. Once you do, the tier swells to dominate, which is what you want.

Wall-clock cost: milliseconds per test, seconds for thousands of them, running across cores in a single process.

Layer 2 — Backend "adapter" tier

The first ring out. Small. Specific.

What it tests. One production adapter per test, verifying that the adapter delegates to its single external boundary correctly. The Eloquent repository writes the row with the right columns. The Guzzle-backed HTTP client sends the right JSON to the right URL with the right auth. The mailer adapter passes the correct mailable to the framework. The observer adapter fires the right port call when a model event happens.

This is the only place "mocks", in the framework-fake sense, belong. Laravel's Mail::fake(), Storage::fake(), Queue::fake(), Bus::fake(), Event::fake(). Guzzle's MockHandler. SQLite :memory: for persistence adapters. They are mocks against the framework, used at the adapter seam, to verify the adapter is constructing the right framework call.

What it does not test. Domain logic. Controller orchestration. Business rules. Cross-component workflows. If a test in this tier uses actingAs or getJson or postJson, it has crossed the line. It is no longer testing the adapter; it is testing the application through the adapter. That is a use-case test, not an adapter test.

(There is one documented exception in my codebase: the adapter that wraps the framework's auth subsystem uses actingAs to set up sessions, because actingAs is literally the public API of the thing it adapts. The exception proves the rule. It is the only one.)

Environment. Per-process. Each test process gets its own SQLite :memory: database, its own Laravel facade state, its own MockHandler. No real network. No shared MySQL. The tier opens no real external connections.

Data. Per test, narrow. An adapter test for a repository creates the schema it needs, inserts a row, calls the adapter's find, asserts on the returned object. No factories shared across tests, no characterization fixtures. The adapter test is verifying a single seam: the data should be the absolute minimum that demonstrates the seam works.
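As a rough illustration of an adapter test for an HTTP-backed adapter, here is a sketch in TypeScript. The recording stub plays the role Guzzle's MockHandler plays in a PHP suite. Everything here is an assumption made for the example: the port name ForChargingCards, the adapter HttpCardCharger, the HttpTransport shape, the URL, and the payload fields.

```typescript
// Port owned by the application (hypothetical name).
interface ForChargingCards {
  charge(customerId: string, amountCents: number): Promise<string>;
}

// Minimal shape of the transport the adapter delegates to. In production this
// would be the real HTTP client; in the adapter test it is a recording stub.
interface HttpRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}
type HttpTransport = (request: HttpRequest) => Promise<{ status: number; body: string }>;

// Production adapter under test: its only job is to build the right request
// and translate the response. No business rules live here.
class HttpCardCharger implements ForChargingCards {
  private readonly transport: HttpTransport;
  private readonly baseUrl: string;
  private readonly apiKey: string;

  constructor(transport: HttpTransport, baseUrl: string, apiKey: string) {
    this.transport = transport;
    this.baseUrl = baseUrl;
    this.apiKey = apiKey;
  }

  async charge(customerId: string, amountCents: number): Promise<string> {
    const response = await this.transport({
      url: `${this.baseUrl}/charges`,
      method: "POST",
      headers: {
        Authorization: `Bearer ${this.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ customer: customerId, amount: amountCents }),
    });
    if (response.status !== 201) {
      throw new Error(`charge failed with status ${response.status}`);
    }
    return (JSON.parse(response.body) as { id: string }).id;
  }
}

// The adapter test: one seam, minimum data, no domain logic.
async function chargeThroughRecordingStub(): Promise<{ seen: HttpRequest[]; chargeId: string }> {
  const seen: HttpRequest[] = [];
  const stub: HttpTransport = async (request) => {
    seen.push(request); // record what the adapter tried to send
    return { status: 201, body: JSON.stringify({ id: "ch_1" }) };
  };
  const adapter = new HttpCardCharger(stub, "https://payments.example.test", "secret");
  const chargeId = await adapter.charge("cus_9", 1200);
  return { seen, chargeId };
}
```

The assertions are about the seam and nothing else: one request was recorded, it went to the /charges URL as a POST, it carried the bearer token, and the JSON body has the expected two fields.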

Why it stays small. Because there is exactly one production adapter per port. If you have ten ports, you have ten adapters, and roughly ten adapter test files. The tier scales with infrastructure surface area, not with feature count. Adding a new feature does not add adapter tests; adding a new integration does.

Wall-clock cost: a couple of minutes for a few hundred tests, running sequentially in a single process today, easily parallelizable when you want to push it lower.

Layer 3 — Frontend (Vitest)

The next ring out. The frontend has its own onion-center, parallel to the backend's. This is the layer that catches concerns the backend physically cannot see: the rendered output, the interactive behavior, the conditional UI.

What it tests. React components. Pages. Hooks. Pure utility modules. Anything that lives entirely in the browser's mental model.

What it does not test. The API contract. That is the backend's job, verified by the use-case-tier controller tests, which assert on the JSON the SPA actually receives. The frontend tests trust that contract is correct. They mock the API client module at the system boundary and feed it whatever response shape they need.

Environment. jsdom in a Node process. No browser. No real network. The router is real, wrapped in a MemoryRouter. The component tree is real: never mock child components, never mock UI primitives, never mock the router itself. The only thing you mock is the API client module: the boundary between "your code" and "the network."

Data. Per test. A test that wants to verify the empty state of a list renders correctly does mockResolvedValue([]) and asserts on the rendered text. A test that wants to verify error handling does mockRejectedValue(new Error(...)) and asserts the error UI shows. The fixture is small, declarative, and lives in the test file.
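To show the shape without pulling in Vitest or React, here is a dependency-free sketch. The function renderOrderList stands in for a component and simply resolves to the text a user would see; resolvesWith and rejectsWith stand in for what mockResolvedValue and mockRejectedValue do on a Vitest mock. All of these names, the OrderDto shape, and the UI strings are hypothetical.

```typescript
// The contracted response shape (hypothetical), as the backend tests pin it.
interface OrderDto {
  id: number;
  label: string;
}

// The API client boundary: the only thing the test replaces.
type FetchOrders = () => Promise<OrderDto[]>;

// Stand-in for a component: resolves to the text a user would see.
async function renderOrderList(fetchOrders: FetchOrders): Promise<string> {
  try {
    const orders = await fetchOrders();
    if (orders.length === 0) return "You have no orders yet.";
    return orders.map((o) => `#${o.id} ${o.label}`).join("\n");
  } catch {
    return "Could not load orders. Try again.";
  }
}

// Stubs for the API client, one that resolves and one that rejects.
const resolvesWith = (orders: OrderDto[]): FetchOrders => () => Promise.resolve(orders);
const rejectsWith = (error: Error): FetchOrders => () => Promise.reject(error);

// Three small fixtures, three UI states.
async function renderAllStates(): Promise<string[]> {
  return [
    await renderOrderList(resolvesWith([])),
    await renderOrderList(resolvesWith([{ id: 7, label: "Desk lamp" }])),
    await renderOrderList(rejectsWith(new Error("500"))),
  ];
}
```

Each fixture is one line and sits next to the assertion it supports. None of them says anything about how the backend produced the data.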

The key discipline: a frontend test is not the place to discover that the API endpoint forgot to return a field. The use-case-tier controller test catches that, because it asserts on the actual JSON. The frontend test merely asserts that given a fixture in the contracted shape, the UI does the right thing. Two layers, two concerns, no overlap.

Wall-clock cost: tens of seconds for a thousand tests, running across cores.

An additional piece we have in place is a tool that generates TypeScript types from the API responses and compares them to our own data types for the API. If they differ, the pipeline fails. This prevents drift between the backend and frontend contracts.
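I will not reproduce the tool itself here, but one way such a comparison can work is a compile-time equality assertion. The sketch below assumes a generator has already emitted GeneratedOrderResponse, and that OrderDto is the type the frontend is written against. Both names and their fields are made up for the example.

```typescript
// Assumed output of a generator that derives types from real API responses.
interface GeneratedOrderResponse {
  id: number;
  label: string;
  total_cents: number;
}

// The type the frontend code is written against.
interface OrderDto {
  id: number;
  label: string;
  total_cents: number;
}

// Resolves to `true` only when A and B are the same type.
type Equal<A, B> =
  (<T>() => T extends A ? 1 : 2) extends (<T>() => T extends B ? 1 : 2) ? true : false;

// If the two types drift apart, Equal<...> becomes `false`, the assignment of
// `true` no longer compiles, and a type-check step in the pipeline fails.
const orderContractHolds: Equal<GeneratedOrderResponse, OrderDto> = true;
```

With this in place, a plain `tsc --noEmit` step in CI is enough to turn contract drift into a red build.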

Layer 4 — End-to-end (Playwright)

The outermost layer. The thinnest. The most expensive.

What it tests. That the layers, when assembled into a running system, actually compose. That the smoke path of a real user (log in, see their dashboard, click the thing, see the result) works end to end against the real Docker stack.

What it does not test. Edge cases. Validation errors. Empty states. Conditional UI. Per-role permission matrices. Every business rule. Those have already been verified at lower layers. E2E re-verifying them is the most expensive way to learn something you already knew.

Treat this tier as a smoke tier, not a regression tier. If you reach for E2E to catch a regression, you are admitting that one of the inner layers is missing coverage. Add it there instead.

Environment. The full Docker stack. Real MySQL. Real Redis. Real services where sandbox accounts exist. A separate test database that survives between specs. Pre-seeded users created once. OTP bypass for auth so tests can log in without a real inbox.

Data. Long-lived, shared. The test seeder runs once at the start of the suite, creates the cast of users (one for each role), and the tests reach into that population to do their smoke checks. Specs do not recreate state from scratch the way the inner layers do. The smoke check is the assembly, not the unit.

This is the layer where I would mock nothing. The whole point is "does it actually work when the real pieces are wired up." If you mock at this layer, you have built an integration test that lies.

Wall-clock cost: minutes. Tens of specs, not thousands. Run on merge, on deploy, on demand. Run on every push only if there is a real business need for it.

How the layers compose to build confidence

Here is the part most teams miss, and it is the part that makes the whole thing pay off.

Each layer's confidence rests on the inner layers having earned theirs. The chain looks like this:

The adapter tests prove the adapters are correct. When EloquentOrders::find is called with id 7, the adapter returns the row with id 7. When LaravelMailer::send is called with a WelcomeMail, the framework's Mail facade was handed a WelcomeMail.
The parallel use-case tests prove the application reasons correctly given that the adapters are correct. When CreateOrderAction::execute is called with a valid request, it stores an order through ForStoringOrders, sends a notification through ForSendingOrderNotifications, and returns a CreateOrderResult. The test does not care which adapter is wired; it cares that the application called the right port with the right arguments and built the right result.
The frontend tests prove the UI reasons correctly given that the API contract is honored. When the dashboard fetcher returns the contracted shape, the page renders the right widgets, the right labels, the right interactions. The test does not care that the controller produced that JSON; the use-case test already proved the controller produces that JSON.
The E2E tests prove the assembly works. Real Docker, real MySQL, real HTTP: does the user actually get to their dashboard? If the answer is yes, you trust the entire stack. If the answer is no, the bug is almost certainly in the assembly (wiring, infrastructure, config), not in any individual layer's logic, because the lower layers proved those.

The pyramid is doing real work at every level. None of the work is duplicated. Each layer is the cheapest place to ask the question that layer asks.

What this buys you

Three things, in order of importance.

The feedback loop is fast where it matters most. The inner layers run in seconds. You can run them locally on save. You can run them in CI on every push. The cost of asking "did I break it" is approximately zero, so you ask all the time, so you catch breakage when it is one commit old instead of fifty.

The expensive layers stay rare. E2E should be a handful of specs. Adapter tests should be one per adapter. When the outer layers stay narrow, the wall-clock cost of the suite is bounded, and CI does not become the team's bottleneck.

Refactors stay safe. Because each layer tests its own concern through its own interface, you can rewrite the inside of a use case without breaking the controller test. You can swap an Eloquent adapter for a SQL-direct one without touching a use-case test. You can redesign the dashboard component tree without rewriting a single API test. The seams are real, so the tests stay decoupled from the implementation.

The cost of getting it wrong

I have seen teams build pyramids where every layer tests everything. Every controller has a use-case test, a feature test against a real DB, and an E2E spec. I have even seen some pyramids that were outright inverted. This is the most expensive way to have the least amount of coverage. Every change cascades through multiple test files. CI is twenty minutes. Refactors are dreaded.

The teams that build pyramids where every layer tests only its concern ship faster, refactor more freely, and trust their suite more. The two pyramids look the same on paper. They diverge dramatically in practice.

The discipline is simple. Before adding a test, ask: what is the smallest layer that can answer this question? Put the test there. Trust the inner layers to handle theirs. Do not re-verify at the outer layer what the inner layer already proved.

The onion is doing the work. Let it.

Beyond the pyramid: other tests worth running

The four layers above are the spine. They are what every application needs. But there are other forms of testing that are valuable, sometimes essential, and the pyramid does not capture them, because they answer different questions than "is the code correct."

Not every team needs every one of these. A small internal tool with three users does not need load testing. A static marketing site does not need penetration testing. But the moment a product takes on certain attributes (high traffic, sensitive data, regulated industries, long-lived investment) these tests stop being optional. A team shipping a public product that needs to stay up under load benefits enormously from performance testing. A team holding user data cannot responsibly ship without penetration testing. Pick the ones that match your product's actual risks.

Performance testing

What it is. Driving the system under realistic or extreme load and measuring its behavior: latency, throughput, error rates, resource consumption, etc. Variants include load testing (expected traffic), stress testing (beyond capacity), and soak testing (sustained load over hours).

Where it sits. Alongside Layer 4, or in its own tier above it. Runs against a full environment.

Target. API endpoints, queue throughput, database query plans, page render time, and anywhere else where a number matters more than correctness.

Who needs it. Teams shipping products that have to stay up under traffic: public-facing apps, payment systems, anything with a real SLO.

Penetration testing

What it is. Adversarial probing of the running system to find security vulnerabilities, such as injection, auth bypasses, IDOR, secrets exposure, and dependency CVEs. Done by humans, often augmented with automated scanners (OWASP ZAP, Burp).

Where it sits. Outside the pyramid. It targets the deployed system, not the code.

Target. Authentication, authorization, input validation, session handling, transport, dependency surface, and anywhere else data crosses a trust boundary.

Who needs it. Anyone storing user data, anyone under a compliance regime (SOC 2, HIPAA, PCI), anyone whose business survival depends on not getting breached. Most non-trivial products.

Mutation testing

What it is. A tool deliberately introduces small changes (flipping a > to a >=, removing a return, negating a boolean) and runs your test suite against each one. If a mutation does not cause a test failure, the suite has a gap. The mutation score is the percentage of mutations the suite catches.
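A toy version makes the idea concrete. In the sketch below the mutants are written by hand, where a real mutation tool would generate them, and the rule, the threshold, and both suites are invented for the example.

```typescript
// Code under test, plus two hand-written mutants of it.
type Rule = (cartTotalCents: number) => boolean;
const qualifiesForFreeShipping: Rule = (total) => total > 5000;
const mutantGreaterOrEqual: Rule = (total) => total >= 5000; // `>` flipped to `>=`
const mutantNegated: Rule = (total) => !(total > 5000); // boolean negated

// A suite is a function that returns true when every assertion passes.
type Suite = (rule: Rule) => boolean;

// Executes the rule on both branches, but never near the boundary.
const weakSuite: Suite = (rule) => rule(9000) === true && rule(100) === false;

// Same checks plus the boundary values.
const boundarySuite: Suite = (rule) =>
  weakSuite(rule) && rule(5000) === false && rule(5001) === true;

// Mutation score: percentage of mutants that make the suite fail.
function mutationScore(suite: Suite, mutants: Rule[]): number {
  const killed = mutants.filter((mutant) => !suite(mutant)).length;
  return (killed / mutants.length) * 100;
}

const mutants: Rule[] = [mutantGreaterOrEqual, mutantNegated];
const weakScore = mutationScore(weakSuite, mutants); // 50
const boundaryScore = mutationScore(boundarySuite, mutants); // 100
```

Both suites pass against the original rule and both execute its only line, so line coverage cannot tell them apart. The surviving `>=` mutant is what exposes the missing boundary check in the weak suite.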

Where it sits. Inside Layer 1, as a quality check on the test suite itself.

Target. Your tests. Not your code.

Who needs it. Teams with mature suites who want to know whether their high coverage is real coverage. It is the answer to "100% line coverage, but does it actually test anything?"

Contract testing

What it is. Verifying that two services agree on the shape of their interaction without spinning both up. The consumer records request and expected response shapes; the provider verifies it can honor the contract. Tools: Pact, Spring Cloud Contract.
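The mechanics are easier to see in miniature. This sketch is not Pact's API or file format; it is a deliberately tiny, hypothetical contract shape with a hand-rolled verifier, and the endpoint and fields are made up.

```typescript
// A contract: one interaction the consumer depends on.
type FieldType = "string" | "number" | "boolean";
interface Contract {
  request: { method: string; path: string };
  response: { status: number; body: Record<string, FieldType> };
}

// Written on the consumer side, listing only the fields the consumer reads.
const ordersContract: Contract = {
  request: { method: "GET", path: "/orders/7" },
  response: { status: 200, body: { id: "number", label: "string" } },
};

// Provider side: the provider's request handler, called in-process.
type Handler = (
  method: string,
  path: string
) => { status: number; body: Record<string, unknown> };

// Returns a list of violations. An empty list means the contract is honored.
function verifyContract(contract: Contract, handler: Handler): string[] {
  const actual = handler(contract.request.method, contract.request.path);
  const problems: string[] = [];
  if (actual.status !== contract.response.status) {
    problems.push(`status ${actual.status}, expected ${contract.response.status}`);
  }
  for (const field of Object.keys(contract.response.body)) {
    const expectedType = contract.response.body[field];
    if (typeof actual.body[field] !== expectedType) {
      problems.push(`field "${field}" should be ${expectedType}`);
    }
  }
  return problems;
}

// Extra fields are fine: the consumer never reads them.
const honoringProvider: Handler = () => ({
  status: 200,
  body: { id: 7, label: "Desk lamp", extra: true },
});

// The id silently became a string: the kind of drift a contract test catches.
const driftedProvider: Handler = () => ({
  status: 200,
  body: { id: "7", label: "Desk lamp" },
});
```

Run against honoringProvider, verifyContract returns an empty list. Run against driftedProvider, it reports that "id" should be a number. Neither check needed the consumer and the provider running at the same time.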

Where it sits. A peer to Layer 2 — same level of cost, different concern. Adapter tests verify your code talks correctly to a library; contract tests verify two services agree.

Target. The interaction surface between services owned by different teams or codebases.

Who needs it. Anyone in a microservices or multi-repo environment where a shared type system no longer keeps producer and consumer aligned.

Accessibility testing

What it is. Verifying the UI is usable by people relying on assistive technology, such as screen readers, keyboard-only navigation, high contrast, and reduced motion. Automated tools (axe-core, Lighthouse) catch a meaningful slice; manual testing catches the rest.

Where it sits. Inside Layer 3 for the automated portion, alongside Layer 4 for the manual portion.

Target. Semantic markup, ARIA attributes, color contrast, focus order, keyboard reachability.

Who needs it. Anyone shipping a UI to the public, anyone with regulatory exposure (ADA, EAA), anyone who cares that their product is usable by everyone.

Visual regression testing

What it is. Capturing screenshots of UI in a baseline state and comparing against subsequent runs. Differences flag a human for review. Tools: Percy, Chromatic, Playwright snapshots.

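Where it sits. Alongside Layer 3 or Layer 4, depending on whether you snapshot isolated components or full pages of the running app. Either way the capture needs a real browser engine: jsdom does not lay out or paint pixels, so it cannot produce a screenshot.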

Target. Visual output. Layout regressions. Theming. Design-system drift.

Who needs it. Teams with a design system, marketing surfaces, or any product where visual polish is a feature.

Fuzz and property-based testing

What it is. Generating random or systematically varied inputs and asserting invariants hold. Property-based testing (Hypothesis, QuickCheck, fast-check) is the structured version; fuzzing (libFuzzer, AFL) is the brute-force version.
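To show the structured version without a library, here is a hand-rolled sketch. It is far cruder than fast-check or Hypothesis (no shrinking, one generator), and the run-length encoder, the generator, and the helper names are all invented for the example.

```typescript
// Small deterministic generator (a linear congruential generator) so that any
// failure can be reproduced from its seed.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // in [0, 1)
  };
}

// Code under test: run-length encoding and decoding.
function encode(input: string): Array<[string, number]> {
  const runs: Array<[string, number]> = [];
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    const last = runs[runs.length - 1];
    if (last !== undefined && last[0] === ch) last[1] += 1;
    else runs.push([ch, 1]);
  }
  return runs;
}

function decode(runs: Array<[string, number]>): string {
  return runs.map(([ch, count]) => ch.repeat(count)).join("");
}

// Input generator: short strings over a tiny alphabet, so repeats are common.
function randomString(random: () => number): string {
  const length = Math.floor(random() * 20);
  let out = "";
  for (let i = 0; i < length; i++) out += "abc".charAt(Math.floor(random() * 3));
  return out;
}

// Returns the first counterexample found, or null if the property held for
// every generated input.
function findCounterexample(
  property: (s: string) => boolean,
  attempts: number,
  seed: number
): string | null {
  const random = makeRandom(seed);
  for (let i = 0; i < attempts; i++) {
    const input = randomString(random);
    if (!property(input)) return input;
  }
  return null;
}

// Invariants: easier to state than to enumerate cases for.
const roundTrips = (s: string): boolean => decode(encode(s)) === s;
const neverRepeatsAdjacentRuns = (s: string): boolean => {
  const runs = encode(s);
  return runs.every((run, i) => i === 0 || runs[i - 1][0] !== run[0]);
};
```

Calling findCounterexample(roundTrips, 500, 42) checks the round-trip invariant against 500 generated strings and returns null when all of them hold. A non-null result is the failing input, ready to paste into a regular example-based test.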

Where it sits. Inside Layer 1. Same speed, same locality, more inputs.

Target. Parsers, serializers, anything taking untrusted input, anything whose edge cases are infinite and unenumerated.

Who needs it. Anyone parsing complex input formats, anyone exposed to malicious input, anyone whose code has invariants that are easier to express than to enumerate.

Chaos engineering

What it is. Deliberately injecting failures into a running system — killing pods, dropping packets, throttling the network, corrupting disks — and observing whether the system degrades gracefully. Tools: Chaos Monkey, Gremlin, LitmusChaos.

Where it sits. Outside the pyramid, in production or a production-like environment.

Target. The system's failure modes, not its happy path.

Who needs it. Teams running distributed systems where partial failure is the normal state and recovery has to be automatic.

None of these replace the pyramid. They wrap around it for the concerns the pyramid does not cover: performance, security, suite quality, cross-team contracts, accessibility, visual polish, edge-case input, resilience. The same discipline applies: each kind of test answers a specific question, and the cost is justified only when the question is one you actually need answered.

Top comments (3)

Gilder Miller • May 28

I like the distinction you make between coverage and confidence. A lot of teams end up building overlapping test suites instead of layered verification, and the result is slow CI pipelines, brittle refactors, and duplicated assertions across multiple layers.

The strongest point here is that each layer should validate only the concern unique to that layer. That boundary discipline is what makes the pyramid actually work in practice.

I also agree with your emphasis on fast inner-loop testing. Once business logic is isolated from infrastructure, meaningful tests become fast enough to run constantly, which changes how confidently teams can iterate and refactor.

Darya Belaya • May 29

The "smallest layer that can answer this question" framing is the right test design heuristic.
A concrete case where it forced the right answer: I needed to verify that when a booking POST returns 500, the error appears in the correct UI element - not just that some error state exists. An API test sees the response, not the rendering. An E2E test would work but runs a full Docker stack to test UI routing logic.
The answer was a frontend test with a mocked route - intercept the POST, return 500, assert on which element received the message. The question was about UI error routing, not backend behavior. That made the layer obvious.

Ian Johnson • May 29

Great concrete example! 💯