"🧪 Good Enough Tests Died When AI Started Writing Them"

#testing #java #testcontainers #springboot

For years, integration tests got a quiet pass. "Good enough, not perfect" was an acceptable answer, because writing them by hand was expensive and somebody had to ship the feature. That excuse just expired. AI writes these tests now, the authoring cost collapsed, and the bar moves from good enough to exact.

Before the how, the where.

This is about the middle of the pyramid

Three layers, and this post is only about one of them.

Unit tests sit at the bottom. Fast, pure, no infrastructure. Not the topic here.
Integration tests sit in the middle. Your code against real infrastructure and mocked externals. This is the layer everyone gets wrong.
End-to-end tests sit at the top. A thin layer that hits the real vendor sandbox and proves the whole thing actually works.

Almost nobody does the middle layer properly. Teams fall into one of two traps. They mock everything, so the suite stays green while production breaks, because the mocks drifted and the tests only proved their own assumptions. Or they skip the middle entirely and lean on a handful of e2e tests to somehow cover the gap. A disciplined integration layer is rare.

The rule: split by ownership

The fix is not "mock or real". It is one question asked per dependency: do I own this?

Ownership decides fidelity. If you own it, run it for real. If you do not, mock it at the wire. That single cut resolves almost every "should I mock this" argument.
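
As a rough sketch, the whole rule fits in one method. The names here (Dependency, Fidelity, payments-api) are made up for illustration and do not come from any library:

```java
import java.util.List;

// Hypothetical model of the ownership rule; nothing here is a real library type.
enum Fidelity { REAL_AT_PROD_VERSION, WIRE_MOCK }

record Dependency(String name, boolean ownedByUs) {
    // Ownership decides fidelity: owned -> run it for real, not owned -> mock at the wire.
    Fidelity fidelity() {
        return ownedByUs ? Fidelity.REAL_AT_PROD_VERSION : Fidelity.WIRE_MOCK;
    }
}

class OwnershipRule {
    public static void main(String[] args) {
        List<Dependency> deps = List.of(
            new Dependency("postgres", true),
            new Dependency("kafka", true),
            new Dependency("payments-api", false)); // assumed third-party API
        for (Dependency d : deps) {
            System.out.println(d.name() + " -> " + d.fidelity());
        }
    }
}
```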

Infra you own: run it for real, at the prod version

Your database, your message broker, your cache, your sibling services. Run them in containers that match production. Not H2 standing in for Postgres. Not an embedded broker. The real engine.

Two ways to get there. Testcontainers spins up real containers from inside the test:

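import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.kafka.KafkaContainer; // the apache/kafka variant, not the Confluent one

// fields of a test class annotated with @Testcontainers
@Container
static PostgreSQLContainer<?> db =
    new PostgreSQLContainer<>("postgres:16.3");   // same version as production

@Container
static KafkaContainer kafka =
    new KafkaContainer("apache/kafka:3.8.0");     // KRaft image, no ZooKeeper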

Or a docker-compose file that brings the whole stack up once and lets the suite run against it:

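services:
  postgres:
    image: postgres:16.3          # match prod exactly
    environment:
      POSTGRES_PASSWORD: test     # assumed value; the image needs a password to start
  kafka:
    image: apache/kafka:3.8.0     # KRaft, no ZooKeeper
    environment:
      KAFKA_PROCESS_ROLES: broker,controller
      # fragment: a complete KRaft config also sets node id, listeners, and quorum voters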

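Both are valid. Testcontainers ties container lifecycle to the tests themselves (a static @Container field starts once per test class, an instance field once per test method) and keeps state from leaking between runs. Compose gives you one warm stack and faster local loops. Pick per project.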

The part people skip is the version. Matching the engine is not enough; you have to match the version too. A typical case: production runs Kafka in KRaft mode, while the test suite still runs Kafka plus ZooKeeper from a template somebody copied in 2021. That drift used to be tolerable. It is not anymore, because the behavior, the configs, and the failure modes all differ. Pin the test image to what production runs, and bump it when production bumps.
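
One way to keep the pin honest is a small guard that compares test image tags with what production runs. This is a minimal sketch, not a real tool: ImageDriftCheck is a hypothetical name, and the two maps are hard-coded here where a real project would read them from deployment and test config.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical drift guard: compares image tags pinned in tests with what production runs.
class ImageDriftCheck {

    // Returns one message per service whose test image differs from prod or is missing.
    static List<String> drift(Map<String, String> prodImages, Map<String, String> testImages) {
        List<String> problems = new ArrayList<>();
        // Sort by service name so the report order is stable.
        Map<String, String> sortedProd = new TreeMap<>(prodImages);
        for (Map.Entry<String, String> prod : sortedProd.entrySet()) {
            String testImage = testImages.get(prod.getKey());
            if (!prod.getValue().equals(testImage)) {
                problems.add(prod.getKey() + ": prod=" + prod.getValue() + " test=" + testImage);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> prod = Map.of(
            "postgres", "postgres:16.3",
            "kafka", "apache/kafka:3.8.0");
        Map<String, String> test = Map.of(
            "postgres", "postgres:16.3",
            "kafka", "confluentinc/cp-kafka:7.0.1"); // the stale template case
        drift(prod, test).forEach(System.out::println);
    }
}
```

Run as a build step, a non-empty result can fail CI the day production upgrades and the tests do not.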

The third-party API you do not own: WireMock the wire

For the external HTTP API you do not control, mock it at the wire. WireMock is the tool (MockServer works too). It stands up a fake HTTP server, so you stub the response and verify the call:

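import static com.github.tomakehurst.wiremock.client.WireMock.*;

// stub: what the fake server returns for POST /charges
stubFor(post("/charges")
    .willReturn(okJson("{ \"id\": \"ch_1\" }")));

// your real client runs against the fake server

// verify: the request went out, and its body carried an amount field
verify(postRequestedFor(urlEqualTo("/charges"))
    .withRequestBody(matchingJsonPath("$.amount")));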

The critical word is wire. Stub the wire, not your Java client. The moment you mock your own client class, you mock away serialization, the retry policy, timeouts, and error mapping, which is exactly the code most likely to carry the bug. WireMock leaves all of it running and only fakes the server on the other end. Under the AI-era bar, every outbound call gets both a stub and a verify, so a silently dropped or malformed request fails the test instead of slipping through.
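
To make "the wire" concrete, here is the same idea with nothing but the JDK. This is a stand-in for WireMock, not how WireMock works internally. FakeChargesServer and createCharge are hypothetical names, and the request body format is assumed. The point is that the client code does real HTTP and real serialization, and only the server is fake.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// JDK-only fake server: returns a canned response and records every request body it receives.
class FakeChargesServer implements AutoCloseable {
    final List<String> receivedBodies = new CopyOnWriteArrayList<>();
    private final HttpServer server;

    FakeChargesServer() throws IOException {
        // Port 0 asks the OS for any free port.
        server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/charges", exchange -> {
            // Record what actually crossed the wire so the test can verify it afterwards.
            receivedBodies.add(new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8));
            byte[] body = "{ \"id\": \"ch_1\" }".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }

    String baseUrl() {
        return "http://localhost:" + server.getAddress().getPort();
    }

    @Override
    public void close() {
        server.stop(0);
    }
}

class WireStubDemo {
    // Stands in for "your real client": it builds and sends a real HTTP request.
    static String createCharge(String baseUrl, int amount) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder().version(HttpClient.Version.HTTP_1_1).build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/charges"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"amount\":" + amount + "}"))
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        try (FakeChargesServer fake = new FakeChargesServer()) {
            System.out.println(createCharge(fake.baseUrl(), 500)); // the stubbed response
            System.out.println(fake.receivedBodies);               // what the verify step inspects
        }
    }
}
```

If the client dropped the amount field or never sent the request, receivedBodies would show it. A mock of the client class would not.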

The safety net: a wire mock can lie

A stub is a snapshot. The day the vendor changes their contract, your mock keeps returning the old shape and your green suite is now fiction. This is the real cost of mocking, and pretending otherwise is dishonest.

That is precisely the job of the thin e2e-against-sandbox layer at the top of the pyramid. It runs against the vendor's real sandbox, out of band from the main suite, and catches the drift the mock cannot. Back it with recorded real responses or contract tests so the stubs stay honest.
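
A recorded response can be checked against the stub mechanically. The sketch below is deliberately crude and assumes simple payloads: it compares field names using a regex, where a real setup would use a JSON parser or a contract-testing tool. StubShapeCheck and the status field are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical drift check between a stub body and a response recorded from the vendor sandbox.
class StubShapeCheck {
    // A quoted name followed by a colon. Good enough for simple payloads in a sketch.
    private static final Pattern FIELD = Pattern.compile("\"([^\"\\\\]+)\"\\s*:");

    // Collects every field name that appears in the JSON text, at any nesting level.
    static Set<String> fieldNames(String json) {
        Set<String> names = new TreeSet<>();
        Matcher m = FIELD.matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Field names present in exactly one of the two bodies. Empty means the shapes agree.
    static Set<String> shapeDrift(String stubJson, String recordedJson) {
        Set<String> stub = fieldNames(stubJson);
        Set<String> recorded = fieldNames(recordedJson);
        Set<String> common = new TreeSet<>(stub);
        common.retainAll(recorded);
        Set<String> drift = new TreeSet<>(stub);
        drift.addAll(recorded);
        drift.removeAll(common);
        return drift;
    }

    public static void main(String[] args) {
        String stub = "{ \"id\": \"ch_1\" }";
        // Illustrative recording: the vendor started returning an extra field.
        String recorded = "{ \"id\": \"ch_1\", \"status\": \"pending\" }";
        System.out.println(shapeDrift(stub, recorded));
    }
}
```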

The honest trade-off

AI collapsed the cost of writing these tests. It did not collapse the cost of running them.

You still pay for Docker in CI and for the startup seconds each container costs, which you can mitigate with container reuse, singleton containers, or a compose stack that comes up once. You also owe version upkeep every time production upgrades, because a pinned image that drifts behind prod is its own quiet lie. And the wire mocks still rot, so the e2e-on-sandbox layer is not optional.

The bar is higher because the labor is finally cheap. The runtime and maintenance bill is still real, and you should budget for it instead of pretending the AI made testing free.

Now that AI writes the tests, is "good enough" coverage still an acceptable answer? And has a lying mock or a version mismatch ever shipped a bug past your integration suite?