Muggle AI

Cursor wrote 14 tests for my feature. Here's what it couldn't see.

Updated April 2026

Last week I let Cursor generate the test suite for a checkout feature I'd just shipped. It wrote 14 tests in about 30 seconds. I was genuinely impressed.

Then I read them.

Twelve of the 14 covered the happy path: some variation of "user has items, user checks out, order is created," thorough coverage of everything that would have worked anyway. Two caught real regressions I didn't know about: one found that the "back" button after payment threw a JavaScript error; the other found that discount codes broke silently with two items in the cart.

Those two were worth having.
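
For a sense of what the valuable ones looked like, here's roughly the shape of the discount regression test. This is a reconstruction, not the generated code; the module path, function names, and prices are hypothetical stand-ins:

```typescript
import { describe, it, expect } from "vitest";
// Hypothetical module -- a stand-in for the real checkout code.
import { createCart, applyDiscount } from "./checkout";

describe("discount codes", () => {
  // The regression: with exactly two items in the cart, the discount
  // silently applied to neither. One item worked; two didn't.
  it("applies the code when the cart holds two items", () => {
    const cart = createCart([
      { sku: "shirt-01", qty: 1, price: 2500 },
      { sku: "mug-02", qty: 1, price: 1200 },
    ]);

    const discounted = applyDiscount(cart, "SAVE10"); // 10% off
    expect(discounted.total).toBe(3330); // 3700 minus 10%
  });
});
```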

What Cursor was actually doing

The tests it wrote came from reading the code. It looked at the functions, inferred expected inputs and outputs, and wrote assertions that matched the implementation. That's exactly what it should do given what it can see.
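
Most of the other twelve had this shape instead (same hypothetical names as the sketch above):

```typescript
import { describe, it, expect } from "vitest";
// Same hypothetical checkout module as before.
import { createCart, checkout } from "./checkout";

describe("checkout", () => {
  it("creates an order when the cart has items", async () => {
    // The cart is built exactly the way the implementation builds it...
    const cart = createCart([{ sku: "shirt-01", qty: 1, price: 2500 }]);
    const order = await checkout(cart, { paymentToken: "tok_test" });

    // ...and the assertions restate what the implementation already does.
    expect(order.status).toBe("created");
    expect(order.total).toBe(2500);
  });
});
```

Every value in that test is derivable from the source, which is why it can only confirm the code against itself.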

What it can't see is the user who opens your checkout on an old iPhone, tries to apply a coupon code from an email six months ago, and abandons when the "apply" button stops responding. That flow isn't in the code anywhere. It's in the gap between what you built and how people actually use it.

In infrastructure projects, there are two separate jobs: approving the engineering plans, and inspecting the physical work against those plans. Different people, different timelines. The inspector checks the work against the plans — accurately, consistently, fast. But someone else decides whether the plans account for all the exits. That's the architect's job, done before the inspector shows up. AI test generation is the inspector. Someone still has to write the plans.

Two problems, one confused solution

There are two distinct problems in test coverage:

Authoring: writing tests for flows you've already identified. Tedious, time-consuming, mechanical once you know what to test.

Discovery: figuring out which user flows exist and which deserve a test. Not mechanical. Requires watching real users, reading complaints, or tracing production errors — something that reasons from the outside in, not from the source out.
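
To make the contrast concrete, here's the stale-coupon flow from the intro written as an end-to-end test. It's a sketch in Playwright; the route, selectors, and coupon code are all invented. The point is that no tool reading the source would have proposed this scenario, because it came from watching a user fail:

```typescript
import { test, expect } from "@playwright/test";

test("stale coupon on a small screen still gets feedback", async ({ page }) => {
  // Old-iPhone-sized viewport, per the abandoned-checkout flow above.
  await page.setViewportSize({ width: 320, height: 568 });
  await page.goto("/checkout");

  // A coupon code from a months-old email -- long expired.
  await page.getByLabel("Coupon code").fill("SPRING-2025");
  await page.getByRole("button", { name: "Apply" }).click();

  // Whatever the outcome, the UI must respond. The original failure
  // was that the Apply button did nothing at all.
  await expect(page.getByText(/applied|expired|invalid/i)).toBeVisible();
});
```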

AI test generation solves authoring. You identify the flows, it writes the code. That's real progress — I'm not dismissing what I got. The two regressions Cursor caught were real, and they were flows that existed in the source code, readable by any tool that could follow the call chain.

Discovery is the harder problem, and it was the harder problem before AI existed. "Which user journeys should I be testing?" is still answered the same way it always was.

What I changed after this

I still use Cursor to generate test stubs. Faster than writing from scratch, and reliable for the mechanical work of authoring.

What I stopped doing: treating "14 tests passed" as a coverage signal. It tells me the code does what the code says it should do. Not that the feature works for the user who shows up with a flow nobody thought to describe.

For the flows I haven't thought of, I need something that starts from the product itself, not the source. Nothing that reads code can close that gap — the user journeys that matter most often don't exist in any code path at all.

AI test generation got a lot better this year. The discovery problem is still the same problem it was.
