Nazar Boyko

Posted on Jul 2 • Edited on Jul 4

AI For Test Generation: Where It Helps And Where It Lies

#ai #testing #webdev #programming


AI is great at writing tests fast, and good at writing tests that look real but verify the wrong thing. Here's the line between useful scaffolding and confident-sounding test theater, told through unit tests, edge cases, and brittle mocks.

You've been there. You finish a function, paste it into the AI, and ask for tests. Thirty seconds later you have twelve of them. They run. They pass. The coverage badge nudges up by a percent. You feel like you did something useful, and on most days, you did.

Then a bug hits prod, and you scroll back through those twelve tests and realize none of them would have caught it. Some of them couldn't have. A few were testing the implementation rather than the behavior, and the implementation changed in a way that broke the contract but left the tests passing. One of them was mocking the very thing it was supposed to verify. The coverage badge was telling the truth: about coverage. Not about correctness.

This piece is the part of the AI-for-testing story that doesn't fit in the marketing slides. AI is genuinely useful for generating tests. It's also genuinely good at producing tests that look like tests but don't actually verify what you wanted verified. The whole game is knowing which kind you just got.

Where AI Actually Helps

Let's not be cynical. There's a real, durable productivity win in using AI to write tests. It's just narrower than the marketing suggests.

AI is excellent at extrapolating from a clear example. If you've already written one good test that captures the contract, asking the AI to generate ten more along the same axis works almost every time. It picks up your assertion style, your factory functions, your naming convention, and produces a wall of plausible variants in seconds. That's not a small thing. The tenth boundary case is exactly the kind of work an experienced engineer resents writing by hand, and AI eats it for breakfast.

It's also good at the parts of a test file that are mostly typing: setup blocks, teardown, factory helpers, parameterized input tables, mock builders for shapes you've already defined. A senior engineer's hourly rate spent typing const user = { id: 'u_1', email: 'a@b.com', ... } for the fortieth time is one of the easiest cost savings in software, and AI takes it to zero.
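To make that concrete, here's a minimal sketch of the typing-heavy part: a factory helper plus a parameterized input table. The `User` shape, `makeUser`, and `isValidEmail` are hypothetical names invented for this example, and the email check is deliberately simplistic. In a real suite the table would probably feed `it.each` or a similar runner; here it's plain TypeScript so it runs on its own.

```typescript
// Hypothetical shape for illustration; not taken from any real codebase.
type User = { id: string; email: string; isActive: boolean };

// Factory helper: sensible defaults, with per-test overrides.
function makeUser(overrides: Partial<User> = {}): User {
  return { id: 'u_1', email: 'a@b.com', isActive: true, ...overrides };
}

// Hypothetical function under test: a deliberately simple email check.
function isValidEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Parameterized input table: you write the first row or two by hand,
// and the remaining rows are the part worth delegating.
const emailCases: Array<{ user: User; expected: boolean }> = [
  { user: makeUser(), expected: true },
  { user: makeUser({ email: '' }), expected: false },
  { user: makeUser({ email: 'no-at-sign.com' }), expected: false },
  { user: makeUser({ email: 'a@b' }), expected: false },
  { user: makeUser({ email: ' a@b.com' }), expected: false },
];

// Returns the rows whose actual result disagrees with the expected one.
function failingEmailCases(): Array<{ user: User; expected: boolean }> {
  return emailCases.filter((c) => isValidEmail(c.user.email) !== c.expected);
}
```

The `expected` column is still yours to decide. The AI is only saving you the keystrokes.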

And it's a reasonable scaffolder for new code. If you've written a function and you want a starting point (a file with the right imports, the right describe block, three or four skeleton tests with the assertions left as TODO), the AI gets you to the editor faster than starting from a blank file.

Notice what these have in common. In every case, you are still the one supplying the structure that decides what "correct" means. The AI is filling in the body of a contract you wrote. That's the productive mode. Test generation goes off the rails the moment you flip the relationship: the moment you ask the AI to decide what the contract is, not just type it out.

The First Lie: Tests That Verify The Implementation, Not The Behavior

Hand the AI a function and ask it for unit tests. Watch what it does.

It reads the body of your function. It notes the branches. Then it writes one test per branch. The result looks complete: every if, every else, every early return has a corresponding it() block. Coverage hits 100%. Everyone goes home.

The problem is that the AI didn't test what your function is supposed to do. It tested what your function currently does. Those are different artifacts, and the difference is where bugs live.

Consider a discount function. You ship something like this:

src/pricing/discount.ts

export function applyDiscount(amount: number, code: string): number {
  if (code === 'SUMMER25') return amount * 0.75;
  if (code === 'FRIEND10') return amount * 0.9;
  return amount;
}

You ask the AI for unit tests. It gives you:

src/pricing/discount.test.ts

import { applyDiscount } from './discount';

describe('applyDiscount', () => {
  it('applies SUMMER25 as a 25% discount', () => {
    expect(applyDiscount(100, 'SUMMER25')).toBe(75);
  });

  it('applies FRIEND10 as a 10% discount', () => {
    expect(applyDiscount(100, 'FRIEND10')).toBe(90);
  });

  it('returns the original amount for unknown codes', () => {
    expect(applyDiscount(100, 'NOPE')).toBe(100);
  });
});

These tests pass. The function is "tested". If you change the SUMMER25 multiplier from 0.75 to 0.6 by mistake, exactly one of them fails. That tells you the code no longer does what it did yesterday. It never tells you whether the code does what the business wants.

Now look at the contract you actually needed to verify, the one you'd have written if a colleague had asked you to spec it out:

A 25% discount is applied when SUMMER25 is the active code for this customer.
A 10% discount is applied when FRIEND10 is the active code, but only to first-time buyers.
An unknown or expired code returns the original amount and emits a code_unrecognized event.
The discount is never applied twice to the same cart.
The returned amount is never less than zero, even if a 100%-off code is somehow active.

None of those live in the AI's test file. They couldn't. The AI never saw them. It saw a function body and inferred a contract from the shape of the code, which is exactly the wrong direction. The contract should generate the code, not the other way around.

This is the subtle reason "the AI writes the tests for the function the AI just wrote" is such a tempting and useless move. The AI writes the function under some implicit assumption (say, that discounts are always positive) and then writes tests under the same implicit assumption. The hidden premise sits inside both artifacts, agreeing with itself. Production is the first place that premise gets challenged, and at that point the test suite is on the bug's side.

The fix isn't to stop using AI for unit tests. It's to stop letting the AI decide what the test is verifying. Write the contract first, in one sentence per test, in plain language: "unknown codes return the original amount", "expired codes never apply, even if they're typed correctly". Then hand those sentences to the AI and let it fill in the assertions. The AI is allowed to generate the body of the test. It is not allowed to generate the intent of the test.
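Here's a rough sketch of what "sentences first, assertions second" can look like. Everything in it is an assumption made for illustration: the `DiscountInput` shape, the `priceAfterDiscount` function, and the way expiry and first-time-buyer status are passed in. The original `applyDiscount` only takes an amount and a code. The event emission and the "never applied twice" rule are left out to keep it short.

```typescript
// Hypothetical inputs: the original applyDiscount only takes (amount, code).
type DiscountInput = {
  amount: number;
  code: string;
  isFirstTimeBuyer: boolean;
  codeExpired: boolean;
};

// Hypothetical implementation, included only so the contract has something to run against.
function priceAfterDiscount(input: DiscountInput): number {
  if (input.codeExpired) return input.amount;
  if (input.code === 'SUMMER25') return Math.max(0, input.amount * 0.75);
  if (input.code === 'FRIEND10' && input.isFirstTimeBuyer) {
    return Math.max(0, input.amount * 0.9);
  }
  return input.amount;
}

const base: DiscountInput = {
  amount: 100,
  code: 'NOPE',
  isFirstTimeBuyer: false,
  codeExpired: false,
};

// Written by a human first: one plain-language sentence per test.
// The `holds` bodies are the part that is reasonable to delegate.
const contract: Array<{ sentence: string; holds: () => boolean }> = [
  {
    sentence: 'unknown codes return the original amount',
    holds: () => priceAfterDiscount(base) === 100,
  },
  {
    sentence: 'expired codes never apply, even if they are typed correctly',
    holds: () =>
      priceAfterDiscount({ ...base, code: 'SUMMER25', codeExpired: true }) === 100,
  },
  {
    sentence: 'FRIEND10 applies only to first-time buyers',
    holds: () =>
      priceAfterDiscount({ ...base, code: 'FRIEND10', isFirstTimeBuyer: true }) === 90 &&
      priceAfterDiscount({ ...base, code: 'FRIEND10', isFirstTimeBuyer: false }) === 100,
  },
  {
    sentence: 'the returned amount is never less than zero',
    holds: () => priceAfterDiscount({ ...base, amount: 0, code: 'SUMMER25' }) >= 0,
  },
];

// Returns the sentences the implementation currently violates.
function brokenSentences(): string[] {
  return contract.filter((c) => !c.holds()).map((c) => c.sentence);
}
```

When something fails, the output is a sentence about the business rule, not a branch name. In a Jest suite each `sentence` would simply become the `it()` title.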

The Second Lie: Edge Cases That Sound Like Edge Cases

Ask the AI for edge cases and it will produce them with confidence. Here's a list it will reliably generate for almost any function it sees:

Empty string
null and undefined
Empty array
Array of length 1
Zero, negative one, the integer max
Very long strings
Whitespace-only input
Unicode that "might be tricky"

Every one of these is a real edge case. Every one of these is also the obvious edge case, the kind you'd find on the first page of any "writing better tests" article. They're useful, they should be covered, and you should not be impressed that the AI thought of them.

The bugs that hurt in production almost never come from the obvious edges. They come from domain edges. A short, incomplete list of domain edges the AI will not generate on its own, in roughly increasing order of "yeah, that was the production incident":

The discount code is valid but the customer has already used it twice.
The array of items contains two references to the same product, ordered separately, and the dedupe logic was written assuming order IDs are unique.
The username collation is case-insensitive in MySQL but case-sensitive in your application layer.
A scheduled job runs at midnight UTC, on a server set to local time, on the night the locale crosses DST, so the job runs twice, then not at all.
The idempotency key for retries is hashed from the request body, and the body contains a timestamp field that changes between retries.
A user has two pending password resets; the older one is still valid until the newer one is consumed; the order of consumption matters.
The "soft-deleted" flag is set on a row, but a related foreign key still points to it, and a join silently drops orders from yesterday's report.

None of those are findable by reading the function body. All of them require knowing the system. The AI can be told about them (feed it the domain rules and it'll generate the tests cleanly), but it will never generate them on its own, because it can't. It doesn't know your timezone bug history, your collation quirks, your retry semantics, or the unspoken invariant that nobody documented because everyone in the room when it was decided is gone.

What this means for daily practice is simple. When you ask the AI for "edge cases for this function", you will get the obvious ones. Take them, they're free. But then sit for one minute and write down, by hand, one real domain edge for the function you're testing. Just one. The honest list of "things this function is supposed to handle that I would worry about at 3am" is short, but it's the list that catches the bugs that wake you up.

There's a habit you can build for this. Every time you finish a function, jot a one-line "things I'm scared of" comment somewhere: in a notebook, in a draft PR description, in a // TODO: tests line. Don't filter. Then feed those lines to the AI as the test prompt. You're not asking the AI to think of edge cases; you're asking it to write the assertions for the edge cases you already thought of. The model is much better at the second job.
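As an illustration, take the idempotency-key worry from the list above and turn it into a test. This is a sketch under loud assumptions: the `PaymentRequest` shape and both key functions are invented for the example, and `JSON.stringify` stands in for whatever hashing a real system uses.

```typescript
// Hypothetical request shape; `sentAt` stands in for any field that changes between retries.
type PaymentRequest = { customerId: string; amountCents: number; sentAt: string };

// Naive key: derived from the whole body, so it changes whenever `sentAt` changes.
// (JSON.stringify is a stand-in for a real hash function.)
function naiveIdempotencyKey(req: PaymentRequest): string {
  return JSON.stringify(req);
}

// Stable key: built only from the fields that identify the operation.
function stableIdempotencyKey(req: PaymentRequest): string {
  return JSON.stringify([req.customerId, req.amountCents]);
}

// The "thing I'm scared of", as a sentence:
// "A retry of the same payment must produce the same idempotency key."
function retryKeepsKey(keyOf: (req: PaymentRequest) => string): boolean {
  const first: PaymentRequest = {
    customerId: 'c_1',
    amountCents: 1000,
    sentAt: '2024-01-01T00:00:00Z',
  };
  // Same payment, sent again five seconds later.
  const retry: PaymentRequest = { ...first, sentAt: '2024-01-01T00:00:05Z' };
  return keyOf(first) === keyOf(retry);
}
```

`retryKeepsKey(naiveIdempotencyKey)` is false and `retryKeepsKey(stableIdempotencyKey)` is true. A production key would likely need more than customer and amount (a client-generated request ID, for instance), but that's exactly the kind of decision the model can't make for you.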

The Third Lie: Mocks That Make The Test A Lie About Itself

This is the worst of the three, because it's the most subtle and the most loved. AI loves to mock things. Anything with a side effect (a database call, an HTTP request, the clock, the random source, the file system, the message bus) gets mocked by default. The reason makes sense: tests should be deterministic, dependencies should be isolated, and mocking is the canonical way to do both. The AI is following a well-documented pattern.

The pattern fails in two specific ways, and both happen often enough that you can predict them.

Failure one: the mock is wrong, but the test passes anyway. You're testing a function that calls an external API. The AI mocks the API and returns a fixture. The fixture is shaped however the AI thinks the API responds, which is what an example on the internet looked like, not what your specific provider returns. Your function reads response.data.success; the real API returns response.body.ok; the test never notices because the mock was built from your function's assumptions, not from a real call.

src/payments/charge.test.ts

import { chargeCustomer } from './charge'; // assumed location of the function under test

jest.mock('../lib/stripe', () => ({
  charge: jest.fn().mockResolvedValue({
    data: { success: true, id: 'ch_123' }
  }),
}));

describe('chargeCustomer', () => {
  it('returns a charge id on success', async () => {
    const result = await chargeCustomer({ amount: 1000, customerId: 'c_1' });
    expect(result).toBe('ch_123');
  });
});

This test passes. It will keep passing forever. It is also wrong in a way you cannot detect from inside the test file, because the only thing it verifies is that the function reads from the fictional shape the AI dreamed up. The real provider could return { success: false, error: 'card_declined' } in any number of edge cases, and this test would tell you nothing about whether your code handles them.

Mocking is fine. Mocking blind is the problem. Mock at the system boundary against a recorded real response, or keep at least one end-to-end test that exercises the real client against a sandbox. If the AI is mocking the dependency, the mock fixture should come from a real captured response, not from the AI's guess at the shape.
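A minimal sketch of what "mock against a recorded response" can look like. The fixtures below are hypothetical and are not a real provider's response shape; in practice you'd paste them from an actual sandbox call. `parseChargeResponse` is also an invented name. The point is that the parser rejects any shape it doesn't recognize, so a guessed fixture fails loudly.

```typescript
// Hypothetical fixtures. In practice these would be captured from a real sandbox call,
// not typed by hand or invented by a model. This is NOT a real provider's shape.
const recordedSuccess: unknown = JSON.parse('{"body":{"ok":true,"id":"ch_123"}}');
const recordedDecline: unknown = JSON.parse('{"body":{"ok":false,"error":"card_declined"}}');

// The shape a model might guess from the function's own assumptions.
const guessedByModel: unknown = { data: { success: true, id: 'ch_123' } };

type ChargeResult = { ok: true; id: string } | { ok: false; error: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Parses a raw response and throws on any shape it does not recognize,
// so a wrong assumption fails the test instead of passing silently.
function parseChargeResponse(raw: unknown): ChargeResult {
  const body = isRecord(raw) ? raw.body : undefined;
  if (!isRecord(body)) {
    throw new Error('unexpected response shape: missing body');
  }
  const id = body.id;
  const error = body.error;
  if (body.ok === true && typeof id === 'string') {
    return { ok: true, id };
  }
  if (body.ok === false && typeof error === 'string') {
    return { ok: false, error };
  }
  throw new Error('unexpected response shape: unrecognized body');
}
```

Both recorded fixtures parse, including the decline path the original test never exercised, while `guessedByModel` throws. When the provider changes its response, you re-record the fixture and the test tells you what broke.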

Failure two: the mock freezes the implementation. You have a function that orchestrates work across several internal services. The AI mocks all of them and asserts that each one was called in a specific way:

src/orders/place.test.ts

expect(inventory.reserve).toHaveBeenCalledWith({ productId: 'p_1', qty: 2 });
expect(payments.charge).toHaveBeenCalledWith({ amount: 1000, customerId: 'c_1' });
expect(notifications.send).toHaveBeenCalledTimes(1);

This is a totally normal-looking test. It will catch a regression where you accidentally stop charging the customer. It will also break the moment you refactor the function to call inventory.reserveMany([{...}]) instead of inventory.reserve({...}), even when the new code is correct and the outcome is identical. The test no longer verifies the contract, only the internal call shape. The first thing that happens after a senior engineer refactors a service is that thirty mock-heavy tests turn red, and the next thing that happens is someone on the team starts arguing that the tests are slowing them down. Tests in general aren't slowing the team down. These specific tests are.

The pattern that goes wrong here isn't "AI wrote mocks". It's "AI mocked internal collaborators". The healthy rule is roughly: mock at the edge of your system, use real instances of the things inside it. Your function's call to stripe.charge() deserves a mock. Your function's call to your own inventory.reserve() does not. Use the real inventory module, in-memory if needed.

A Go example to make the pattern concrete, because the same trap shows up identically:

orders/place_test.go

func TestPlaceOrder(t *testing.T) {
    inv := mocks.NewMockInventory(t)
    pay := mocks.NewMockPayments(t)
    notif := mocks.NewMockNotifications(t)

    inv.EXPECT().Reserve("p_1", 2).Return(nil)
    pay.EXPECT().Charge(1000, "c_1").Return("ch_123", nil)
    notif.EXPECT().Send(mock.Anything).Return(nil)

    err := PlaceOrder(inv, pay, notif, Order{ProductID: "p_1", Qty: 2})
    require.NoError(t, err)
}

That test is asserting the exact mechanism PlaceOrder uses today. It is not asserting that the order was placed. It is not asserting that the customer ended up charged the right amount. It is not asserting that an inventory shortage is handled. It is asserting that one specific function was called with one specific signature, full stop. The day someone batches reservations or moves notifications to a queue, every test like this breaks for the wrong reasons.

The AI is not going to make this distinction for you. It will mock everything in sight, including the things you wish it hadn't, because mocking is the path of least resistance and produces tests that pass on the first run.
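For contrast, here's a minimal sketch of the outcome-based version, in TypeScript to match the earlier examples. All names (`InMemoryInventory`, `FakeGateway`, `placeOrder`, the `Order` fields) are hypothetical, and the code is synchronous purely for brevity. The inventory is a real in-memory implementation; only the payment provider, which sits at the system boundary, is faked.

```typescript
// Boundary: the external payment provider. This is the one thing worth faking.
interface PaymentGateway {
  charge(amountCents: number, customerId: string): string; // returns a charge id
}

// Internal collaborator: a real, in-memory implementation rather than a mock.
class InMemoryInventory {
  private stock: Record<string, number>;

  constructor(initial: Record<string, number>) {
    this.stock = { ...initial }; // copy so separate tests cannot share state
  }

  available(productId: string): number {
    return this.stock[productId] ?? 0;
  }

  reserve(productId: string, qty: number): void {
    if (this.available(productId) < qty) throw new Error('insufficient stock');
    this.stock[productId] = this.available(productId) - qty;
  }
}

// Fake at the edge. It records outcomes (what was charged), not call shapes.
class FakeGateway implements PaymentGateway {
  charges: Array<{ amountCents: number; customerId: string }> = [];

  charge(amountCents: number, customerId: string): string {
    this.charges.push({ amountCents, customerId });
    return `ch_${this.charges.length}`;
  }
}

type Order = { productId: string; qty: number; unitPriceCents: number; customerId: string };

// Hypothetical orchestrator: reserve first, so a shortage throws before any charge.
function placeOrder(inv: InMemoryInventory, pay: PaymentGateway, order: Order): string {
  inv.reserve(order.productId, order.qty);
  return pay.charge(order.qty * order.unitPriceCents, order.customerId);
}
```

A test against this checks that stock went from 5 to 3, that one charge of 1000 cents exists, and that a shortage leaves both untouched. If `placeOrder` later batches its reservations, none of those assertions need to change.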

A Working Discipline

The three lies (implementation-shaped unit tests, shallow edge cases, brittle mocks) have one shared root, and once you see it, you can stop chasing the symptoms. The root cause is letting the AI decide what the test is for.

When you let the AI generate the contract, the implementation, and the verification, you end up with a closed system that's internally consistent and externally wrong. Nothing inside that system can tell you it's wrong, because everything inside it was generated from the same blind spot. The way out isn't to use the AI less. It's to keep one specific job out of the AI's reach.

The job to keep is naming what the test is supposed to prove. One sentence per test, in your own words, before you open the AI chat. The sentence doesn't need to be elegant. "Returns the original amount when the code has been used twice already." That's enough. The AI can write the body, mock the right things, generate twelve variants, do all the typing. The intent stays with you.

For the dependencies, default to real implementations of anything internal: your own modules, your own services, in-memory versions if the real thing is slow. Only mock at the system boundary, and when you do, mock against captured real responses, not against the shape you wish the response had. If you remember nothing else from this piece, remember that an AI's idea of "what the API returns" is a guess shaped by training data, not by your actual provider.

For the edge cases, write one real domain edge by hand for every function. The AI's list of nulls and empties is free; take it. But the one case that actually scares you, the one you'd worry about at 3am if you didn't write a test, that one is yours.

What This Looks Like In Practice

The most honest version of an AI-assisted test workflow is also the least glamorous. You spend two minutes writing down what the function is supposed to do, in plain sentences. You hand those sentences to the AI and let it generate test bodies. You read every assertion. You delete every mock that's mocking your own code. You write one domain-edge test by hand. You commit.

That's the whole loop. It's slower than "AI, write me tests" (by maybe five minutes per function) and it produces a test suite that does what test suites are supposed to do, which is fail when the system stops working. The flashier workflow produces test suites that fail when the system stops looking the same.

If you've been using AI to generate tests and you're not sure which suite you have, there's a fast diagnostic. Pick a function that's covered by an AI-written test. Change a small thing about its implementation: rename an internal method, change the order of two calls, switch a forEach to a for. If your suite still passes, the tests are about behavior. If half the suite turns red without any user-visible change, the tests are about implementation. The first is what you wanted. The second is what the AI gave you by default.
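Here's a toy version of that diagnostic, with invented names (`Item`, the three `cartTotal*` functions, `behaviorHolds`). It's a sketch of the idea rather than a recipe: one behavior-level check, run against the original implementation, a harmless refactor, and a genuinely broken variant.

```typescript
type Item = { priceCents: number; qty: number };

// Original implementation.
function cartTotalForEach(items: Item[]): number {
  let total = 0;
  items.forEach((item) => {
    total += item.priceCents * item.qty;
  });
  return total;
}

// Refactor: same behavior, different internals (forEach switched to a for loop).
function cartTotalForLoop(items: Item[]): number {
  let total = 0;
  for (let i = 0; i < items.length; i++) {
    total += items[i].priceCents * items[i].qty;
  }
  return total;
}

// Broken variant: the behavior changed (quantity is ignored).
function cartTotalBroken(items: Item[]): number {
  let total = 0;
  for (const item of items) {
    total += item.priceCents;
  }
  return total;
}

// A behavior-level check: it only looks at inputs and outputs.
function behaviorHolds(cartTotal: (items: Item[]) => number): boolean {
  return (
    cartTotal([]) === 0 &&
    cartTotal([
      { priceCents: 250, qty: 2 },
      { priceCents: 100, qty: 1 },
    ]) === 600
  );
}
```

The check passes for both `cartTotalForEach` and `cartTotalForLoop` and fails for `cartTotalBroken`. A test that spied on `forEach` would do the opposite on the refactor: red for a change no user can see.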

AI test generation is a real productivity win. It's also a tool you can use to prove the wrong thing faster than ever before, and the output looks the same either way: green CI, climbing coverage, a confident summary in the PR. The difference shows up later, somewhere downstream, in the kind of bug where someone says "but every test was passing" and they're right, and that's the problem.

This article was created with the help of AI. The ideas, structure, and opinions are my own, and I checked it for factual accuracy before publishing.

Originally published at nazarboyko.com.

Top comments (19)

Don Johnson • Jul 2

Really enjoyed this — "it inferred a contract from the shape of the code, which is
exactly the wrong direction" is the whole ballgame.

Let me add one thing that turns your diagnostic into a power tool. Your test —
change the implementation without changing behavior; if the tests fail, they test
implementation — is mutation testing done by hand, once. Mutation testing automates
it: it makes thousands of tiny changes to your code and reports which ones your suite
failed to notice. Every surviving mutant is a test that was asleep on the job. It's
the honest answer to "but every test was passing," because it stops trusting the
green bar and asks the only question that matters: who's testing the test?

And that's exactly where AI-generated suites get exposed. Implementation-shaped tests
inflate line coverage while tanking mutation score — 96% coverage / 34% mutation
score is a real and depressingly common gap. The AI wrote a pile of tests that assert
the code's shape, so naturally they survive when you mutate the shape's behavior.
Mutation score is the lie detector this whole article is circling.

There's a nice synergy on top: a surviving mutant is a coordinate — it tells you
where the suite is blind. Property-based / diverse-data generation is the search
that kills it — it tells you what inputs to throw at that spot. Mutants point,
generated data shoots. I wrote that pairing up here:
High-Confidence Testing with Mutation Analysis and Diverse Test Data.

Last thing, on your "shallow edge cases." Timezone boundaries, idempotency
collisions, soft-delete cascades — those aren't input edges, they're schedule
edges. They don't live in a function body, so there's no shape for an AI (or a human
reading code) to infer them from; they only exist in the ordering of events across
the system. That's the beat for deterministic simulation testing: fork
run the same real code across thousands of event orderings, and the rare-schedule bug
stops being a cryptid that vanishes when you look at it. Wrote that on
VOPR: The Multiverse Machine That Kills Production Bugs.

Great piece.

Nazar Boyko • Jul 2

This is a brilliant addition. "Who's testing the test?" is the question the whole piece was circling without ever naming it. You're right that mutation score is the missing lie detector: the green bar only proves the tests ran, mutation testing proves they were awake. That 96% coverage / 34% mutation gap is basically the article compressed into two numbers.
And your input-edge vs schedule-edge distinction is sharper than how I framed it. Timezone boundaries, idempotency collisions, soft-delete cascades really don't live in a function body for an AI (or a human) to infer, they only exist in the ordering of events. "Mutants point, generated data shoots" is a great way to say it. Both your pieces are going on my reading list.
Thanks for the thoughtful comment! 🙏

Wren Calloway • Jul 3

The refactor diagnostic at the end is good, but it has a failure mode worth naming: it rewards behavioral tests that are still verifying the wrong behavior. A test can survive every implementation change you throw at it and still be asserting your buggy assumption faithfully — the SUMMER25 test passes through any rename, and it's exactly the test that encodes "discounts are always positive" without ever checking it. Surviving refactors proves the test is coupled to behavior, not that it's coupled to the right behavior.

The stronger mutation to run isn't renaming a method — it's changing the outcome. Flip a boundary the way a bug would: return amount when the code is expired instead of erroring, or let the discount go negative. If the suite still passes, you've learned something the refactor test can't tell you, which is that your contract has a hole in it, not just that your tests are implementation-shaped. That's basically hand-rolled mutation testing, and it's the one check that catches the closed-loop problem you spend the whole piece describing — because it attacks the shared assumption directly instead of the code shape around it.

Nazar Boyko • Jul 3

Great point, Wren! Thanks for this. You're right, surviving a refactor only proves the test isn't implementation-coupled, not that it's asserting the right behavior. Flipping a real boundary condition (mutation-style) is the stronger check. Might fold that distinction into the piece.

Wren Calloway • Jul 3

Thank you Nazar, have a great day!!!

Nazar Boyko • Jul 4

You as well!

Theo Valmis • Jul 2

The lie in AI test generation is subtle: it writes tests that pass, which feels like coverage but often just encodes the code's current behavior, bugs included. It's great at the mechanical cases you'd have skipped out of boredom. It's weak exactly where tests matter most, the edge case you didn't think of, because it's reasoning from the same blind spots as the code. Tests generated from the implementation prove it does what it does, not what it should. The value is real for breadth; the trap is mistaking that breadth for correctness.

Nazar Boyko • Jul 2

"Reasoning from the same blind spots as the code": that's the sharpest one-line version of what I was trying to say the long way around. That's the whole trap: the AI infers what to test from the implementation, so it can only ever confirm the code does what it does, never what it should. The breadth is real and genuinely useful; the danger is reading that breadth as correctness.
Thanks for reading!

Kartik N V J K • Jul 2

The "tests what the code currently does, not what it's supposed to do" point is the one I'd underline. Generated tests love to pin the current implementation, so they pass on day one and then block every refactor while catching none of the regressions that matter. Capturing fixtures from recorded real responses instead of letting the model invent them is the step most people skip, and it's what keeps the mocks honest.

Nazar Boyko • Jul 2

Exactly, that's the trap. A green suite on day one feels like progress but it's really just a snapshot of whatever the code happened to do, bugs included. Recording real responses for fixtures is underrated precisely because it's the boring, unglamorous part, and that's where the honesty comes from. Once the mocks drift from reality the whole safety net turns into theater.

Sloan the DEV Moderator • Jul 2

Hey, this article appears to have been generated with the assistance of ChatGPT or possibly some other AI tool.

We allow our community members to use AI assistance when writing articles as long as they abide by our guidelines. Please review the guidelines and edit your post to add a disclaimer.

Failure to follow these guidelines could result in DEV admin lowering the score of your post, making it less visible to the rest of the community. Or, if upon review we find this post to be particularly harmful, we may decide to unpublish it completely.

We hope you understand and take care to follow our guidelines going forward!

Nazar Boyko • Jul 3

Hey, thanks for the heads-up and the link to the guidelines.
The ideas, structure, and opinions in this piece are my own. It's based on real experience writing and reviewing tests, and I checked it for factual accuracy before publishing. I do use AI to help me draft and tidy up my writing, so I'm glad to be transparent about it. I've added a disclosure line to the post per the guidelines. Thanks for keeping the standards high.
Happy to adjust anything else if needed.

Octahedron • Jul 7

wow

Vic Chen • Jul 3

Really strong framing. The line that stuck with me was “the AI is allowed to generate the body of the test, not the intent of the test.” That matches what we see in production: model-generated tests look great until a domain invariant shifts and you realize the suite was only mirroring the implementation. I also liked the distinction between obvious edge cases and domain edges — the DST / idempotency examples are exactly the kind of failures that never show up in generic “add more tests” advice. Feels like the practical workflow is: human writes the contract in plain English, AI expands coverage around it, and reviewers stay ruthless about business semantics.

Nazar Boyko • Jul 4

Thanks, that means a lot. And you nailed the workflow better than I did in the post. Human writes the contract in plain English, AI fills in coverage, reviewers stay ruthless about the business rules. That last part is where most teams get lazy and let the suite drift into just mirroring the code. Glad it resonated.

socaity • Jul 6

amazing read, thanks for sharing!

Nazar Boyko • Jul 11

Thank you, really glad you enjoyed it!

Himanshu Agarwal • Jul 2

AI saves time, but humans provide the context, critical thinking, and final confidence before anything reaches production.

Nazar Boyko • Jul 2

Exactly, that's the line I was trying to draw. AI can write the tests fast, but it can't decide what "correct" means for your system. That last mile of context and judgment stays with us, and it's the part that actually keeps things safe in prod. Thanks for reading! 🙏
