Every developer knows the feeling. You finished the feature. It works. The PR is ready. But the test coverage is thin, and you know it.
You tell yourself you will come back and add tests later. You will not.
Testing debt is one of the most common problems in software development — not because developers do not value tests, but because writing good tests takes time nobody has. AI test generation tools do not eliminate that problem entirely, but they shift the math enough to matter.
Here is how to use AI to generate tests that are actually useful, and how to avoid the traps that make AI-generated tests worse than no tests at all.
The testing debt problem
A 2024 survey by Testim found that 62% of development teams ship code with less test coverage than their own standards require. The reason is almost always time pressure.
Writing a solid unit test suite for a single function can take as long as writing the function itself. Edge cases, error paths, mock setups — it adds up. When the deadline hits, tests are the first thing that gets cut.
The consequences show up later:
- Regressions slip through. A change in one module breaks something three layers down. Nobody catches it until production.
- Refactoring becomes terrifying. Without tests, changing existing code is a gamble. Teams avoid necessary improvements because they cannot verify nothing broke.
- Onboarding slows down. New developers have no test suite to learn from. They read the code, guess at intent, and hope for the best.
AI test generation does not fix the root cause — time pressure will always exist. But it can turn a two-hour testing task into a 30-minute review task. That changes what is realistic.
What AI test generation actually produces
When you point an AI tool at your code and ask for tests, here is what you typically get:
Unit tests
AI is strongest here. Given a function, it generates tests for:
- Happy path — standard inputs producing expected outputs
- Edge cases — empty inputs, null values, boundary conditions, max/min values
- Error handling — invalid inputs, exceptions, error codes
- Type variations — different data types that the function should handle or reject
For a function like this:
```javascript
function calculateDiscount(price, customerType, quantity) {
  if (price <= 0) throw new Error("Invalid price")
  if (quantity < 1) throw new Error("Invalid quantity")
  let discount = 0
  if (customerType === "premium") discount = 0.15
  if (customerType === "wholesale") discount = 0.25
  if (quantity >= 100) discount += 0.05
  return price * quantity * (1 - discount)
}
```
AI will typically generate 10-15 test cases covering each branch, each error condition, and combinations like premium customer with 100+ items. That is exactly the kind of thorough-but-tedious work that developers skip under time pressure.
Integration tests
AI handles these less reliably but still usefully. Given an API endpoint or a database interaction, it generates tests that check:
- Request/response structure
- Status codes for various scenarios
- Data persistence after operations
- Error responses for invalid requests
The catch: AI often generates integration tests that look correct but make wrong assumptions about your infrastructure, database state, or service dependencies. These require more review.
Mock generation
AI is surprisingly good at generating mock objects and test fixtures. It reads your interfaces and produces realistic test data that covers the shapes you need. This alone can save significant setup time.
Step by step: using AI to generate tests for existing code
Step 1: Start with a single file
Do not try to generate tests for your entire codebase at once. Pick one file — ideally a utility module or service with clear inputs and outputs.
Step 2: Provide context
Give the AI tool:
- The source file you want to test
- Any interfaces or types it depends on
- Your testing framework (Jest, pytest, Go testing, etc.)
- Your project's test conventions (file naming, describe/it structure, assertion style)
Example prompt:
"Generate unit tests for the following function using Jest. Follow AAA pattern (Arrange, Act, Assert). Include edge cases for empty inputs, boundary values, and error paths. Use descriptive test names that explain the expected behavior."
Then paste your code.
Step 3: Review the output critically
This is where most people go wrong. They paste AI-generated tests into their project and move on. Do not do this.
Check each test for:
- Does it actually test the right behavior? AI sometimes tests implementation details instead of outcomes. A test that checks whether a specific internal method was called is brittle. A test that checks the return value is useful.
- Are the assertions meaningful? Watch for tests that assert truthy values without checking specifics. expect(result).toBeTruthy() passes for almost anything; expect(result).toEqual({ discount: 0.15 }) actually catches bugs.
- Do the mocks make sense? AI can generate mocks that do not reflect how your dependencies actually behave. If your database throws a specific error type, the mock should throw that same type.
- Are there duplicate tests? AI often generates multiple tests that check the same behavior with slightly different wording. Remove the duplicates.
Step 4: Add the tests AI missed
AI is good at generating obvious test cases. It is worse at testing:
- Business logic edge cases that require domain knowledge ("what happens when a user's subscription expires mid-billing-cycle?")
- Race conditions and timing-sensitive behavior
- Security-relevant paths (injection, authorization boundaries)
- Integration points where your system talks to external services
After reviewing AI output, add 2-3 tests that require knowledge of your system's actual usage patterns. These are the high-value tests that catch real bugs.
Step 5: Run and refine
Run the generated tests. Some will fail — usually because of incorrect assumptions about imports, module paths, or test setup. Fix these issues and note the patterns so you can improve your prompts for next time.
For more on integrating AI into your development workflow, see our AI productivity guide.
AI-generated tests vs. hand-written tests
After using AI test generation on several projects, here is an honest comparison:
| Aspect | AI-generated | Hand-written |
|---|---|---|
| Coverage breadth | Excellent — catches edge cases humans skip | Good but often misses less obvious paths |
| Coverage depth | Surface level — tests what the code does, not why | Deeper — tests business intent and invariants |
| Setup time | Minutes | Hours |
| Maintenance | Can be brittle if testing implementation details | Usually more stable and intention-revealing |
| Domain-specific cases | Weak — misses business logic edge cases | Strong — reflects real-world usage |
| Readability | Decent but sometimes verbose | Varies by developer |
The sweet spot is using both. AI generates the baseline coverage (happy path, edge cases, error handling). You add the tests that require understanding of your business domain and system behavior.
This approach consistently produces better coverage in less time than either method alone. For teams looking to automate more of their development workflow, our AI automation guide covers the broader landscape.
Best practices for reviewing AI-generated tests
Name tests for behavior, not implementation
If AI generates a test called test_calls_database_save_method, rename it to test_persists_user_after_registration. Tests should describe what the system does, not how it does it internally.
Delete tests that test the language
AI sometimes generates tests that verify basic language features — checking that a string concatenation works, or that array indexing returns the right element. These add noise without value. Delete them.
Watch for false confidence
A test suite with 95% coverage that only checks happy paths is more dangerous than 60% coverage that includes error paths. AI tends to generate many happy-path variations that inflate coverage numbers without testing the code that actually breaks in production.
Keep the prompt library
When you find a prompt that produces good tests for your codebase, save it. Build a team-shared prompt library organized by test type (unit, integration, API) and framework. This is similar to how we recommend building prompt libraries for writing documentation with AI — reusable prompts compound in value over time.
Run mutation testing
If you want to verify that your AI-generated tests actually catch bugs, run a mutation testing tool (Stryker for JS/TS, mutmut for Python, go-mutesting for Go). These tools intentionally introduce small bugs, called mutants, into your code and check whether your tests fail. AI-generated tests often score lower on mutation testing than hand-written tests, which tells you exactly where to add manual tests. For end-to-end coverage, frameworks like Playwright and Selenium can run AI-generated test scenarios against your live application.
Frequently asked questions
Can AI test generation tools replace writing tests manually?
Not entirely. AI handles the repetitive parts well — basic unit tests, edge cases, error path coverage. But tests that verify business logic, system integration, and real-world usage patterns still need human input. The most effective approach is using AI for baseline coverage and adding manual tests for domain-specific scenarios.
Which AI tools are best for test generation?
For IDE-integrated test generation, GitHub Copilot and Cursor generate tests inline as you code. For batch generation across files, tools like Qodo (formerly CodiumAI) and Diffblue Cover (Java-specific) work well. General-purpose AI assistants (ChatGPT, Claude) work with any language if you provide good prompts and context.
Are AI-generated tests reliable enough for CI/CD pipelines?
Yes, after review. The generated tests themselves run like any other test — they are regular code in your testing framework. The reliability concern is about test quality, not test execution. Review AI-generated tests before merging them, just as you would review any code. Pay particular attention to mock accuracy and assertion specificity.
How do I handle flaky AI-generated tests?
Flaky tests usually come from two sources: timing assumptions (hardcoded delays, race conditions) and external dependencies (tests that accidentally hit real APIs instead of mocks). When an AI-generated test is flaky, check whether it is truly testing your code or testing infrastructure. If it is testing infrastructure, rewrite it as a proper integration test with appropriate setup/teardown, or delete it if the behavior is already covered elsewhere.
Originally published on Superdots.