I shipped a bug to production in January that embarrassed me. Not a subtle bug. A bug where a rate limiting function the AI wrote silently swallowed errors and returned true for every request, which meant our rate limiter was not actually rate limiting anything. The function looked fine on a visual scan. The TypeScript compiled. My quick test of the happy path worked. I merged it.
The rate limiter failure showed up a week later when someone ran 4,000 requests in two minutes and our costs spiked.
Here is what made it worse: the AI (Claude Code, in this case) had actually written a comment in the code that the error handling was a placeholder. I had not read that line carefully. I trusted the overall shape of the code without really testing it.
That experience changed how I approach AI-generated code review. Not by trusting AI less, but by building a real testing process instead of relying on "it looks right."
Why AI Code Needs Different Testing Habits
The danger is not that AI writes bad code. It writes good code most of the time. The danger is that it writes confident-looking code consistently, which dulls your instinct to check carefully.
When a junior developer writes code that does something unexpected, there is a natural flag in your brain. This person is learning. There might be edge cases they missed. You slow down.
AI code does not trigger that flag. It looks like polished, professional code. The variable names are sensible, the function structure is clean, the comments are there. So you scan it the way you would scan code from a senior engineer you trust, not the way you would test code from someone who confidently writes plausible-but-wrong implementations a few percent of the time.
But that few percent matters. On a large codebase where agentic coding tools are writing hundreds of functions per week, a few percent failure rate is a lot of bugs in flight.
The other thing that changes with AI code: the failure modes are different. Human bugs tend to cluster around the things humans find cognitively hard. Off-by-one errors. Race conditions. Forgetting to handle a specific edge case the author did not think of.
AI bugs are often more subtle. The AI knows the edge cases. It will handle them, but sometimes with logic that is plausible rather than correct. It handles the error case by returning a default value that happens to be wrong in production context. It implements a security check correctly for the example in its training data but misses an edge case specific to your implementation.
This means you need tests, not just code review.
The Testing Gap in AI-Assisted Development
There is a pattern I keep seeing. Developers use AI to write application code at 3x to 5x their previous speed. Then they use AI to write tests, but in a way that just adds more code, not more confidence.
"Write tests for this function" produces tests that test the same logic the function implements. The happy path passes. The cases the AI thought of are covered. But the tests are written by the same reasoning process that wrote the code, which means they share the same blind spots.
This is the testing gap in AI-assisted development. You have more code, but not more verification. The test suite looks comprehensive but provides almost no additional safety beyond what TypeScript compilation already gives you.
Real testing for AI-generated code requires something different: testing driven by your understanding of the problem, not AI's understanding of the code it just wrote.
Static Analysis First
The cheapest form of testing is static analysis. It costs almost nothing to set up and it catches a real category of AI bugs before they reach your test suite.
TypeScript is your first layer, but you need to use it properly. This means strict mode.
```json
{
  "compilerOptions": {
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "exactOptionalPropertyTypes": true
  }
}
```
The noUncheckedIndexedAccess flag is particularly useful for AI-generated code because AI often writes array access patterns that look correct but do not handle the undefined case. Turning this flag on surfaces those issues immediately.
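As a small illustration, here is the kind of pattern the flag forces you to handle. The `limits` lookup and plan names are invented for this example:

```typescript
// With "noUncheckedIndexedAccess": true, an indexed access like limits[plan]
// has type number | undefined, so the compiler forces you to handle the miss.
const limits: Record<string, number> = { free: 100, pro: 1000 };

function getLimit(plan: string): number {
  const limit = limits[plan];
  if (limit === undefined) {
    // Without the flag, this branch is easy to omit, and the undefined
    // leaks downstream into arithmetic as NaN.
    throw new Error(`Unknown plan: ${plan}`);
  }
  return limit;
}

console.log(getLimit('free')); // 100
```

Without the flag, `limits[plan]` is typed as plain `number` and the unknown-plan case compiles silently.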
ESLint with relevant plugins catches things TypeScript misses. If you are writing Node.js backend code, eslint-plugin-security flags common security anti-patterns that AI sometimes introduces. If you are writing React, eslint-plugin-react-hooks catches dependency array mistakes that Claude gets wrong maybe 10% of the time.
Biome is worth considering as a replacement for the ESLint setup if you want a faster, more opinionated static analysis tool. It ships with 200+ built-in rules and runs fast enough to use in a pre-commit hook without slowing down your workflow.
The point is not to run any of these manually. Put them in your CI pipeline and run them as a pre-commit hook locally. AI-generated code should pass static analysis before it is even reviewed.
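As a sketch, assuming an npm-based TypeScript project, wiring these into one command might look like this (the script names are illustrative, not a convention you have to follow):

```json
{
  "scripts": {
    "typecheck": "tsc --noEmit",
    "lint": "eslint .",
    "check": "npm run typecheck && npm run lint"
  }
}
```

A pre-commit hook, whether via husky, lefthook, or a plain `.git/hooks/pre-commit` script, can then run `npm run check`, and CI runs the same command so the local gate and the pipeline gate cannot drift apart.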
Testing What AI Gets Wrong
Once you have static analysis in place, you need tests that specifically target the failure modes of AI-generated code. This means thinking adversarially about what the AI might have gotten wrong.
Boundary and Edge Case Testing
AI code often handles the happy path and the most obvious edge cases correctly. It struggles with the boundaries that are specific to your system rather than the general category of problem.
For any function that processes user input or external data, write tests for:
- The minimum and maximum valid values
- One step outside each boundary (what happens with -1 when 0 is the minimum valid value?)
- Empty and null inputs, even if the type system says they should not exist
- Inputs that are technically valid but semantically unusual (an email address that is 254 characters long, which is valid per spec)
I have a pattern I call "assume it is wrong at the edges." For every AI-generated function that transforms data, I write at least three tests for inputs outside the expected range before I look at the implementation. This forces me to think about the contract rather than the implementation, and it often catches places where the contract is not what I assumed.
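As a concrete sketch of that pattern, here is a hypothetical `validatePageSize` function and the boundary checks around it. The name and the 1 to 100 range are invented for illustration:

```typescript
// Hypothetical contract: page sizes are integers from 1 to 100 inclusive.
function validatePageSize(input: number): number {
  if (!Number.isInteger(input)) {
    throw new RangeError(`page size must be an integer, got ${input}`);
  }
  if (input < 1 || input > 100) {
    throw new RangeError(`page size must be between 1 and 100, got ${input}`);
  }
  return input;
}

// Helper so the boundary checks read as assertions about the contract.
function throws(fn: () => unknown): boolean {
  try { fn(); return false; } catch { return true; }
}

console.log(validatePageSize(1));   // minimum valid value
console.log(validatePageSize(100)); // maximum valid value
console.log(throws(() => validatePageSize(0)));   // one step below the minimum
console.log(throws(() => validatePageSize(101))); // one step above the maximum
console.log(throws(() => validatePageSize(NaN))); // not an integer at all
```

The point is that these checks come from the stated contract, not from reading the implementation, so they catch the case where the AI implemented a slightly different contract than the one you had in mind.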
Error Handling Verification
The bug I described at the start was an error handling bug. The function swallowed an exception and returned a default value. This is one of the most common AI bug patterns I have seen: technically valid error handling that is semantically wrong.
Write explicit tests that verify error propagation. Do not just test that the function returns the right thing in the success case. Test that it fails the right way.
```typescript
it('throws when the rate limit store is unavailable', async () => {
  mockRedis.get.mockRejectedValue(new Error('Connection refused'));
  await expect(checkRateLimit('user-123')).rejects.toThrow('Connection refused');
});

it('does not allow requests through when the store check fails', async () => {
  mockRedis.get.mockRejectedValue(new Error('Timeout'));
  const result = await checkRateLimit('user-123').catch(() => false);
  expect(result).toBe(false); // fail closed, not fail open
});
```
This is different from just testing the happy path. You are testing the failure contract. AI code that swallows errors and returns defaults will fail these tests immediately.
Concurrency and Race Conditions
This is the failure mode most likely to survive review and reach production. AI code often handles single-threaded logic correctly while introducing subtle race conditions in concurrent scenarios.
If you are writing any code that deals with shared state, queues, or async operations that could run in parallel, write tests that explicitly check concurrent behavior.
```typescript
it('correctly handles concurrent rate limit checks for the same user', async () => {
  const results = await Promise.all([
    checkRateLimit('user-123'),
    checkRateLimit('user-123'),
    checkRateLimit('user-123'),
  ]);
  // With an atomic limiter, each check observes a distinct remaining count.
  // Duplicates mean a lost update: a read-modify-write race in the limiter.
  // (We check distinctness rather than ordering, because the three calls
  // can reach the store in any order.)
  const remainingCounts = results.map(r => r.remaining);
  expect(new Set(remainingCounts).size).toBe(3);
});
```
Concurrency bugs are hard to reliably reproduce through testing, but making the intent explicit in your test suite at least forces you to think about the concurrent behavior and documents the expected contract.
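To make the failure mode concrete, here is a self-contained sketch of the read-modify-write race this kind of test is trying to catch. The in-memory counter stands in for a real store like Redis:

```typescript
// A naive async counter: read the value, await a (simulated) store
// round-trip, then write it back. Concurrent callers all read the same
// starting value, so their increments clobber each other.
let count = 0;

async function naiveIncrement(): Promise<void> {
  const current = count;                    // read
  await new Promise(r => setTimeout(r, 1)); // simulated network round-trip
  count = current + 1;                      // write: loses concurrent updates
}

async function demo() {
  count = 0;
  await Promise.all([naiveIncrement(), naiveIncrement(), naiveIncrement()]);
  console.log(count); // 1, not 3 — two increments were lost
}
demo();
```

The fix is to make the increment atomic in the store itself (for Redis, a single `INCR` rather than a `GET` followed by a `SET`), and the test above is what documents that requirement.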
Integration Tests for AI-Written Modules
Unit tests catch individual function bugs. Integration tests catch the bugs that emerge when AI-written code interacts with your actual system.
The place AI code most commonly breaks in integration is at the boundary with external services. The AI knows the general shape of how an API works. It might not know your specific version's behavior, your account's limits, or the edge cases in how the service responds to malformed requests.
Write integration tests that actually hit your services in a staging environment. Not mocked versions. Real calls.
For database operations specifically, this means tests that:
- Actually write to and read from a test database
- Check that transactions roll back correctly when something fails mid-way
- Verify that index usage is correct by checking query plans for slow-path queries
For external API calls:
- Run against a sandbox or staging environment where the API supports it
- Test response handling with actual API responses, not hardcoded response bodies the AI invented
- Verify that retry logic works by intentionally inducing failures
I know this is more setup than mocking. It is worth it. Production observability catches bugs after they hit users. Integration tests against real services catch a class of bugs that unit tests cannot, before they ship at all.
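Retry logic is the one boundary behavior you can verify cheaply by injecting failures yourself. Here is a sketch; `retryWithBackoff` and the flaky fake are hypothetical stand-ins for illustration, not a real library API:

```typescript
// Minimal retry-with-delay helper (hypothetical, for illustration).
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts: number,
  delayMs: number
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) await new Promise(r => setTimeout(r, delayMs));
    }
  }
  throw lastError;
}

// A fake endpoint that fails a fixed number of times, then succeeds.
// This is the induced failure: you control exactly when the call breaks.
function makeFlakyCall(failures: number): () => Promise<string> {
  let calls = 0;
  return async () => {
    calls++;
    if (calls <= failures) throw new Error('ECONNRESET');
    return 'ok';
  };
}

async function demo() {
  // Two failures, three attempts: succeeds on the third try.
  console.log(await retryWithBackoff(makeFlakyCall(2), 3, 1)); // 'ok'

  // Five failures, three attempts: exhausts retries and rethrows.
  const failed = await retryWithBackoff(makeFlakyCall(5), 3, 1)
    .then(() => false, () => true);
  console.log(failed); // true
}
demo();
```

Both cases matter: AI-written retry wrappers often get the success path right while retrying one time too few, or swallowing the final error instead of rethrowing it.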
Property-Based Testing for Complex Logic
If you have never used property-based testing, AI-generated code is a good reason to start. The idea is simple: instead of writing specific test cases, you describe properties that should always hold, and the testing framework generates hundreds of random inputs to verify those properties.
For AI-generated parsing, validation, or transformation code, property-based tests are particularly powerful because the AI's blind spots tend to be in the input space, not the logic space.
```typescript
import fc from 'fast-check';

test('rate limiter never allows more requests than the limit', async () => {
  // fc.assert returns a promise for async properties; it must be awaited,
  // or the test passes even when the property fails.
  await fc.assert(
    fc.asyncProperty(
      fc.integer({ min: 1, max: 1000 }),
      fc.integer({ min: 0, max: 10000 }),
      fc.string({ minLength: 1 }),
      async (limit, requestCount, userId) => {
        const limiter = createRateLimiter({ limit, windowMs: 60000 });
        let allowedCount = 0;
        for (let i = 0; i < requestCount; i++) {
          const allowed = await limiter.check(userId);
          if (allowed) allowedCount++;
        }
        return allowedCount <= limit;
      }
    )
  );
});
This test generates hundreds of random combinations of limits, request counts, and user identifiers (fast-check runs 100 cases per property by default, and you can turn that up). If your rate limiter ever allows more requests than the configured limit for any combination, the test fails and fast-check shrinks the input to a minimal counterexample. This is a much stronger guarantee than writing five specific test cases.
The fast-check library is the best TypeScript option. For Python, hypothesis is the standard. Both integrate well with standard testing frameworks.
The Code Review Step That Actually Matters
Tests catch what you specify. Code review catches what you did not think to specify.
The AI code review that matters is not checking for style or obvious bugs. TypeScript and linting handle those. The review that matters is checking semantic correctness: does this code do what we actually need it to do?
This requires reading the code with the problem in mind, not the implementation. Ask yourself:
What is this code supposed to guarantee? Not what does it do, but what invariant does it enforce? If you cannot state that clearly, the code is not ready to ship regardless of whether it looks right.
What happens to the user if this fails? If the function returns a wrong value, what does the downstream code do? If it throws an exception, where does that get caught? Tracing the failure path through the system surfaces bugs that isolated code review misses.
What changed compared to what was there before? Diff review is natural for human-written code. With AI-generated code, especially when an agent refactors or extends existing functionality, the diff can be large and the subtle behavioral change can be in a small part of a big diff. Read the diff. Do not just read the final file.
The technical debt that accumulates from AI-generated code that was not reviewed properly is particularly insidious. It looks like clean code. It behaves mostly correctly. And it hides architectural problems that compound over months.
Building the Testing Habit Into Your Workflow
The testing process I have described is not a one-time thing you do when you remember. It needs to be part of how you work with AI coding tools, not something you bolt on afterward.
Here is the workflow I have settled on after months of intensive AI-assisted development:
Before asking the AI to write code, write the test specification. Not the tests themselves, but a list of what behaviors the tests will need to verify. This forces you to think about the contract before you see the implementation.
While the AI is writing the code, write the edge case tests. You know the problem. You know where the boundaries are. Write those tests before you read what the AI produced.
When reviewing the AI output, run the tests you wrote first. Do not start by reading the code. See which tests pass and which fail. Then read the code to understand why the failing tests failed.
Before merging, run static analysis, your unit tests, and at least the integration tests that cover the new code. Not as a formality, as an actual gate.
This sounds like more process than it is. Once it becomes habit, the overhead is maybe 20% additional time compared to just accepting AI output. The time saved from not debugging production bugs more than pays for that overhead.
The AI Evals Connection
There is a related skill that goes beyond testing individual functions: evaluating AI behavior at the system level. If you have built any AI-powered features into your product, you need a way to measure whether those features are actually working correctly across a range of inputs.
I wrote about AI evals for solo developers in more depth, but the principle is the same: systematic verification beats eyeballing outputs. The same instinct that makes you write unit tests for your application code should make you want structured evaluation for any AI behavior you are shipping to users.
The security risks specific to AI-generated code are also worth understanding as a separate category. Not all security issues will be caught by the testing approaches I described here. Prompt injection, over-permissioned tool access, and data leakage through AI context are security categories that need their own review practices.
What to Do Right Now
If you are using AI coding tools heavily and do not have a real testing process in place, start here.
First: add strict: true and noUncheckedIndexedAccess: true to your TypeScript config. Run the build. Fix the errors. These are bugs, not style choices.
Second: pick the three most important functions that AI has written in your codebase in the last month. Write explicit error propagation tests for each one. If any of them fail, you now know you have a bug in production.
Third: add fast-check to your test dependencies and write one property-based test for the most complex data transformation in your codebase. Run it and see what happens.
The goal is not 100% coverage or a comprehensive testing strategy document. The goal is to stop trusting AI code purely on visual inspection and start having enough automated verification that you can be confident the code does what you need it to do.
Fast AI code and reliable AI code are not the same thing. A real testing process is what bridges the gap.