DEV Community

zhongqiyue

How I Got AI to Actually Write Useful Unit Tests (Without Hallucinating)

Here’s a confession: I spent three days writing unit tests for a legacy codebase, and I hated every second of it. So when ChatGPT started writing passable code, I thought, "Great, let’s dump this chore on the AI."

I quickly learned that asking an LLM to "write tests for this module" is a recipe for beautiful nonsense – tests that pass against nothing, test for things that don’t exist, or just skip the actual logic entirely. After a lot of trial and error, I found a workflow that actually gives me useful, accurate unit tests. Not perfect – just useful enough to save me hours.

This isn't about a specific AI tool (though I'll mention one I used). It's about the approach: what to feed the AI, how to verify its output, and when to throw it away.

The Problem: Hallucinated Tests

I was working on a Python API that transforms CSV data. The code had a function like this:

def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

I asked an LLM to "write comprehensive unit tests for parse_row". It gave me:

def test_parse_row_basic():
    row = {'name': 'Alice', 'age': '30 '}
    mapping = {'name': 'full_name', 'age': 'age'}
    expected = {'full_name': 'Alice', 'age': '30'}
    assert parse_row(row, mapping) == expected

That looks reasonable. But then it also generated:

def test_parse_row_with_default():
    row = {}
    mapping = {'name': 'full_name'}
    expected = {'full_name': 'default'}  # imaginary default
    assert parse_row(row, mapping) == expected

There's no default logic in my function. The AI invented it. Multiply that by 30 test cases and you’ve got a test suite that gives false confidence.
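A quick way to see the problem is to run the invented case against the real function. The sketch below repeats parse_row from above and compares the AI's expected value with what the function actually returns; the variable names are only illustrative.

```python
def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

# The case the AI invented: it expects a default value for a missing field.
hallucinated_expected = {'full_name': 'default'}
actual = parse_row({}, {'name': 'full_name'})

print(actual)                           # {}
print(actual == hallucinated_expected)  # False
```

The mismatch only shows up when the case is executed, which is the idea the rest of this post builds on.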

What I Tried (That Didn't Work)

  1. Better prompts – "only test what exists, don't add behavior". Still got hallucinations.
  2. One-shot with examples – gave a full manual test as example. Better, but still missed edge cases.
  3. Chain-of-thought prompting – "think step by step". Result: long, useless commentary and still wrong tests.
  4. Fine-tuning – too expensive for a one-off project. And you still need curated training data.

What Finally Worked: The Iterative, Verifiable Approach

The key insight: LLMs are great at generating code in a constrained environment where you can verify correctness automatically.

So instead of asking for tests directly, I:

  1. Extract function signatures and docstrings (including type hints).
  2. Feed the AI one function at a time – keep the context narrow.
  3. Generate test stubs with a specific schema (given a function, produce a list of test cases with input/output pairs).
  4. Validate the generated tests by running them against the real function. If a test fails, I treat it as hallucinated: I log it and discard it.

Here’s a simplified version of the script that does this:

import inspect
import json
from openai import OpenAI  # or any other LLM

client = OpenAI(api_key="sk-...")  # in real life, use env var

def generate_test_cases(func_source: str) -> list:
    """Ask LLM to produce test cases as JSON objects."""
    prompt = f"""Given this Python function, produce a JSON array of test cases. Each test case is an object with "args" (list), "expected" (value), and "description" (string). Only produce tests that directly verify the function's behavior as written. Do NOT add any behavior not in the code.


{func_source}


Return only valid JSON, no markdown."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):  # invalid JSON or an empty (None) reply
        print("Failed to parse JSON, skipping")
        return []

def validate_and_run(func, test_cases):
    """Run each test case. Return only those that pass."""
    good = []
    for tc in test_cases:
        try:
            result = func(*tc["args"])
            if result == tc["expected"]:
                good.append(tc)
            else:
                print(f"FAIL (mismatch): {tc['description']}")
        except Exception as e:
            print(f"FAIL (error): {tc['description']} - {e}")
    return good

def generate_tests_for_function(func):
    source = inspect.getsource(func)
    candidates = generate_test_cases(source)
    passed = validate_and_run(func, candidates)
    # Now produce pytest-style tests from the validated cases
    test_code = []
    for i, tc in enumerate(passed):
        args_repr = ', '.join(repr(a) for a in tc['args'])
        expected_repr = repr(tc['expected'])
        test_code.append(f"def test_case_{i}():\n    assert {func.__name__}({args_repr}) == {expected_repr}")
    return '\n\n'.join(test_code)

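Crucially, I run the generated test cases against the actual function. If the AI invented a default value, the test fails at validation and is thrown out. The tests that remain are guaranteed to pass against the current implementation, so they describe real behavior rather than imagined behavior. The flip side is that they pin down whatever the code does today, bugs included, which is why I log the discarded cases instead of silently dropping them.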
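One fragile spot in a script like this is the JSON step: models sometimes wrap the array in a markdown fence or return objects that don't match the schema. Here's a minimal sketch of a stricter parser, assuming the same three keys as the prompt above. parse_candidates, REQUIRED_KEYS and FENCE are hypothetical names, not part of my original script.

```python
import json

REQUIRED_KEYS = {"args", "expected", "description"}
FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the source tidy

def parse_candidates(text: str) -> list:
    """Parse an LLM reply into well-formed test-case dicts, dropping the rest."""
    cleaned = text.strip()
    # Strip a surrounding markdown fence if the model added one anyway.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only objects that carry every required key and a list of args.
    return [
        tc for tc in data
        if isinstance(tc, dict)
        and REQUIRED_KEYS.issubset(tc)
        and isinstance(tc["args"], list)
    ]

reply = '[{"args": [{"a": " x "}, {"a": "b"}], "expected": {"b": "x"}, "description": "strips"}, {"oops": 1}]'
print(len(parse_candidates(reply)))  # 1
```

Malformed entries are dropped before validation, so a single bad object doesn't cost you the whole batch.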

The Trade-offs

  • No edge-case discovery – The AI only tests what it sees. It won't find weird inputs you didn't think of. You still need human analysis for boundary values, nulls, etc.
  • Brittle to complex logic – For deeply nested functions with many branches, the AI often produces only the happy path. My validation loop only catches failures, not missing coverage. I need to review coverage reports afterward.
  • Requires good type hints – Without them, the AI guesses argument types and frequently hallucinates incompatible inputs.
  • Cost – Each function I process costs a few cents in API calls. For a large project, that adds up (but it's still cheaper than my salary for the same work).
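To make the missing-coverage point concrete, here's a rough stand-in for a real coverage tool, using only the standard library's sys.settrace. It records which lines of one function ran, which is line-level rather than branch-level, so treat it as an illustration and not a replacement for coverage.py. executed_offsets is a hypothetical helper.

```python
import sys

def executed_offsets(func, calls):
    """Return line offsets (relative to the def line) that ran during the calls."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # don't trace other functions
        if event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in calls:
            func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit


def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

# A suite that only passes an empty row never reaches the assignment (offset 4).
only_missing = executed_offsets(parse_row, [({}, {"name": "full_name"})])
print(4 in only_missing)  # False
```

Every case in such a suite would pass validation, yet the one line that does the actual work is never exercised.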

When NOT to Use This

  • If your codebase has zero tests and you need to ship yesterday, this workflow will slow you down. Just write the critical tests by hand.
  • If the function touches I/O (network, files) – the AI will generate tests that assume mocking, which I haven't automated here.
  • If your functions are huge (more than 30 lines) – break them down first. The AI loses context with larger chunks.
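The size check is easy to automate before spending any API calls. A minimal sketch with the standard ast module, assuming Python 3.8+ for end_lineno; oversized_functions is a hypothetical helper and the 30-line threshold is just my rule of thumb from above.

```python
import ast

def oversized_functions(source: str, max_lines: int = 30) -> list:
    """Return (name, line_count) for each function longer than max_lines."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1  # def line through last body line
            if length > max_lines:
                flagged.append((node.name, length))
    return flagged

sample = "def small():\n    return 1\n\ndef big():\n" + "    x = 1\n" * 40
print(oversized_functions(sample))  # [('big', 41)]
```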

What I’d Do Differently Next Time

  • Add coverage analysis – Run coverage.py on the generated tests and report what branches are missed. Feed that back to the AI for a second pass.
  • Use a tree-sitter AST parser to extract function signatures more reliably than inspect.getsource (especially for class methods).
  • Parallelise validation – Running tests one by one is slow. I’d batch them and use subprocess to run a temporary pytest suite.
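For the signature-extraction idea, I haven't built the tree-sitter version. As a rough stand-in, the sketch below uses Python's built-in ast module instead (ast.unparse needs Python 3.9+), which already handles class methods without importing the module. extract_signatures is a hypothetical helper.

```python
import ast

def extract_signatures(source: str) -> list:
    """Collect name, signature text and docstring for every function and method."""
    found = []

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                visit(child, prefix + child.name + ".")
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                returns = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                found.append({
                    "name": prefix + child.name,
                    "signature": f"({ast.unparse(child.args)}){returns}",
                    "docstring": ast.get_docstring(child),
                })
                visit(child, prefix + child.name + ".")  # pick up nested functions

    visit(ast.parse(source))
    return found

sample = '''
class Parser:
    def parse_row(self, row: dict, mapping: dict) -> dict:
        """Map source fields to target fields."""
        return {}
'''
for sig in extract_signatures(sample):
    print(sig["name"], sig["signature"])
# Parser.parse_row (self, row: dict, mapping: dict) -> dict
```

Feeding the model only the name, signature and docstring keeps the context narrow, at the cost of hiding the body it would otherwise read.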

It’s Not Magic, But It’s Useful

I now use this approach for any new module where I need 80% coverage fast. The AI writes the boring mapping tests, and I only have to think about the tricky edge cases. It still feels like cheating – but in a good way.

Has anyone else found a reliable pattern for getting LLMs to generate trustworthy tests? What does your setup look like?

Top comments (1)

WizSebastian

The validate-and-discard loop is smart. I've been bitten by hallucinated tests more times than I'd like to admit; they're worse than no tests because they give false confidence. Going to steal this pattern, thanks for sharing!!