zhongqiyue

Posted on Jun 26

How I Got AI to Actually Write Useful Unit Tests (Without Hallucinating)

#ai #python #testing #tutorial

Here’s a confession: I spent three days writing unit tests for a legacy codebase, and I hated every second of it. So when ChatGPT started writing passable code, I thought, "Great, let’s dump this chore on the AI."

I quickly learned that asking an LLM to "write tests for this module" is a recipe for beautiful nonsense – tests that pass against nothing, test for things that don’t exist, or just skip the actual logic entirely. After a lot of trial and error, I found a workflow that actually gives me useful, accurate unit tests. Not perfect – just useful enough to save me hours.

This isn't about a specific AI tool (though I'll mention one I used). It's about the approach: what to feed the AI, how to verify its output, and when to throw it away.

The Problem: Hallucinated Tests

I was working on a Python API that transforms CSV data. The code had a function like this:

def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

I asked an LLM to "write comprehensive unit tests for parse_row". It gave me:

def test_parse_row_basic():
    row = {'name': 'Alice', 'age': '30 '}
    mapping = {'name': 'full_name', 'age': 'age'}
    expected = {'full_name': 'Alice', 'age': '30'}
    assert parse_row(row, mapping) == expected

That looks reasonable. But then it also generated:

def test_parse_row_with_default():
    row = {}
    mapping = {'name': 'full_name'}
    expected = {'full_name': 'default'}  # imaginary default
    assert parse_row(row, mapping) == expected

There's no default logic in my function. The AI invented it. Multiply that by 30 test cases and you’ve got a test suite that gives false confidence.
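To make the mismatch concrete, here is a minimal check that runs the invented expectation against the function as written. `parse_row` is copied from above; the variable names are only for illustration.

```python
def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

# The real function simply skips fields that are missing from the row.
actual = parse_row({}, {'name': 'full_name'})
print(actual)  # {}

# The AI's expectation assumes a default that the code never sets.
hallucinated_expected = {'full_name': 'default'}
print(actual == hallucinated_expected)  # False
```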

What I Tried (That Didn't Work)

Better prompts – "only test what exists, don't add behavior". Still got hallucinations.
One-shot with examples – gave a full manual test as example. Better, but still missed edge cases.
Chain-of-thought prompting – "think step by step". Result: long, useless commentary and still wrong tests.
Fine-tuning – too expensive for a one-off project. And you still need curated training data.

What Finally Worked: The Iterative, Verifiable Approach

The key insight: LLMs are great at generating code in a constrained environment where you can verify correctness automatically.

So instead of asking for tests directly, I:

Extract function signatures and docstrings (including type hints).
Feed the AI one function at a time – keep the context narrow.
Generate test stubs with a specific schema (given a function, produce a list of test cases with input/output pairs).
Validate the generated tests by running them against the real function. If a test fails, its expected output doesn't match what the code actually does, so I treat it as hallucinated: I log it and discard it.
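The first step (signatures and docstrings) could look something like this with the standard library's `inspect` module. This is only a sketch: `describe_function` is a hypothetical helper name, and the docstring on `parse_row` is added purely for illustration.

```python
import inspect

def parse_row(row: dict, mapping: dict) -> dict:
    """Rename the fields of row according to mapping and strip whitespace."""
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

def describe_function(func) -> str:
    """Return the signature (with type hints) and docstring as prompt-ready text."""
    signature = f"def {func.__name__}{inspect.signature(func)}:"
    doc = inspect.getdoc(func) or "(no docstring)"
    return signature + '\n    """' + doc + '"""'

print(describe_function(parse_row))
```

The output is the `def` line with its type hints followed by the docstring, which keeps the context narrow when that is all you want the model to see.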

Here’s a simplified version of the script that does this:

import inspect
import json
import os
from openai import OpenAI  # or any other LLM

# Read the key from the environment instead of hardcoding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_test_cases(func_source: str) -> list:
    """Ask LLM to produce test cases as JSON objects."""
    prompt = f"""Given this Python function, produce a JSON array of test cases. Each test case is an object with "args" (list), "expected" (value), and "description" (string). Only produce tests that directly verify the function's behavior as written. Do NOT add any behavior not in the code.

{func_source}


Return only valid JSON, no markdown."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or no content at all (text is None)
        print("Failed to parse JSON, skipping")
        return []

def validate_and_run(func, test_cases):
    """Run each test case. Return only those that pass."""
    good = []
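    for tc in test_cases:
        # Fall back to a placeholder so a missing "description" key can't crash the loop
        desc = tc.get("description", "<no description>")
        try:
            result = func(*tc["args"])
            if result == tc["expected"]:
                good.append(tc)
            else:
                print(f"FAIL (mismatch): {desc}")
        except Exception as e:
            # Covers errors raised by func as well as missing "args"/"expected" keys
            print(f"FAIL (error): {desc} - {e}")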
    return good

def generate_tests_for_function(func):
    source = inspect.getsource(func)
    candidates = generate_test_cases(source)
    passed = validate_and_run(func, candidates)
    # Now produce pytest-style tests from the validated cases.
    # Note: the generated file still needs an import for the function under test.
    test_code = []
    for i, tc in enumerate(passed):
        # repr() turns the JSON-decoded values back into Python literals
        args_repr = ', '.join(repr(a) for a in tc['args'])
        expected_repr = repr(tc['expected'])
        test_code.append(f"def test_case_{i}():\n    assert {func.__name__}({args_repr}) == {expected_repr}")
    return '\n\n'.join(test_code)

Crucially, I run the generated test cases against the actual function. If the AI invented a default value, the test fails at validation and is thrown out. The tests that remain are guaranteed to pass against the current implementation, and every assertion reflects what the code actually does rather than what the AI imagined.
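The filtering step can be tried offline, without an API call. In the sketch below, `candidates` is a hand-written stand-in for what the model might return (the three cases are made up for illustration, not real model output), and `keep_passing` is a compact variant of `validate_and_run`.

```python
def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result

# Hand-written stand-in for LLM output; illustrative only.
candidates = [
    {"args": [{"name": " Alice "}, {"name": "full_name"}],
     "expected": {"full_name": "Alice"},
     "description": "strips whitespace and renames the field"},
    {"args": [{}, {"name": "full_name"}],
     "expected": {"full_name": "default"},
     "description": "invented default for a missing field"},
    {"args": [{"name": None}, {"name": "full_name"}],
     "expected": {"full_name": None},
     "description": "assumes None values pass through"},
]

def keep_passing(func, test_cases):
    """Keep only the cases whose expected value matches the real return value."""
    kept = []
    for tc in test_cases:
        try:
            if func(*tc["args"]) == tc["expected"]:
                kept.append(tc)
        except Exception:
            # The call itself raised (here: None has no .strip()), so discard the case.
            pass
    return kept

survivors = keep_passing(parse_row, candidates)
print([tc["description"] for tc in survivors])
# ['strips whitespace and renames the field']
```

Only the first case survives: the second fails on a mismatch, and the third is dropped because the call raises.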

The Trade-offs

No edge-case discovery – The AI only tests what it sees. It won't find weird inputs you didn't think of. You still need human analysis for boundary values, nulls, etc.
Brittle to complex logic – For deeply nested functions with many branches, the AI often produces only the happy path. My validation loop only catches failures, not missing coverage. I need to review coverage reports afterward.
Requires good type hints – Without them, the AI guesses argument types and frequently hallucinates incompatible inputs.
Cost – Each API call (one per function) costs a few cents. For a large project, that adds up (but it's still cheaper than my salary for the same work).
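For the type-hint trade-off, one cheap guard is to check annotations before spending an API call on a function. This is a sketch of one possible pre-check, not part of the original script; `missing_type_hints`, `typed`, and `untyped` are hypothetical names.

```python
import inspect

def missing_type_hints(func) -> list:
    """Return the names of parameters (plus 'return') that have no annotation."""
    sig = inspect.signature(func)
    missing = [name for name, param in sig.parameters.items()
               if param.annotation is inspect.Parameter.empty]
    if sig.return_annotation is inspect.Signature.empty:
        missing.append("return")
    return missing

def typed(row: dict, mapping: dict) -> dict:
    return {}

def untyped(row, mapping: dict):
    return {}

print(missing_type_hints(typed))    # []
print(missing_type_hints(untyped))  # ['row', 'return']
```

Functions that come back with a non-empty list are candidates for adding hints first, or for writing tests by hand.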

When NOT to Use This

If your codebase has zero tests and you need to ship yesterday, this workflow will slow you down. Just write the critical tests by hand.
If the function touches I/O (network, files) – the AI will generate tests that assume mocking, which I haven't automated here.
If your functions are huge (more than 30 lines) – break them down first. The AI loses context with larger chunks.
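The 30-line rule of thumb is easy to check mechanically. Here is a rough sketch using the standard `ast` module to flag oversized functions in a source string; `long_functions` is a hypothetical helper, and the threshold is just the number mentioned above.

```python
import ast

def long_functions(source: str, max_lines: int = 30) -> list:
    """Return (name, line_count) for every function longer than max_lines."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno is the 'def' line, end_lineno the last line of the body
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                found.append((node.name, length))
    return found

# Synthetic sample: one 2-line function and one 37-line function.
sample = ("def short():\n    return 1\n\n"
          "def long():\n" + "    x = 0\n" * 35 + "    return x\n")
print(long_functions(sample))  # [('long', 37)]
```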

What I’d Do Differently Next Time

Add coverage analysis – Run coverage.py on the generated tests and report which branches are missed. Feed that back to the AI for a second pass.
Use a tree-sitter AST parser to extract function signatures more reliably than inspect.getsource (especially for class methods).
Parallelise validation – Running tests one by one is slow. I’d batch them and use subprocess to run a temporary pytest suite.
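I haven't built the tree-sitter version, so as a stand-in here is a sketch of the same idea with Python's built-in `ast` module (a different tool from tree-sitter, swapped in because it needs no extra dependency). It lists top-level functions and class methods from a source string. `list_signatures` and the sample class are hypothetical, and `ast.unparse` needs Python 3.9 or newer.

```python
import ast

def list_signatures(source: str) -> list:
    """Return signature strings for top-level functions and class methods."""
    signatures = []

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Recurse into the class so methods get a "ClassName." prefix
                visit(child, prefix + child.name + ".")
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                ret = f" -> {ast.unparse(child.returns)}" if child.returns else ""
                args = ast.unparse(child.args)
                signatures.append(f"{prefix}{child.name}({args}){ret}")

    visit(ast.parse(source))
    return signatures

sample = '''
class RowParser:
    def parse(self, row: dict, mapping: dict) -> dict:
        return {}

def helper(x: int):
    return x
'''
print(list_signatures(sample))
# ['RowParser.parse(self, row: dict, mapping: dict) -> dict', 'helper(x: int)']
```

Because this works on source text rather than imported objects, it also avoids executing the module just to read its signatures.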

It’s Not Magic, But It’s Useful

I now use this approach for any new module where I need 80% coverage fast. The AI writes the boring mapping tests, and I only have to think about the tricky edge cases. It still feels like cheating – but in a good way.

Has anyone else found a reliable pattern for getting LLMs to generate trustworthy tests? What's your setup look like?

Top comments (3)

WizSebastian • Jun 26

The validate-and-discard loop is smart. I've been bitten by hallucinated tests more times than I'd like to admit; they're worse than no tests because they give false confidence. Going to steal this pattern, thanks for sharing!

Eva • Jun 26

Using the actual Python runtime to execute and instantly filter out hallucinated tests is a brilliant approach. Most people get stuck trying to fix hallucinations by writing longer, more convoluted prompts, but the LLM still ends up inventing arguments or default behaviors that don't exist. Setting up an automated feedback loop that directly checks the generated inputs against the live function is the only way to be 100% sure the test code isn't just hallucinated garbage. Definitely going to try building a lightweight script like this for my own workflows.

Viktor • Jun 30

The "beautiful nonsense" framing is dead on - tests that pass against nothing are worse than no tests, because they're green.

The failure mode I'd add is sneakier than hallucination: when you paste the implementation and say "write tests for this", the model reads what the code does and asserts exactly that. So it faithfully encodes the current behavior as correct, including the bug that's already in there. A test generated from the implementation is a change-detector, not a correctness check - it goes green on broken code and only screams when you fix it.

What fixed it for me was to stop feeding it the function body and instead feed the signature + a short spec of what it's supposed to do + a couple of real input/output examples. Now the test checks intent, and when impl and intent disagree that's the bug surfacing instead of getting blessed. Feeding the code is how you get tests that lock in your mistakes.