Here’s a confession: I spent three days writing unit tests for a legacy codebase, and I hated every second of it. So when ChatGPT started writing passable code, I thought, "Great, let’s dump this chore on the AI."
I quickly learned that asking an LLM to "write tests for this module" is a recipe for beautiful nonsense – tests that pass against nothing, test for things that don’t exist, or just skip the actual logic entirely. After a lot of trial and error, I found a workflow that actually gives me useful, accurate unit tests. Not perfect – just useful enough to save me hours.
This isn't about a specific AI tool (though I'll mention one I used). It's about the approach: what to feed the AI, how to verify its output, and when to throw it away.
The Problem: Hallucinated Tests
I was working on a Python API that transforms CSV data. The code had a function like this:
def parse_row(row: dict, mapping: dict) -> dict:
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result
I asked an LLM to "write comprehensive unit tests for parse_row". It gave me:
def test_parse_row_basic():
    row = {'name': 'Alice', 'age': '30 '}
    mapping = {'name': 'full_name', 'age': 'age'}
    expected = {'full_name': 'Alice', 'age': '30'}
    assert parse_row(row, mapping) == expected
That looks reasonable. But then it also generated:
def test_parse_row_with_default():
    row = {}
    mapping = {'name': 'full_name'}
    expected = {'full_name': 'default'}  # imaginary default
    assert parse_row(row, mapping) == expected
There's no default logic in my function. The AI invented it. Multiply that by 30 test cases and you’ve got a test suite that gives false confidence.
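To see the mismatch concretely, here is a quick check of what the function as written returns when the mapped field is missing. It is only a minimal sketch: parse_row is repeated from above so the snippet runs on its own, and the variable names are mine.

```python
def parse_row(row: dict, mapping: dict) -> dict:
    """Copy mapped fields from row into a new dict, stripping whitespace."""
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result


# A missing source field is simply skipped; nothing is filled in for it.
actual = parse_row({}, {"name": "full_name"})
print(actual)  # {}

# The hallucinated expectation therefore cannot match the real behavior.
hallucinated_expected = {"full_name": "default"}
print(actual == hallucinated_expected)  # False
```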
What I Tried (That Didn't Work)
- Better prompts – "only test what exists, don't add behavior". Still got hallucinations.
- One-shot with examples – gave a full manual test as example. Better, but still missed edge cases.
- Chain-of-thought prompting – "think step by step". Result: long, useless commentary and still wrong tests.
- Fine-tuning – too expensive for a one-off project. And you still need curated training data.
What Finally Worked: The Iterative, Verifiable Approach
The key insight: LLMs are great at generating code in a constrained environment where you can verify correctness automatically.
So instead of asking for tests directly, I:
- Extract function signatures and docstrings (including type hints).
- Feed the AI one function at a time – keep the context narrow.
- Generate test stubs with a specific schema (given a function, produce a list of test cases with input/output pairs).
- Validate the generated tests by running them against the real function. If any test fails, I know it’s hallucinated – I log it and discard it.
Here’s a simplified version of the script that does this:
import inspect
import json
import os

from openai import OpenAI  # or any other LLM client

# Read the key from the environment rather than hard-coding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def generate_test_cases(func_source: str) -> list:
    """Ask the LLM to produce test cases as a JSON array."""
    prompt = f"""Given this Python function, produce a JSON array of test cases. Each test case is an object with "args" (list), "expected" (value), and "description" (string). Only produce tests that directly verify the function's behavior as written. Do NOT add any behavior not in the code.

{func_source}

Return only valid JSON, no markdown."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # Malformed or missing output: treat it as "no usable cases".
        print("Failed to parse JSON, skipping")
        return []
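def validate_and_run(func, test_cases):
    """Run each test case. Return only those that pass."""
    good = []
    for tc in test_cases:
        # Fall back gracefully if the LLM omitted the description.
        description = tc.get("description", "<no description>")
        try:
            result = func(*tc["args"])
            if result == tc["expected"]:
                good.append(tc)
            else:
                print(f"FAIL (mismatch): {description}")
        except Exception as e:
            # Covers crashes in func as well as malformed cases (e.g. missing "args").
            print(f"FAIL (error): {description} - {e}")
    return good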
def generate_tests_for_function(func):
    source = inspect.getsource(func)
    candidates = generate_test_cases(source)
    passed = validate_and_run(func, candidates)
    # Now produce pytest-style tests from the validated cases
    test_code = []
    for i, tc in enumerate(passed):
        args_repr = ', '.join(repr(a) for a in tc['args'])
        expected_repr = repr(tc['expected'])
        test_code.append(f"def test_case_{i}():\n    assert {func.__name__}({args_repr}) == {expected_repr}")
    return '\n\n'.join(test_code)
Crucially, I run the generated test cases against the actual function. If the AI invented a default value, the test fails at validation and is thrown out. The tests that remain are guaranteed to pass against the current code, so they describe what the function really does today (bugs included) rather than what the AI imagined it should do.
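The discard step is easy to try without calling any API. The sketch below is a cut-down stand-in for validate_and_run, fed hand-written candidates in the same "args" / "expected" / "description" shape. The helper name keep_passing and the candidate data are hypothetical, made up for illustration.

```python
def parse_row(row: dict, mapping: dict) -> dict:
    """Copy mapped fields from row into a new dict, stripping whitespace."""
    result = {}
    for source_field, target_field in mapping.items():
        if source_field in row:
            result[target_field] = row[source_field].strip()
    return result


def keep_passing(func, test_cases):
    """Return only the cases whose expected value matches the real output."""
    kept = []
    for tc in test_cases:
        try:
            if func(*tc["args"]) == tc["expected"]:
                kept.append(tc)
        except Exception:
            # A case that makes the function raise is discarded as well.
            pass
    return kept


# Hand-written stand-ins for what an LLM might return (hypothetical data).
candidates = [
    {"args": [{"name": " Alice "}, {"name": "full_name"}],
     "expected": {"full_name": "Alice"},
     "description": "strips whitespace and renames the field"},
    {"args": [{}, {"name": "full_name"}],
     "expected": {"full_name": "default"},
     "description": "invented default value"},
    {"args": [{"age": 30}, {"age": "age"}],
     "expected": {"age": "30"},
     "description": "non-string value (int has no .strip)"},
]

kept = keep_passing(parse_row, candidates)
print([tc["description"] for tc in kept])
# ['strips whitespace and renames the field']
```

The second candidate is dropped because the output does not match, and the third because parse_row raises on an int. Only the first would be turned into a pytest function.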
The Trade-offs
- No edge-case discovery – The AI only tests what it sees. It won't find weird inputs you didn't think of. You still need human analysis for boundary values, nulls, etc.
- Brittle to complex logic – For deeply nested functions with many branches, the AI often produces only the happy path. My validation loop only catches failures, not missing coverage. I need to review coverage reports afterward.
- Requires good type hints – Without them, the AI guesses argument types and frequently hallucinates incompatible inputs.
- Cost – Each LLM request (one per function under test) costs a few cents. For a large project, that adds up (but still cheaper than my salary for the same work).
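Since the type-hint requirement is the easiest of these to check mechanically, one option is to flag under-annotated functions before spending an API call on them. This is a rough sketch using the standard-library inspect module. The helper missing_annotations and the two sample functions are hypothetical and not part of the original script.

```python
import inspect


def missing_annotations(func) -> list:
    """Return the names of parameters (and 'return') that lack a type hint."""
    sig = inspect.signature(func)
    missing = [
        name
        for name, param in sig.parameters.items()
        if param.annotation is inspect.Parameter.empty
    ]
    if sig.return_annotation is inspect.Signature.empty:
        missing.append("return")
    return missing


def typed(row: dict, mapping: dict) -> dict:
    return {}


def untyped(row, mapping: dict):
    return {}


print(missing_annotations(typed))    # []
print(missing_annotations(untyped))  # ['row', 'return']
```

A non-empty result could be used to skip the function or to prompt you to add hints first.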
When NOT to Use This
- If your codebase has zero tests and you need to ship yesterday, this workflow will slow you down. Just write the critical tests by hand.
- If the function touches I/O (network, files) – the AI will generate tests that assume mocking, which I haven't automated here.
- If your functions are huge (more than 30 lines) – break them down first. The AI loses context with larger chunks.
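The size rule in the last bullet can be applied automatically before anything is sent to the LLM. Here is a small sketch that scans a module's source text with the standard-library ast module and lists functions over the threshold. The name oversized_functions is hypothetical, and the 30-line limit is simply the rule of thumb from the list above.

```python
import ast

MAX_LINES = 30  # rule-of-thumb threshold from the list above


def oversized_functions(module_source: str, max_lines: int = MAX_LINES) -> list:
    """Return (name, line_count) for each function longer than max_lines."""
    tree = ast.parse(module_source)
    too_long = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count from the def line to the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                too_long.append((node.name, length))
    return too_long


# Build a tiny module in memory: one short function, one 40-line function.
short_fn = "def short():\n    return 1\n"
long_body = "".join(f"    x{i} = {i}\n" for i in range(39))
long_fn = "def long_one():\n" + long_body

print(oversized_functions(short_fn + "\n" + long_fn))
# [('long_one', 40)]
```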
What I’d Do Differently Next Time
- Add coverage analysis – Run coverage.py on the generated tests and report what branches are missed. Feed that back to the AI for a second pass.
- Use a tree-sitter AST parser to extract function signatures more reliably than inspect.getsource (especially for class methods).
- Parallelise validation – Running tests one by one is slow. I’d batch them and use subprocess to run a temporary pytest suite.
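For the signature-extraction idea, I haven't tried tree-sitter yet, so as a stand-in here is a sketch using Python's built-in ast module instead (a swapped-in technique, not tree-sitter). It walks a module's source text and collects signatures and docstrings, including methods inside classes. The function name extract_signatures and the sample module are hypothetical, and ast.unparse needs Python 3.9 or newer.

```python
import ast


def extract_signatures(module_source: str) -> list:
    """Collect qualified signatures and docstrings for functions and methods."""
    found = []

    def visit(node, prefix=""):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Descend into the class body, qualifying method names.
                visit(child, prefix + child.name + ".")
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                signature = f"{prefix}{child.name}({ast.unparse(child.args)})"
                if child.returns is not None:
                    signature += f" -> {ast.unparse(child.returns)}"
                found.append(
                    {"signature": signature, "doc": ast.get_docstring(child)}
                )

    visit(ast.parse(module_source))
    return found


sample = '''
def parse_row(row: dict, mapping: dict) -> dict:
    """Map source fields to target fields."""
    return {}

class Importer:
    def clean(self, row: dict) -> dict:
        return row
'''

for item in extract_signatures(sample):
    print(item["signature"], "|", item["doc"])
```

Because it works on source text rather than live objects, this also covers code that cannot be imported safely. Nested functions are deliberately skipped here.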
It’s Not Magic, But It’s Useful
I now use this approach for any new module where I need 80% coverage fast. The AI writes the boring mapping tests, and I only have to think about the tricky edge cases. It still feels like cheating – but in a good way.
Has anyone else found a reliable pattern for getting LLMs to generate trustworthy tests? What's your setup look like?
Top comments (1)
The validate-and-discard loop is smart. I've been bitten by hallucinated tests more times than I'd like to admit. They're worse than no tests because they give false confidence. Going to steal this pattern, thanks for sharing!