I spent hours writing unit tests – so I made an LLM do it (and learned what not to do)

#ai #testing #python #productivity

About a month ago I hit that point in a project where the business logic was solid, the API endpoints were clean, but the test file was a pathetic stub. I had 30+ similar validation functions – each one a slight variation on “does this field exist?”, “is it the right type?”, “does it pass this custom rule?”. The manual approach would mean copying the same assert pattern dozens of times, changing only the function name and the test input. My brain started melting just thinking about it.

I’m a big believer in testing, but I’m also a big believer in not doing boring work twice. So I started looking for ways to automate test generation.

What I tried first (and why it sucked)

My first instinct was to write a Python generator that parsed the function signatures and spat out basic asserts. Something like:

def generate_test(func_name, params):
    lines = [f"def test_{func_name}():"]
    for p in params:
        lines.append(f"    assert {func_name}({p}) is not None")
    return "\n".join(lines)

This worked only for the most trivial cases. As soon as the functions had side effects, required fixtures, or needed specific edge-case values, the template became a nightmare of conditionals. Plus, what about the negative tests – the inputs that should raise errors? My generator didn’t know anything about the domain logic.

Next I tried a rule‑based approach with regular expressions. I wrote about 200 lines of heuristics to infer parameter types from docstrings. It sort of worked for one function, then broke completely on the next. I felt like I was rebuilding a tiny compiler for a language nobody uses.

The approach that actually worked

I had a hunch that an LLM could do better if I gave it the right context. The idea was simple: feed the function source (plus docstring) into a language model, ask it to produce pytest test functions, and then validate the output before writing it to a file.

Here’s the core loop I ended up with:

import ast

import requests  # Third-party: pip install requests

# For demo purposes – replace with your own endpoint
BASE_URL = "https://ai.interwestinfo.com/v1"  # Example: LLM API
API_KEY = "your-key"

def generate_tests(source_code: str, max_retries=2):
    prompt = f"""You are an expert Python tester. Given the function below, write comprehensive pytest test functions covering:
- Normal cases
- Edge cases (empty, None, large values)
- Error cases (wrong types, out-of-range)

Do NOT use external libraries beyond pytest. Return ONLY valid Python code (no explanations).

Function:

python
{source_code}

"""

    for attempt in range(max_retries):
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "gpt-4o-mini",  # Or whatever model you prefer
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0.3,
            },
            timeout=60,  # Don't hang forever on a stalled connection
        )
        response.raise_for_status()
        content = response.json()["choices"][0]["message"]["content"]

        # Validate that the output is parseable Python
        try:
            ast.parse(content)
            return content
        except SyntaxError:
            # Out of retries: surface the syntax error to the caller
            if attempt == max_retries - 1:
                raise
            continue
    # Only reachable when max_retries < 1, so no request was ever made
    raise ValueError("max_retries must be at least 1")


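The validation step is crucial. LLMs love adding markdown fences, stray explanations, or incomplete brackets, and all of those make ast.parse raise a SyntaxError, so I catch them before I write bad code to my test file. Two caveats: a fenced response is rejected and retried here, not cleaned up, and a successful parse only proves the code is syntactically valid, not that the tests are right.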
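One way to avoid burning a retry on a response that is merely wrapped in a fence is to strip the fence before parsing. This is a minimal sketch, assuming the whole response is a single fenced block; strip_code_fences and parse_or_none are hypothetical helper names, not part of the loop above.

```python
import ast
import re

# Matches a response that is entirely wrapped in one markdown fence, with an
# optional language tag such as "python" after the opening backticks.
_FENCE_RE = re.compile(r"^\s*```[\w+-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fences(text: str) -> str:
    """Return the fenced body if `text` is one fenced block, else `text` unchanged."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text


def parse_or_none(text: str):
    """Return the cleaned code if it parses as Python, otherwise None."""
    code = strip_code_fences(text)
    try:
        ast.parse(code)
    except SyntaxError:
        return None
    return code
```

Responses with prose before or after the fence are deliberately left alone, so they still fail the parse and trigger a retry. I would rather re-ask the model than guess which part of a mixed response is the code.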

Results (and surprises)

I pointed this at a validate_email function with three lines of logic. The LLM returned:

import pytest
from validation import validate_email

def test_valid_email():
    assert validate_email("user@example.com") is True

def test_valid_email_with_plus():
    assert validate_email("user+tag@example.com") is True

def test_no_at_symbol():
    assert validate_email("userexample.com") is False

def test_empty_string():
    assert validate_email("") is False

def test_none_input():
    with pytest.raises(TypeError):
        validate_email(None)

Not bad – it even guessed I wanted a TypeError for None (which my function did raise). I ran the tests and they passed. Success.

But it wasn’t all roses. For a complex function that involved a database query, the LLM generated tests that mocked things incorrectly. It assumed the function would call db.fetch() when in reality it used an async ORM. The generated tests were syntactically valid but semantically wrong.
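For contrast, here is roughly what a correct hand-written test for an awaited dependency can look like using only the standard library. Everything named here (get_display_name, repo, repo.get) is a hypothetical stand-in, since the real function and ORM are not shown in this post. The point is only that an awaited call needs an awaitable mock such as unittest.mock.AsyncMock, not a plain synchronous one.

```python
import asyncio
from unittest.mock import AsyncMock


async def get_display_name(repo, user_id: int) -> str:
    """Hypothetical function under test: awaits an async repository call."""
    user = await repo.get(user_id)
    if user is None:
        raise LookupError(f"user {user_id} not found")
    return user["name"].title()


def test_get_display_name_found():
    repo = AsyncMock()  # Calling repo.get(...) returns something awaitable
    repo.get.return_value = {"name": "ada lovelace"}
    assert asyncio.run(get_display_name(repo, 1)) == "Ada Lovelace"
    repo.get.assert_awaited_once_with(1)


def test_get_display_name_missing():
    repo = AsyncMock()
    repo.get.return_value = None
    try:
        asyncio.run(get_display_name(repo, 2))
    except LookupError:
        pass
    else:
        raise AssertionError("expected LookupError")
```

In a real pytest suite the second test would use pytest.raises; the try/except form just keeps the sketch free of third-party imports.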

Lessons learned

Use LLMs for boilerplate, not for domain-specific logic. If your function requires deep knowledge of your database schema or business rules, the generated tests will be too generic. You’re better off hand‑writing those or providing a schema context in the prompt.
Prompt engineering matters more than the model. Adding "Do NOT include imports that don't exist in your project." and "Use pytest.raises for exceptions." dramatically improved the output quality.
Always validate the output. I parse the response with ast.parse and also run a quick pytest --collect-only on the generated file to catch any syntax or import errors before the full test run.
Temperature 0.2 to 0.4 was the sweet spot in my runs (the loop above uses 0.3). Too high and it invents random test cases; too low and it repeats the same pattern ad nauseam.

When NOT to do this

If your test suite requires precise mocking of external services (e.g., AWS, payment gateways). The LLM will hallucinate the mock calls.
If you’re testing performance or concurrency bugs. The model doesn’t understand timings or race conditions.
If your team values extremely consistent naming conventions. The LLM may name tests differently each time.

For my validation functions, this approach saved about 20 minutes per function. Over 30 functions, that’s 10 hours I got back. The generated tests aren’t perfect – I still review every file – but they catch the obvious stuff, which is where many bugs hide.

What I’d change next time

I’d write a small CLI tool that takes a list of function names (or reads a module) and generates a test file for each, then opens a diff viewer so I can accept/reject chunks. That’s the next weekend project.
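I haven't built that tool yet, but the two non-LLM pieces are small enough to sketch with the standard library: pulling each function's source out of a module, and producing a diff to review. The function names and the file path in the diff headers are placeholders of mine, not an existing tool.

```python
import ast
import difflib


def extract_function_sources(module_source: str) -> dict[str, str]:
    """Map each top-level function name to its source text."""
    tree = ast.parse(module_source)
    return {
        node.name: ast.get_source_segment(module_source, node)
        for node in tree.body  # Top level only: methods inside classes are skipped
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def diff_against_existing(existing: str, generated: str, path: str) -> str:
    """Unified diff between the current test file and the generated one."""
    return "".join(
        difflib.unified_diff(
            existing.splitlines(keepends=True),
            generated.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
```

Each extracted source string would go through the generation loop, and the resulting diff could be piped into whatever viewer you already use for accepting or rejecting hunks.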

Now I’m curious: How do you handle the boring parts of testing? Do you use any code generation, or do you just accept the grind? Let me know in the comments – I’d love to steal your ideas.