Alan West

Why AI-Generated Code Makes You Slower (And How to Fix Your Workflow)

You've probably felt this. The first week you wired an AI assistant into your editor, you shipped twice as much. By month three, you were back to your old pace — except now you were debugging weirder bugs.

I've been using AI assistants in my daily workflow for about two years across four projects. The pattern keeps showing up: the productivity gains are real but front-loaded, and they erode unless you change how you work. Most of that erosion comes from one specific, fixable problem.

The Problem: Plausible Code That Doesn't Actually Work

The bug I see most often isn't an obvious syntax error. It's generated code that references a function, method, or config option that looks exactly like something the library would have — but doesn't.

Last month I was building a CSV import feature and the assistant happily produced this:

import pandas as pd

# Read CSV with progress reporting — looks reasonable, right?
df = pd.read_csv(
    "users.csv",
    on_progress=lambda pct: print(f"Loading: {pct}%"),  # this kwarg does not exist
    chunksize=10_000,
)

on_progress is not a real parameter on pd.read_csv. The code was syntactically valid Python, my linter didn't complain, and the failure mode was... silent. The kwarg got swallowed and the import ran without any progress reporting. I only noticed because a user pinged me saying the loading bar wasn't moving.

This is the core issue. AI-generated code is plausible in a specific, dangerous way: it pattern-matches the shape of real APIs, which is exactly what makes it hard to spot in review.

Root Cause: How Hallucinations Slip Through

Three things conspire here:

  • Pattern-matching beats correctness. The model has seen thousands of pd.read_csv calls. It has also seen progress callbacks on other I/O functions. Stitching them together produces code that looks right without being right.
  • Type checkers often can't save you. Many libraries use **kwargs, dynamic dispatch, or duck typing. Static analysis won't flag a non-existent keyword argument that flows through **kwargs (see the sketch just after this list).
  • Reviewer fatigue. When the surrounding code is correct and the function name is real, your eyes glide over the made-up parameter. After 200 lines of mostly-good output, you stop reading carefully.
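
To make the **kwargs point concrete, here is a minimal sketch. The load_csv wrapper is hypothetical, not from any real codebase; it shows how a hallucinated kwarg can pass every static check and then vanish without an error:

# A hypothetical wrapper: because it accepts **kwargs, a non-existent
# option passes the type checker and is silently dropped at runtime.
from typing import Any
import pandas as pd

def load_csv(path: str, **kwargs: Any) -> pd.DataFrame:
    # Forward only the options this wrapper knows about; anything
    # else (like a hallucinated on_progress) is discarded here.
    known = {k: v for k, v in kwargs.items() if k in {"sep", "usecols"}}
    return pd.read_csv(path, **known)

# mypy and pyright both accept this call; on_progress simply vanishes.
df = load_csv("users.csv", on_progress=lambda pct: print(f"{pct}%"))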

The deeper issue is a workflow one. If you're prompting for a feature and pasting the result, you've outsourced generation but kept full responsibility for verification — and verification is harder on code you didn't write, because you don't have the mental model the author would have.

The Fix: Force Verification Into the Loop

Here's the workflow I switched to after enough of these bites. The core idea: don't accept code unless something other than your eyes has touched it.

Step 1: Generate the test first

Before generating the implementation, write (or generate) a test that exercises the specific behavior you want. This pins the behavior to something runnable.

# tests/test_import.py
from myapp.importer import load_users

def test_load_users_reports_progress():
    progress_log = []

    # The whole point of the feature: progress callbacks fire
    result = load_users(
        "tests/fixtures/users.csv",
        on_progress=lambda pct: progress_log.append(pct),
    )

    assert len(result) > 0
    assert progress_log, "expected at least one progress update"
    assert progress_log[-1] == 100

If the implementation hallucinates an API, the test fails immediately with a real error message — usually TypeError: unexpected keyword argument. Way cheaper than debugging in production.
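
For context, here is one way load_users could satisfy that test. This is a sketch, not the article's actual implementation: pandas has no progress callback, so it approximates progress from the file position while reading in chunks.

# myapp/importer.py (sketch): pandas has no progress callback, so
# progress is approximated from how far into the file the parser is.
import os
from typing import Callable, Optional
import pandas as pd

def load_users(
    path: str,
    on_progress: Optional[Callable[[int], None]] = None,
    chunksize: int = 10_000,
) -> pd.DataFrame:
    total = max(os.path.getsize(path), 1)
    chunks = []
    with open(path, "rb") as fh:
        for chunk in pd.read_csv(fh, chunksize=chunksize):
            chunks.append(chunk)
            if on_progress is not None:
                # fh.tell() overshoots a little because pandas buffers
                # reads, but it is good enough for a progress bar.
                on_progress(min(100, round(fh.tell() * 100 / total)))
    if on_progress is not None:
        on_progress(100)  # guarantee the terminal update the test expects
    return pd.concat(chunks, ignore_index=True)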

Step 2: Run code, don't just read it

Add a pre-commit hook that blocks commits when tests fail. Yes, this is obvious. Yes, most teams I've worked with don't actually enforce it.

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: pytest-fast
        name: pytest (fast suite)
        entry: pytest -x -m "not slow"  # -x: stop on first failure
        language: system
        pass_filenames: false
        always_run: true

The point isn't catching every bug. It's catching the plausible-but-wrong ones the moment they hit your branch, before they pile up into a multi-hour debugging session two weeks later.
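
Two setup details that trip people up: the hook does nothing until you run pre-commit install once per clone, and pytest warns on -m "not slow" unless the marker is registered. Registration is one block in pyproject.toml (assuming you keep pytest config there):

# pyproject.toml: register the marker so pytest recognizes "slow"
[tool.pytest.ini_options]
markers = [
    "slow: long-running tests excluded from the pre-commit fast suite",
]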

Step 3: Pin the dependency surface

A surprising amount of hallucination happens because the model assumes a different version of a library than you have installed. Lock your versions and tell the assistant which version you're on:

# pyproject.toml
[project]
dependencies = [
    "pandas==2.2.3",    # exact pin, not >=
    "pydantic==2.9.2",
]

When you prompt, include the version. "Using pandas 2.2.3, write a CSV importer with progress reporting" gets you closer to reality than the same prompt without the version, because the model will at least try to constrain its API recall.
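
A quick way to pull the exact versions to paste into a prompt (importlib.metadata is in the standard library, so this runs anywhere):

# Print installed versions in requirements style for pasting into a prompt.
from importlib.metadata import version

for pkg in ("pandas", "pydantic"):
    print(f"{pkg}=={version(pkg)}")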

Step 4: Prefer narrow prompts over broad ones

Long, multi-feature prompts produce code where errors compound. I get better results asking for one function at a time, with clear inputs and outputs:

Function signature:
    def parse_user_row(row: dict) -> User: ...

Requirements:
- Strip whitespace from email
- Reject rows where email is missing or invalid
- Return User(email=..., name=..., created_at=...)
- Raise InvalidRowError on bad data, do not log

Use only the standard library and pydantic 2.9.

Narrow scope, explicit constraints, named version. My hallucination rate drops noticeably with this format.
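
For reference, this is roughly the shape of output that prompt should produce. The User model and InvalidRowError are the hypothetical names from the prompt, and the email check is deliberately simple (pydantic's EmailStr would pull in the email-validator extra):

# Sketch of an acceptable response to the prompt above.
from datetime import datetime
from pydantic import BaseModel, ValidationError

class InvalidRowError(Exception):
    pass

class User(BaseModel):
    email: str
    name: str
    created_at: datetime

def parse_user_row(row: dict) -> User:
    email = (row.get("email") or "").strip()
    if not email or "@" not in email:
        raise InvalidRowError(f"bad email: {email!r}")
    try:
        return User(
            email=email,
            name=(row.get("name") or "").strip(),
            created_at=row.get("created_at"),
        )
    except ValidationError as exc:
        # Per the requirements: raise, don't log.
        raise InvalidRowError(str(exc)) from exc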

Prevention: Build Habits, Not Heroics

A few things I now do reflexively:

  • Read the imports first. If the generated code imports something you didn't ask for, that's a yellow flag. Verify the import path exists in your installed version before reading further.
  • Distrust convenience parameters. When a function call has a kwarg that feels suspiciously just right for your problem, look it up in the docs, or ask the installed library directly (see the snippet after this list). That's the highest-probability hallucination spot.
  • Treat "looks correct" as a smell. If you read 30 lines of generated code and have zero questions, you didn't read carefully. There should always be at least one thing to verify.
  • Keep your test runtime fast. If your full suite takes eight minutes, you'll skip running it. Sub-30-second feedback loops are what actually keep this workflow honest.
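
The first two habits don't even require opening the docs; the installed package can answer directly. A quick check (inspect is standard library, and this works on plain Python functions like pandas 2.x's read_csv):

# Ask the installed library, not the model, whether a parameter exists.
import inspect
import pandas as pd

sig = inspect.signature(pd.read_csv)
print("on_progress" in sig.parameters)  # False: the kwarg was hallucinated
print("chunksize" in sig.parameters)    # True: this one is real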

So, More Work or Less?

After two years, my honest answer is: roughly the same amount of work, but distributed differently. Less typing, more reading. Less greenfield design, more verification. The people I see losing time to AI tools are the ones who didn't shift the verification load anywhere — they just trusted the output and inherited a slower debugging tail.

The tooling won't fix this for you. The workflow will.
