Max Kryvych

Posted on May 18

Lessons from three months of vibe coding (and a complexity score of 58)

#ai #productivity #devops #codequality

TL;DR — I spent three months vibe-coding a side project with an AI agent. It felt fast and productive until I opened a file and saw 3000 lines, repeated helpers copied ten times, and one function with cyclomatic complexity 58, driven by roughly 53 branches. The agent was not the root problem. The missing feedback loop was. This is what I should have set up on day one, and what I now consider mandatory when working with AI agents on code I may need to maintain.

uv run complexipy . --top 10
./rounds/views.py
    current_state 58 (last: 52, Δ = +6)  ❌ FAILED
    host_advance_post_round 45 (last: 9, Δ = +36)  ❌ FAILED
    host_rewind_round 37 (new, Δ = +37)  ❌ FAILED
    audience_panel 34 (last: 35, Δ = -1)  ❌ FAILED
    debate_vote_results 18  ❌ FAILED
    host_create_game 16  ❌ FAILED
    _debate_winner_details 12  ✅ PASSED
    audience_prompt_submitted 11  ✅ PASSED
    host_select_vote_winner 11  ✅ PASSED
    preview_screen 11  ✅ PASSED
Failed functions:
 - ./rounds/views.py: audience_panel, current_state, debate_vote_results, host_advance_post_round, host_create_game,
   host_rewind_round

The moment it clicked

I was not running a tool. I was not reviewing a PR. I was just poking around the repo to remind myself where something lived, and I opened a file that had grown past 3000 lines.

Inside, I found functions that should have been one shared helper, copy-pasted with one or two lines different each time. Every time I had asked the agent to add a feature that was "similar but slightly different," it had duplicated the whole thing and tweaked the parts that varied.

I ran radon out of curiosity. One function clocked in at cyclomatic complexity 58. Roughly fifty-three branches in a single function.

That was the moment I realized I had a problem I could not easily fix.

How I got there

I want to be honest about how this started, because I think it is the most common path.

It was a small idea: a Django side project. I knew how to code; this was not me learning Python from scratch. But I was learning how to work with an AI agent. I did not know the workflows, the tooling around agents, what good guardrails looked like, or how other people were structuring this kind of work. So I was figuring it out as I went.

I told myself I was doing spec-driven development. I would write a description of what I wanted, hand it to the agent, and review the output.

In reality, I was vibe coding.

I was not actually reading the diffs. I was checking that the feature worked — does the endpoint return the right thing, does the test pass — and moving on. The agent shipped, I accepted, repeat.

For weeks it felt great. Features landed. Tests were green. I had momentum.

The slow collapse

The problem was not a single bad commit. The problem was a thousand small, locally rational choices.

Every time I asked for a feature that was "like the last one but with a small change," the agent did not extract a shared helper. It copied the previous implementation and edited it. That is the safest move for the agent in any given turn: copying working code is less likely to break something than refactoring it. Without anything pushing back, copying wins every time.

The if/elif chains kept growing. New cases got appended, not refactored into a dispatch table, strategy, command handler, or state machine. Branches piled on branches. One function ended up at CC 58.
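To make a number like CC 58 concrete: cyclomatic complexity is roughly one plus the number of decision points in a function, so every appended elif adds one. The sketch below is a rough approximation built on the standard library's ast module. It is not how radon or ruff compute their scores, and the names approx_cyclomatic_complexity and route are illustrative only.

```python
import ast

# Node types treated as decision points in this rough approximation.
_DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source: str) -> int:
    """Return 1 + the number of decision points found in `source`."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two extra paths.
            score += len(node.values) - 1
    return score


# Hypothetical three-branch function; each `elif` is a nested `if` in the AST.
SAMPLE = '''
def route(event_type):
    if event_type == "round_started":
        return 1
    elif event_type == "vote_submitted":
        return 2
    elif event_type == "debate_finished":
        return 3
    return 0
'''

print(approx_cyclomatic_complexity(SAMPLE))  # 1 base path + 3 branches = 4
```

Three branches already score 4. Append fifty more cases to the same chain and the score lands in the range that prompted this post.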

The CSS had the same failure mode. New components got their own styles, copied from the previous component and slightly tweaked. There was no design system, no shared tokens, and no enforced layering. It just sprawled.

By the time I noticed, two things were true at once:

The codebase was big enough that I could not hold it in my head anymore.
I was afraid to touch it myself. Not because I did not know how, but because any change risked breaking something I could no longer reason about.

The agent had become the only thing still willing to navigate the mess, and it was the thing that had created the mess.

That is the worst place to be. The point of using an AI agent is to amplify what you can do. If you end up unable to refactor your own code without the agent, you have not amplified anything. You have outsourced something you cannot take back.

The actual lesson

Here is what I missed: an agent may have statistical priors about what good code often looks like, but it does not have a durable, enforceable model of your architecture unless you encode one.

It does not reliably know your conventions, your boundaries, your standards, or the tradeoffs behind your existing code. It does not get tired. It does not feel discomfort when a file grows to 3000 lines. It does not decide to refactor unless the environment tells it that the current shape is unacceptable.

I thought I was giving it feedback. I was not.

"Make it cleaner" is not feedback an agent can act on reliably.

"Reduce complexity" is not enough either.

The agent needs feedback that is specific, machine-readable, and automatic: the kind you get from a linter, a type checker, an architectural test, a duplication detector, or a failing complexity threshold.

When I had no feedback loop, the agent did what an unsupervised junior developer might do under pressure: it solved each problem in the easiest local way.

Copy-paste. Append a branch. Skip the refactor. Keep the tests green. Move on.

Each move was rational on its own. The aggregate was structural debt.

The reframe that finally landed for me was this:

The agent behaved correctly given the environment I gave it. The environment was the bug, not the agent.

Why quality gates are non-negotiable now

Quality gates — linters, type checks, complexity limits, architectural fitness functions, duplication detectors — were already good engineering practice before AI. You could ship without them, but you paid for it later.

With AI-assisted development, the cost curve changes.

The rate of code production is now far higher than your rate of code comprehension. Your mental context window did not grow when you started using an agent. The codebase's growth rate did.

The agent has no reliable memory of your conventions across sessions. Anything you do not encode in tooling, tests, docs, or project rules will drift.

The agent has no inherent incentive to refactor. Refactoring increases diff size and risk. Copying is cheaper. Without a gate, entropy wins.

And "just review it carefully" does not scale as the primary control. Human review still matters, but it should not be spent catching formatting drift, obvious duplication, missing imports, or unchecked complexity. Those should fail before review. Review should focus on behavior, architecture, data modeling, naming, and whether the change should exist at all.

So gates are not optional anymore. They are prerequisites. You set them up before the agent writes a line, not after.

The time spent configuring them is not overhead. In AI-assisted development, defining the feedback loop is part of the engineering work.

Matt Pocock frames this as "outrunning your headlights": AI generates code faster than you can verify it, and the only way to stay inside the beam is the same discipline that always kept engineers honest — incremental delivery, test-first development, and structures you can reason about. The gates are what put the headlights back on.

When the agent produces something that violates a gate, the failure is automatic, specific, and machine-readable. You feed that failure back to the agent. The agent fixes it. You move on.
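One way to picture "machine-readable" is a small parser that turns linter output into structured findings an agent, or a script, can iterate over. This is a minimal sketch that assumes a concise one-line-per-finding format of the shape path:line:col: CODE message. The report text, line numbers, and the Finding and parse_findings names are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed line shape: "path:line:col: CODE message".
_FINDING = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]+\d+) (?P<message>.+)$"
)


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    code: str
    message: str


def parse_findings(report: str) -> list[Finding]:
    """Keep only lines that look like findings; summaries are skipped."""
    findings = []
    for raw in report.splitlines():
        match = _FINDING.match(raw.strip())
        if match:
            findings.append(
                Finding(match["path"], int(match["line"]), match["code"], match["message"])
            )
    return findings


# Hypothetical report text, for illustration only.
REPORT = """\
rounds/views.py:120:5: C901 `current_state` is too complex (58 > 10)
rounds/views.py:15:1: F401 `os` imported but unused
Found 2 errors.
"""

for finding in parse_findings(REPORT):
    print(f"{finding.path}:{finding.line} [{finding.code}] {finding.message}")
```

The value is less in the parser than in the contract: each failure names a file, a line, a rule, and a reason, which is far more actionable than "make it cleaner."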

That is the feedback loop I was missing for three months.

A small example: from branch pile to dispatch table

This is the kind of shape I let accumulate:

async def current_state(event_type: str, payload: dict) -> dict:
    if event_type == "round_started":
        # validate payload
        # load objects
        # update state
        # return response
        ...
    elif event_type == "vote_submitted":
        # validate payload
        # load objects
        # update state
        # return response
        ...
    elif event_type == "debate_finished":
        # validate payload
        # load objects
        # update state
        # return response
        ...
    # 50 more branches later...

Each branch looked harmless when added. The function only became obviously wrong after enough small additions accumulated.

The refactor I should have forced much earlier was boring and mechanical:

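from collections.abc import Awaitable, Callable
from typing import Any

# Each handler takes the event payload and returns the response dict.
Handler = Callable[[dict[str, Any]], Awaitable[dict[str, Any]]]


# Not shown in the original snippet; a plain exception subclass is assumed here.
class UnknownEventType(LookupError):
    """Raised when no handler is registered for an event type."""


# The handle_* coroutines are defined elsewhere, one per former branch.
EVENT_HANDLERS: dict[str, Handler] = {
    "round_started": handle_round_started,
    "vote_submitted": handle_vote_submitted,
    "debate_finished": handle_debate_finished,
}


async def current_state(event_type: str, payload: dict[str, Any]) -> dict[str, Any]:
    # Unknown event types fail loudly instead of falling through a chain.
    try:
        handler = EVENT_HANDLERS[event_type]
    except KeyError as exc:
        raise UnknownEventType(event_type) from exc

    return await handler(payload)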

Then each handler gets its own unit tests, and the dispatch function stays simple.

The point is not that every conditional should become a dict. The point is that the refactor needs a name, a target shape, and a test strategy. "Clean this up" is too vague. "Replace this roughly 53-branch conditional with a dispatch table and one tested handler per event type" is something an agent can execute.
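As a sketch of what "one tested handler per event type" could look like, here is a self-contained version with a single stand-in handler and two tests. The handler body, the payload shape, and the UnknownEventType definition are assumptions for illustration, and the tests use plain assertions so the block runs without pytest.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Payload = dict[str, Any]
Handler = Callable[[Payload], Awaitable[Payload]]


class UnknownEventType(LookupError):
    """No handler is registered for the event type."""


async def handle_round_started(payload: Payload) -> Payload:
    # Stand-in body; a real handler would validate, load objects, and update state.
    return {"status": "round_started", "round": payload["round"]}


EVENT_HANDLERS: dict[str, Handler] = {"round_started": handle_round_started}


async def current_state(event_type: str, payload: Payload) -> Payload:
    try:
        handler = EVENT_HANDLERS[event_type]
    except KeyError as exc:
        raise UnknownEventType(event_type) from exc
    return await handler(payload)


def test_round_started_is_dispatched() -> None:
    result = asyncio.run(current_state("round_started", {"round": 3}))
    assert result == {"status": "round_started", "round": 3}


def test_unknown_event_is_rejected() -> None:
    try:
        asyncio.run(current_state("no_such_event", {}))
    except UnknownEventType:
        return
    raise AssertionError("expected UnknownEventType")


test_round_started_is_dispatched()
test_unknown_event_is_rejected()
print("dispatch tests passed")
```

Adding a new event type then means one new handler, one new dict entry, and one new test, while the complexity of current_state stays flat.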

The checklist I wish I had on day one

I now think about this in tiers. Pick the tier that matches the project's expected lifespan.

If I may throw it away next week, I use formatting, linting, tests, and one obvious check command.

If I may keep it for more than a month, I add type checking, duplication detection, CI, and architecture rules.

If a team or production system depends on it, I add architectural tests, security scanning, dependency scanning, mutation testing on critical paths, and complexity tracking over time.

Tier 0 — Prototype guardrails

Set up before the agent writes a single line.

pyproject.toml with ruff for linting and formatting. Enable complexity rules (C901, max-complexity around 10), bugbear (B), simplify (SIM), pyupgrade (UP), and a sensible subset of pylint (PL).
A minimal test suite with one obvious command to run it.
A task runner such as mise, just, make, or a simple script with check, test, and fix tasks. The agent needs one obvious command to run.
A short CONTRIBUTING.md, AGENTS.md, or .cursorrules file stating the rules: max function length, max complexity, layering rules, no business logic in transport handlers, no persistence calls outside the persistence boundary, no copy-paste without extracting a function or asking first.
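For the "simple script" option, a check runner can be very small. The sketch below runs every named command even after a failure and reports exit codes by name, so the agent sees all failures in one pass. The commands here are placeholders that only invoke the Python interpreter; a real project would presumably map the same names to its linter, type checker, and test runner.

```python
import subprocess
import sys


def run_checks(commands: dict[str, list[str]]) -> dict[str, int]:
    """Run every command, even after a failure, and return exit codes by name."""
    results = {}
    for name, argv in commands.items():
        completed = subprocess.run(argv, check=False)
        results[name] = completed.returncode
        status = "ok" if completed.returncode == 0 else f"FAILED ({completed.returncode})"
        print(f"{name}: {status}")
    return results


# Placeholder commands so the sketch runs anywhere (assumption, not a real setup).
CHECKS = {
    "lint": [sys.executable, "-c", "print('lint placeholder')"],
    "test": [sys.executable, "-c", "raise SystemExit(0)"],
}


def main() -> int:
    codes = run_checks(CHECKS)
    # Non-zero if any check failed, so CI and hooks can block on it.
    return 1 if any(codes.values()) else 0


exit_code = main()
print("exit code:", exit_code)
```

Not stopping at the first failure is a deliberate choice: one run yields the complete list of problems to hand back to the agent.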

Tier 1 — Sustained project guardrails

Recommended for anything you expect to touch a month from now.

Strict type checking with mypy, pyright, or ty for your own modules.
Pre-commit hooks wired to linting, formatting, type checking, and tests where practical.
pytest with coverage gated at a real number, usually 60–80%, not an aspirational 100%.
A duplication detector such as pylint --enable=duplicate-code or jscpd.
bandit or equivalent tooling for common security smells.
CI that runs all of this on every PR and blocks merge.
A single integration test that boots the app and hits /health. This catches many "the agent broke imports" failures.
A small agent skill library that encodes your architecture decisions as workflows, not just rules. Rules tell the agent what not to do. Skills tell it what to reach for.
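To illustrate the idea behind a duplication detector, here is a toy version that flags runs of identical consecutive lines, ignoring indentation and blank lines. It is only a sketch of the concept, not how pylint or jscpd work internally, and the sample functions are hypothetical.

```python
def find_duplicate_blocks(source: str, window: int = 4) -> list[tuple[int, int]]:
    """Return (first_line, repeat_line) pairs where `window` consecutive
    non-blank lines repeat. Line numbers are 1-based."""
    lines = [
        (number, text.strip())
        for number, text in enumerate(source.splitlines(), start=1)
        if text.strip()
    ]
    seen: dict[tuple[str, ...], int] = {}
    duplicates = []
    for start in range(len(lines) - window + 1):
        chunk = lines[start : start + window]
        key = tuple(text for _, text in chunk)
        first_line = chunk[0][0]
        if key in seen:
            duplicates.append((seen[key], first_line))
        else:
            seen[key] = first_line
    return duplicates


# Hypothetical copy-paste: same body, one tweaked return line.
SAMPLE = """\
def vote_results(game):
    rounds = game.rounds.all()
    total = sum(r.votes for r in rounds)
    winner = max(rounds, key=lambda r: r.votes)
    return {"total": total, "winner": winner}

def debate_results(game):
    rounds = game.rounds.all()
    total = sum(r.votes for r in rounds)
    winner = max(rounds, key=lambda r: r.votes)
    return {"total": total, "winner": winner, "kind": "debate"}
"""

print(find_duplicate_blocks(SAMPLE, window=3))
```

With a window of three lines, the shared body starting at line 2 is reported as repeated at line 8, which is exactly the "similar but slightly different" pattern that grows unnoticed without a gate.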

Tier 2 — Production and team guardrails

Recommended when other people, users, or revenue depend on the code.

Architectural tests. Python lets anything import anything; you need to enforce lanes explicitly. Options include import-linter, pytest-archon, or deply.
Mutation testing on critical modules with a tool such as mutmut. Agents can write tests that pass without meaningfully testing behavior.
Dependency scanning with pip-audit or equivalent.
Secret scanning with gitleaks or equivalent.
Container and infrastructure checks where relevant, such as hadolint for Dockerfiles.
Complexity tracked over time with radon, xenon, or a similar tool. Fail on regressions, not just absolute thresholds.
Contract tests against your API spec, for example with schemathesis if you expose OpenAPI.
Authorization tests for sensitive routes and workflows. Do not only test that endpoints work; test who is allowed to call them.
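"Fail on regressions" can be sketched as a comparison between a stored baseline and the current scores. The numbers below are taken from the complexipy report at the top of this post; the find_regressions helper and the limit of 15 for new functions are assumptions for illustration.

```python
def find_regressions(
    baseline: dict[str, int], current: dict[str, int], max_allowed: int = 15
) -> list[str]:
    """Report functions whose score grew, plus new functions that start
    above `max_allowed`."""
    problems = []
    for name, score in sorted(current.items()):
        previous = baseline.get(name)
        if previous is None:
            if score > max_allowed:
                problems.append(f"{name}: new at {score} (limit {max_allowed})")
        elif score > previous:
            problems.append(f"{name}: {previous} -> {score} (+{score - previous})")
    return problems


# Scores from the report shown at the top of the post.
BASELINE = {"current_state": 52, "host_advance_post_round": 9, "audience_panel": 35}
CURRENT = {
    "current_state": 58,
    "host_advance_post_round": 45,
    "host_rewind_round": 37,
    "audience_panel": 34,
}

for problem in find_regressions(BASELINE, CURRENT):
    print(problem)
```

Note that audience_panel passes this check because it improved from 35 to 34, even though it is still far above any sane absolute limit. A regression gate complements an absolute threshold; it does not replace it.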

Tooling principles, not just tool names

The specific tools matter less than the feedback they produce.

Principle	Example Python tools
Format and lint automatically	`ruff`
Type-check boundaries	`mypy`, `pyright`, `ty`
Prevent copy-paste growth	`pylint duplicate-code`, `jscpd`
Enforce architectural imports	`import-linter`, `pytest-archon`, `deply`
Track complexity	`radon`, `xenon`, `complexipy`
Catch dependency risk	`pip-audit`, `uv audit`
Catch security smells	`bandit`, `gitleaks`, `hadolint`
Validate API contracts	`schemathesis`

The point is not to install every tool in the table. The point is to decide what kind of failure you want the agent to see before bad code lands.

How to give the agent useful feedback

This is the part that changed my workflow most.

Do not say:

This code is messy. Clean it up.

Say:

Here is the output of ruff check app/services/user_service.py. Fix every finding without using # noqa. If a fix changes a function signature, list the call sites first and propose the change.

Do not say:

Reduce complexity.

Say:

This function dispatches on event_type with a roughly 53-branch if/elif chain. Replace it with a dict mapping event_type to handler functions, one handler per branch, each unit-tested.

Name the refactor. Cite the tool output. Constrain the solution. Require tests. That is how you get a refactor instead of a rename.
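The same pattern can be templated so that every request to the agent carries the tool output and the constraints. A minimal sketch follows; the build_fix_prompt helper and the sample report line are hypothetical, and the constraints simply mirror the ones quoted above.

```python
def build_fix_prompt(command: str, output: str, constraints: list[str]) -> str:
    """Assemble a prompt that cites the tool output and constrains the fix."""
    rules = "\n".join(f"- {rule}" for rule in constraints)
    return (
        f"Here is the output of `{command}`:\n\n"
        f"{output.strip()}\n\n"
        f"Fix every finding. Constraints:\n{rules}\n"
    )


prompt = build_fix_prompt(
    command="ruff check app/services/user_service.py",
    # Hypothetical finding, for illustration only.
    output="app/services/user_service.py:42:5: C901 `sync_user` is too complex (14 > 10)",
    constraints=[
        "Do not use # noqa.",
        "If a fix changes a function signature, list the call sites first.",
        "Add or update unit tests for every changed function.",
    ],
)
print(prompt)
```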

The tooling in my reference project

After the Django project, I started a separate FastAPI reference project specifically to encode what I had learned.

Different framework, same principles. The point is not FastAPI vs. Django. The point is what the setup looks like before feature code lands.

You can find it here:

https://github.com/maxkrivich/fast-api-reference-project

A quick note before the tool list: you do not need enterprise tooling for this. If your company already pays for SonarQube, Codacy, Snyk, or similar tools, use them. But every gate I am describing can be implemented with open-source tools. The entire point — feeding structured failures back to an agent — works with ruff and mypy just as well as it does with commercial SaaS.

Do not let "we do not have SonarQube" become the reason you skip the gates.

Here is what is in the repo, briefly:

ruff — single tool for linting and formatting, fast enough to run on save. Complexity is gated at max-complexity = 10, which would have caught my CC 58 function immediately.
ty in strict mode — Astral's type checker. It catches a large class of agent mistakes: wrong types, missing returns, and unhandled Optional values.
prek — Rust-based git hooks manager, similar to pre-commit but faster and without a Python dependency. It runs ruff, bandit, import-linter, pylint, gitleaks, and hadolint on every commit.
mise — one entry point for tasks and tool versions. The agent runs a single command, such as mise run full-ci-check, and gets a structured report it can act on.
pytest + coverage — the baseline. Gated at a real number, not an aspirational one.
bandit — common Python security smells. It catches agent moves like subprocess.run(..., shell=True) with user input.
pip-audit / uv audit — dependency CVEs. Agents will happily pin a vulnerable version if you do not check.
import-linter — enforces architecture through import contracts. This makes the build fail if a router imports a SQLAlchemy session directly or a domain model touches an HTTP adapter.
pylint duplicate-code checks — flags copy-paste before it accumulates.
radon / xenon — complexity tracking. The goal is to fail on regressions, not only absolute thresholds.
GitHub Actions — every PR runs the full gate set; nothing merges that has not passed.
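To show what an import contract checks in principle, here is a toy version that parses a module and reports imports from a forbidden package. It is a sketch of the idea using the standard library's ast module, not import-linter's API or configuration format, and the router source and module names are hypothetical.

```python
import ast


def forbidden_imports(
    source: str, forbidden_prefixes: tuple[str, ...]
) -> list[tuple[int, str]]:
    """Return (line, module) for each absolute import in `source` that
    belongs to one of the forbidden packages."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules = [node.module]
        else:
            continue
        for module in modules:
            if any(
                module == prefix or module.startswith(prefix + ".")
                for prefix in forbidden_prefixes
            ):
                hits.append((node.lineno, module))
    return sorted(hits)


# Hypothetical router module that reaches past the persistence boundary.
ROUTER_SOURCE = """\
from fastapi import APIRouter
from sqlalchemy.orm import Session
from app.services import users
"""

print(forbidden_imports(ROUTER_SOURCE, ("sqlalchemy",)))
```

The source is only parsed, never imported, so the check runs without the listed packages installed. The failure it produces has the same useful shape as the other gates: a line number and the exact rule that was broken.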

What gates will not catch

Honest disclaimer: I do not want to oversell this.

Gates catch mechanical slop: complexity, duplication, style, types, import violations, missing tests, and surface-level security issues.

They do not design the system for you.

They do not catch a bad product decision, a wrong abstraction, a leaky domain model, a service that should have been three services, or a missing concept in the data model.

What gates do is reclaim your attention. Once mechanical quality is automated, code review can shift from "is this clean?" to "is this the right shape?"

That is a much better use of human attention, and it is the only kind of review that is actually worth your time when working with an agent.

You still have to think. The gates just stop you from drowning while you do it.

Where I am now

The project that hit CC 58 is at a point where I think it is easier to rewrite than refactor. That is a hard thing to admit, but it is where I am.

The next version starts with the checklist in place before any feature lands.

If there is one thing I would want you to take from this, it is this:

Set up the feedback loop first.

Not after the first feature. Not after the first refactor. First.

The hour you spend configuring ruff, type checking, tests, and pre-commit on day one is the hour that prevents the month you would otherwise spend untangling structural debt on day ninety.

The agent is not the problem.

The environment we give it is.

If you have been through something similar, or have tooling I missed, I would genuinely like to hear it — leave a comment or ping me on GitHub.

DEV Community