TheBitForge

I Let AI Write 100% of My Code for 30 Days — Here's What Broke

The thing that started it was a pull request review. My senior dev left a comment that said, simply, "where did this pattern come from?" and I realized I had absolutely no memory of writing that function. I'd been using Copilot and Claude heavily for a few months — not exclusively, just as assistants — and somewhere in the paste-and-tweak cycle, I'd lost the thread. I wasn't sure if I'd written the logic, adapted it, or just... accepted it. That bothered me more than I expected. So I decided to find out what would happen if I went all the way: thirty days, every line of production logic coming from an AI. No sneaking in my own loops. No "just this once I'll write the helper function myself." Whatever the model gave me, I shaped with prompts and pasted into the repo.

The rules were specific, because vague experiments produce vague conclusions. I could write comments, docstrings, commit messages, and prompts. I could restructure and rename. But I couldn't author logic — no manually written conditionals, no hand-rolled algorithms, no "I'll just type this real quick." The model had to generate the actual code. I was using Cursor as my primary environment, leaning on Claude for longer architectural questions and Copilot for inline completions. Early on, this felt almost too easy. I remember thinking, halfway through day two, that I might have underestimated how capable these tools had become. That thought aged about as well as milk in a power outage.

Week one was genuinely fun, and I don't say that sarcastically. I was building a small internal tool — a REST API that aggregated data from a few internal services, normalized it, and exposed endpoints for a dashboard. Boring stuff, honestly. Exactly the kind of work where AI absolutely shines. I prompted Claude for a pagination utility that handled cursor-based navigation across two different upstream response formats, and it came back with something clean and well-commented on the first try. I asked Copilot to scaffold the Express route handlers and they were... fine. Better than fine. They followed the patterns already in the codebase, they named things sensibly, and they included basic error handling I probably would have skipped in a first pass. I felt like I'd hired a very fast, very obedient junior developer who never needed context on the business domain because they just inferred it from the existing files. I was shipping fast. Standups were short. I was, briefly, a genius.
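To give a sense of what "clean on the first try" looked like, here's roughly the shape of that pagination utility. The names and response formats below are illustrative, not the model's actual output:

```typescript
// A sketch with hypothetical names, not the generated code itself.
type Page<T> = { items: T[]; nextCursor: string | null };

// The two upstream services paginated differently.
type UpstreamA<T> = { results: T[]; nextCursor?: string };
type UpstreamB<T> = { data: T[]; meta: { next_page_token?: string } };

// Normalize both formats into one shape.
const fromUpstreamA = <T>(res: UpstreamA<T>): Page<T> => ({
  items: res.results,
  nextCursor: res.nextCursor ?? null,
});

const fromUpstreamB = <T>(res: UpstreamB<T>): Page<T> => ({
  items: res.data,
  nextCursor: res.meta.next_page_token ?? null,
});

// One loop walks every page regardless of which source it came from.
async function* paginate<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>
): AsyncGenerator<T> {
  let cursor: string | null = null;
  do {
    const page = await fetchPage(cursor);
    yield* page.items;
    cursor = page.nextCursor;
  } while (cursor !== null);
}

// Usage: for await (const row of paginate((c) => serviceA(c).then(fromUpstreamA))) { ... }
```

One normalizer per upstream format, one generator that doesn't care which source it's walking. This is exactly the kind of glue code these tools produce effortlessly.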

The cracks showed up mid-week two, and they showed up as a race condition I didn't catch until staging.

The API had a write path that triggered a cache invalidation and then immediately queued a background job to repopulate the cache. Under normal load, fine. Under concurrent requests — specifically when two writes hit within the same ~200ms window — the job queue would sometimes process a stale read before the write had fully committed. The result was a cached value that was wrong in a way that was really hard to reproduce: it happened maybe one in fifteen concurrent tests, and only when the database was under light load (heavy load added enough latency that the timing issue disappeared, which made it even more confusing to debug).
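Here's a stripped-down sketch of the shape of the bug, with hypothetical names and the database and job queue simulated by Maps and timers so the interleaving is visible. It is not the real service code:

```typescript
const db = new Map<string, string>([["k", "v0"]]); // stands in for the database
const cache = new Map<string, string>();           // stands in for the cache

async function commitWrite(key: string, value: string): Promise<void> {
  // Simulated commit latency: the ~200ms window the race lives in.
  await new Promise((r) => setTimeout(r, Math.random() * 200));
  db.set(key, value);
}

async function repopulateJob(key: string): Promise<void> {
  // Simulated queue latency before the background job runs.
  await new Promise((r) => setTimeout(r, Math.random() * 200));
  const fresh = db.get(key); // can execute BEFORE the write commits: stale read
  if (fresh !== undefined) cache.set(key, fresh);
}

async function handleWrite(key: string, value: string): Promise<void> {
  cache.delete(key);                      // 1. invalidate the cache
  const commit = commitWrite(key, value); // 2. start the write
  void repopulateJob(key);                // 3. queue the job immediately: the bug
  await commit;                           // awaiting this BEFORE step 3 is the fix
}

(async () => {
  await Promise.all([handleWrite("k", "v1"), handleWrite("k", "v2")]);
  await new Promise((r) => setTimeout(r, 250)); // let the jobs drain
  console.log(db.get("k"), cache.get("k"));     // intermittently disagree
})();
```

Run that a handful of times and the database and cache values only sometimes diverge, which is exactly the one-in-fifteen flakiness that made the real bug so miserable to reproduce.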

I asked Claude to diagnose it. It told me, confidently, that the problem was in my cache key generation and suggested I add a hash of the request body. I changed the cache key logic. The bug persisted. I asked again with more context. It suggested I add a lock around the cache write. I did. Bug persisted. Over three days I got five different confident, plausible, wrong answers — and each one required me to understand enough about the codebase to implement the suggestion, which was genuinely hard because I hadn't written most of the codebase. By day fifteen I was doing something I hadn't done in years: printing out code and annotating it with a pen, trying to rebuild a mental model of a system that was technically mine but felt like someone else's work. Which, in a meaningful sense, it was.

That feeling got worse in week three. Not dramatically worse — there was no single catastrophic moment — just a slow, building unease about what I actually understood. I'd look at a module and know what it did without knowing why it was structured that way. The AI had made architectural decisions while I was focused on prompting: it had introduced an abstraction layer around the database client that wasn't necessary for the current scale but would theoretically help later. Fine in isolation. But then a different prompt session had generated code that bypassed that layer in one specific case, for reasons I couldn't reconstruct, and now there was an inconsistency baked into the codebase with no trail of reasoning behind it.

A teammate asked me to walk her through the service before she picked up a ticket. I started explaining and hit a wall about four minutes in. I could describe the data flow. I couldn't explain why we'd chosen to handle retry logic the way we had, or why the error taxonomy was structured the way it was. The AI had made those calls in the context of individual prompts, without any persistent understanding of the broader system. I'd accepted them without interrogating them. At standup I described a module as "working how you'd expect" and immediately wondered if I actually knew what that meant.

The moment that really crystallized it: I was reading through the auth middleware, found a token validation step, and couldn't, at first, figure out why it worked. The logic was correct — tests passed, production was fine — but the sequence of operations wasn't what I would have written and wasn't what I'd have reached for if you asked me to describe how JWT validation should work. I spent twenty minutes convincing myself it was correct before I just... moved on. That's not a healthy relationship with your own code.
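For contrast, this is the sequence I'd have reached for, sketched with the jsonwebtoken package and hypothetical issuer and audience values. It's what convention looks like, not what the AI actually generated:

```typescript
import jwt from "jsonwebtoken";
import type { Request, Response, NextFunction } from "express";

function requireAuth(req: Request, res: Response, next: NextFunction): void {
  // 1. Extract the bearer token from the Authorization header.
  const header = req.headers.authorization ?? "";
  const token = header.startsWith("Bearer ") ? header.slice(7) : null;
  if (!token) {
    res.status(401).json({ error: "missing token" });
    return;
  }
  try {
    // 2. Verify signature, expiry, issuer, and audience in one call.
    const claims = jwt.verify(token, process.env.JWT_SECRET as string, {
      issuer: "internal-dashboard", // hypothetical values
      audience: "internal-api",
    });
    // 3. Attach the verified claims and continue.
    (req as Request & { user?: unknown }).user = claims;
    next();
  } catch {
    res.status(401).json({ error: "invalid token" });
  }
}
```

The generated middleware did all of this, eventually, in an order I couldn't justify from first principles. Correct and legible are not the same property.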

Week four is where the experiment stopped being about AI and started being about me. I kept the rule technically intact — I still wasn't hand-authoring logic — but I changed how I was prompting. I started writing detailed comments first: what the function needed to do, what edge cases mattered, what it shouldn't touch. Then I'd let the AI fill in the implementation, and I'd read every line before it left my editor. Not skim. Read. I was editing like someone working with a draft from a talented but context-deaf collaborator — keeping the good parts, pushing back on the weird choices, asking follow-up questions in the prompt until the output made sense to me before it went into the repo. It was slower than week one. It was faster than my normal workflow. And critically: when I looked at the code afterward, I understood it.
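The workflow in miniature: write the contract as a doc comment, let the model fill the body, then read every line of it. The example below is illustrative rather than lifted from the repo, but it's the shape of every function I shipped that week:

```typescript
// Illustrative, not from the repo. The doc comment is the part I wrote; the
// body is the kind of implementation the model filled in, read before commit.

type Row = { id: string; updatedAt?: string };

/**
 * Merge dashboard rows from the two upstream services into one list.
 * - Dedupe by `id`; when both sources have a row, keep the fresher `updatedAt`.
 * - Must not mutate either input array.
 * - Edge cases: empty inputs; a row missing `updatedAt` counts as oldest.
 *   (`updatedAt` is an ISO-8601 string, so string comparison orders correctly.)
 */
function mergeRows(a: Row[], b: Row[]): Row[] {
  const byId = new Map<string, Row>();
  for (const row of [...a, ...b]) {
    const existing = byId.get(row.id);
    if (!existing || (row.updatedAt ?? "") > (existing.updatedAt ?? "")) {
      byId.set(row.id, row);
    }
  }
  return [...byId.values()];
}
```

Writing the spec first forced me to decide the edge cases before the model did, which turned out to be most of the battle.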

So what actually broke, across the full thirty days? Debugging depth was the first casualty. The AI is excellent at generating plausible diagnoses but has no ability to hold the full state of a running system in its head the way a developer who wrote the code can. It diagnoses from pattern-matching against known bug types, which means it's great at common bugs and confidently wrong about novel ones. Code ownership evaporated faster than I expected — not just as a feeling, but practically, in my ability to reason about the system under pressure. Architecture suffered because AI makes local decisions well and global decisions poorly; it optimizes for the function you're currently writing without remembering what it suggested three sessions ago, which is how you end up with inconsistent patterns living side by side in the same repo. Context window limitations created a specific kind of drift: the longer the project ran, the more the AI was working from partial information, and the more it defaulted to generic solutions that didn't quite fit the specific shape of the codebase. And there was a consistent tendency to over-engineer simple things — a three-line data transform that came back as a class with four methods and an interface, because the AI was pattern-matching against enterprise-grade examples in its training data rather than reading the room.
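That last failure mode, in miniature. Both versions below are illustrative (and the second is abridged), but the pattern is exactly what kept coming back:

```typescript
type Item = { id: string; name: string; count: number };

// What the task needed:
const toDashboardRow = (r: Item) => ({
  label: `${r.name} (${r.count})`,
  href: `/items/${r.id}`,
});

// What came back: the same transform in an enterprise costume.
interface RowTransformer<I, O> {
  validate(input: I): boolean;
  transform(input: I): O;
}

class DashboardRowTransformer
  implements RowTransformer<Item, { label: string; href: string }> {
  validate(input: Item): boolean {
    return input.id.length > 0;
  }
  transform(input: Item): { label: string; href: string } {
    if (!this.validate(input)) throw new Error("Invalid dashboard row input");
    return { label: `${input.name} (${input.count})`, href: `/items/${input.id}` };
  }
}
```

Neither version is wrong. But one of them is three lines, and the other one is a maintenance surface.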

Here's what I actually changed after the experiment ended: I prompt before I type, but I read before I commit. I use AI the way I use Stack Overflow on a good day — as a fast path to a plausible answer that I then verify and adapt rather than paste and ship. I write the comments first, almost always. When something the AI generates is surprising, I stop and figure out why it works before I move on, because "it passes the tests" is not the same as understanding it, and understanding it is what lets you debug it at 11pm when the tests aren't running. I'm faster than I was before the experiment. I'm also more deliberate than I was in the weeks leading up to it, when I was half-vibe-coding without fully admitting it.

The honest summary is that AI is a very good first draft and a terrible final authority. The developers who'll do best with these tools are the ones who stay in the author's seat — which means staying in the critic's seat too, asking "why does this work" about their own code, and not outsourcing that question along with the typing.

Knowing what's in your codebase isn't a soft skill. It's the job.
