Tejas Giri
I let Claude Code write an entire feature for a week. Here's what actually broke.

I've been building things with AI for about a year now. Autocomplete, chat, the usual. But last month I decided to try something different.

I picked one feature on a side project I'm working on and handed the entire thing to Claude Code. Not "help me write this function." Not "review this PR." The whole feature. Architecture, schema, API, UI, tests. For one full week, I didn't write a single line by hand. I only reviewed, ran, and prompted.

I wanted to know how far this actually goes when you take the training wheels off. I'm not here to sell you on AI or scare you about it. I'm telling you what I saw.

The setup

The feature was a notification system. User actions trigger events, events get processed, some go to email, some go to in-app. Standard stuff. I'd built something similar before so I knew what good looked like.

Stack was Next.js, Postgres, a queue. I gave Claude Code access to the repo and described what I wanted. From there I tried to act like a tech lead reviewing a junior dev's work, not a developer writing code.
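
To make that concrete, here's roughly the producer side of the flow: a user action writes an event row, then hands it to the queue for the worker to fan out to email or in-app. This is a sketch from memory, not the code Claude actually wrote, and the helpers here (db.insertEvent, queue.enqueue) and the column names are just illustrative.

// Sketch of the producer side (illustrative names, not the real code).
// A user action persists an event row, then enqueues a job for the worker.
async function onCommentCreated(comment) {
  const event = await db.insertEvent({
    type: 'comment.created',
    userId: comment.authorId,
    payload: { commentId: comment.id },
    channels: ['email', 'in_app'],
  })
  await queue.enqueue('process-notification', { eventId: event.id })
}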

Days 1 and 2 felt like cheating

The first two days were genuinely shocking.

I described the feature in maybe 200 words. Within an hour I had a working schema, migration files, the events table, a worker that processed jobs, and a basic API. It even wrote tests. Most of it ran on the first try.

I kept thinking about how long this would have taken me solo. Probably two days of just typing. Schema decisions, naming columns, writing the boilerplate worker code. Done in a morning.

This is the part everyone shares on Twitter. The "look how fast" part. It's real. I'm not going to pretend it isn't.

But that's also where most articles stop. So let me keep going.

Day 3 — the silent bug

By day 3 I had the full feature working end to end. Events firing, getting queued, processed, emails going out. I tested it manually. Looked good. Pushed it to staging. Looked good there too.

Then I noticed something off. Some notifications were taking 4 seconds to send. Others were instant. Same code path. Same payload size. Same queue.

I asked Claude Code to investigate. It looked at the code, made some plausible suggestions about the queue config, the database connection pool, retry logic. Nothing fixed it.

I spent the next 4 hours debugging this myself. Reading code I hadn't written. Adding logs. Watching traces. The bug was small and brutal.

In one of the worker files, Claude had written something like this:

const result = await processEvent(event)
await markAsProcessed(event.id)
sendNotification(result)

Spot it? sendNotification was missing the await. So sometimes the function finished before the notification actually went out, the worker moved on, and the notification got delayed because Node was busy with the next job.

Locally this never showed up because nothing else was happening on the event loop. In staging with real traffic, it caused random 4-second delays.
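
The fix itself was a one-word diff: await the call so the worker doesn't move on until the notification has actually gone out.

const result = await processEvent(event)
await markAsProcessed(event.id)
await sendNotification(result) // the await that was missing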

That single missing await cost me half a day. And here's the thing that bothered me: if I had written that code myself, I would have seen it instantly. Because I'd have been thinking about whether the call was async while writing it. I wasn't thinking. I was reviewing. Reviewing is a different mental mode than writing, and you miss things.

Day 4 — the rewrite loop

This is where the cracks really showed.

I asked Claude Code to add a feature on top of the existing system. Notification preferences per user. Standard stuff. It started writing.

Halfway through, I noticed it had subtly rewritten parts of the worker I'd already approved on day 2. Not the same code with new features added. Different code that did roughly the same thing. New variable names. Slightly different error handling. A different retry strategy.

I asked why it changed those. The answer was something polite about "improved consistency." But the old code was fine. I'd already tested it. Now I had to re-test the parts that worked yesterday because they weren't the same code anymore.

This kept happening for the rest of the day. Every new request seemed to drag in unrelated changes. I started noticing how much of my time was going into "why did you change that, change it back" instead of actually moving forward.

A senior dev I trust once told me the hardest skill in code review isn't finding bugs. It's noticing what changed that shouldn't have. With AI doing the writing, that skill becomes the entire job.

Day 5 — confidence with no basis

The most unsettling moment came on Friday.

I asked Claude Code to add rate limiting to the notification API. It wrote the code. Confident, clean, well-named functions. Used a Redis-based sliding window. Beautiful.

Except we weren't using Redis in this project. We never had been. I'd never mentioned Redis. There was no Redis client installed. It just decided we used Redis because that's what most rate limiters in its training data use.

When I pointed this out, it apologized and rewrote it with a simpler in-memory approach. Fine. But I sat there for a minute thinking: what if I hadn't caught that? What if I'd just run npm install redis and added a Redis instance to my infrastructure because the AI suggested it? On a different team, a less careful reviewer would absolutely have shipped that.

The code itself was good. The assumption underneath it was completely fabricated. And the confidence was the same either way.
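
For what it's worth, the simpler version it landed on was conceptually something like a per-user counter in process memory. I'm reconstructing this from memory, so treat the names and numbers as illustrative rather than a copy of what's in the repo.

// Simple fixed-window limiter, per user, kept in process memory.
// Fine for a single Node process; the counters reset when the server restarts.
const windows = new Map() // key -> { count, windowStart }

function isRateLimited(key, limit = 30, windowMs = 60_000) {
  const now = Date.now()
  const entry = windows.get(key)
  if (!entry || now - entry.windowStart >= windowMs) {
    windows.set(key, { count: 1, windowStart: now })
    return false
  }
  entry.count += 1
  return entry.count > limit
}
// In the API route: if (isRateLimited(userId)) return a 429.

Good enough for a side project running one process, which is exactly the context it didn't have when it reached for Redis.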

What it's actually good for

I want to be fair here because I don't want to write another doom post. By the end of the week I had a real working feature shipped. That's not nothing.

What it was actually great at:

The early scaffolding when there's no context to lose. Schema design, initial file structure, the boilerplate parts. The first 60 percent was genuinely faster.

Tedious but mechanical work. Writing test cases for code I'd already approved. Generating types from a schema. Converting one shape of data to another. Migration files. README updates. All the things that drain a developer's day without engaging their brain.

Exploration and unfamiliar territory. When I needed to integrate with a service I'd never used, having something that could read the docs and write a first attempt was useful. I'd learn faster from fixing its mistakes than from staring at empty documentation.

What it was bad at

Anything where context mattered. The longer the feature got, the more it forgot what we'd already decided. By day 4 I was repeating myself constantly. "We're not using Redis." "Don't change the worker file." "The retry logic is correct, leave it."

Subtle async correctness. The missing await was just the start. I found three more like it that week. Promise chains where the outer code didn't wait properly. Database transactions that weren't transactions because one query escaped the wrapper. Stuff that runs fine in tests and breaks under real load.
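
The transaction one is worth showing because it's so easy to miss in review. Simplified from what I remember, with a generic db/tx client standing in for the real one:

// The insert uses the transaction client, but the update grabs a fresh
// connection from the pool, so it runs outside the transaction entirely.
await db.transaction(async (tx) => {
  await tx.query('INSERT INTO notifications (user_id, event_id) VALUES ($1, $2)', [userId, eventId])
  await db.query('UPDATE events SET processed_at = now() WHERE id = $1', [eventId]) // should be tx.query
})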

Knowing when to stop. A human dev knows when a feature is done. AI keeps suggesting improvements. New features. Refactors. Better names. After a while you spend more time saying "no, this is fine" than actually building.

What changed in how I work

I'm still using Claude Code. Heavily. But I've changed how.

I don't hand it whole features anymore. I hand it pieces. Small enough that I can hold the whole change in my head when I review. Big enough that I'm not just doing autocomplete.

I read every line that gets committed. Not skim. Read. The cost of a missed await is higher than the time saved by not reading.

I keep my own context. The architectural decisions, the naming conventions, what we use and don't use: I write these in a single document and reference it in every prompt. Otherwise I'm fighting the same battles every session.

And I never let it write code in areas where the cost of being wrong is high. Auth flows. Payment webhooks. Anything that touches money or sessions. There I write it myself, slowly, paranoid, rereading. AI helps me explore and scaffold. I do the parts where being wrong matters.

The real takeaway

The week wasn't a failure. It also wasn't the productivity miracle the demos sell.

What I actually learned was about myself, not the tool. I noticed how my brain felt different when I was reviewing instead of writing. Less engaged. Less suspicious. More likely to nod and approve.

Writing code is partly how you understand the system. When you stop writing, you stop understanding, and you start trusting. And trust is exactly the wrong mode for shipping software.

The developers who'll do well over the next few years aren't the ones who use AI the most. They're the ones who keep the parts of their brain that catch a missing await even when they're tired, even when the code looks fine, even when the AI sounds confident.

The tool is good. Use it. Just don't let it use you.
