DEV Community

kol kol
kol kol

Posted on

I Let AI Write My Backend Code for a Week — Here's What Actually Broke

I told myself it would be fine. I had been using AI coding assistants for suggestions and autocomplete for months — and it worked great. So when a new project came up with a tight deadline, I thought: why not let AI handle the whole backend?

I set up a Cursor workspace, wrote a detailed spec, and hit generate. What followed was 5 days of "it compiles, but..." debugging that taught me more about software engineering than any tutorial ever did.

What Went Surprisingly Well

The boilerplate was genuinely impressive. In about 2 hours, I had:

  • A fully typed Express.js API with 12 endpoints
  • Zod validation schemas for every route
  • A Prisma schema with proper relations
  • Docker compose setup with Postgres and Redis

The code looked clean. Tests passed. I was feeling like a 10x developer.

The Cracks Started Showing

Bug #1: Silent Type Coercion

The AI generated this validation:

const userSchema = z.object({
  age: z.number(),
});
Enter fullscreen mode Exit fullscreen mode

Looks fine, right? Except the API received ages as strings from the frontend. Zod parsed them fine in development (coercion worked). But in production with stricter mode? NaN everywhere. Users were getting 400 errors on signup.

Fix: z.coerce.number().int().positive() — but I had to find all 23 instances manually.

Bug #2: The N+1 Query Nobody Asked For

For a dashboard endpoint that listed users with their orders and order items, the AI generated:

const users = await prisma.user.findMany();
for (const user of users) {
  user.orders = await prisma.order.findMany({ where: { userId: user.id } });
}
Enter fullscreen mode Exit fullscreen mode

Classic N+1. The Prisma docs literally have a page titled "How to avoid N+1 queries." With 500 users, this endpoint made 501 database queries and took 8 seconds.

Fix: include with nested relations — one query, 120ms.

Bug #3: Race Conditions in Token Refresh

The AI wrote a token refresh flow that looked perfect in isolation. But under load, concurrent refresh requests would invalidate each other's tokens. The AI's solution? "Add a retry mechanism." My solution? "Use a refresh token rotation pattern that handles concurrency properly."

Bug #4: The Error Handler That Swallowed Everything

catch (error) {
  console.log("Error:", error);
  res.status(500).json({ error: "Something went wrong" });
}
Enter fullscreen mode Exit fullscreen mode

console.log doesn't serialize Error objects properly. Every production error was just {} in the logs. We ran like this for 3 days before anyone noticed.

Fix: console.error with proper error serialization and a proper logging library (we went with Pino).

The Real Problem

Here's what I learned: AI generates code that's correct in isolation but fragile in context.

It doesn't know:

  • Your deployment architecture (so it misses N+1 queries)
  • Your traffic patterns (so it ignores race conditions)
  • Your logging infrastructure (so it uses the wrong logger)
  • Your team's conventions (so it mixes patterns)

The generated code passes tests because tests are narrow. It compiles because the syntax is valid. But production is where context matters.

What I Changed

  1. AI writes the first draft, humans write the final version. I'm not going back to writing everything from scratch, but every PR now requires a manual review of control flow, error handling, and data access patterns.

  2. Architecture decisions stay human. Schema design, caching strategy, and error handling patterns are too context-dependent to outsource.

  3. Add integration tests that AI can't fake. Unit tests pass. Integration tests reveal the gaps. We added a test suite that runs the full API against a real Postgres instance.

  4. Observability from day one. Structured logging, request tracing, and error tracking are now part of the project template, not an afterthought.

The Bottom Line

AI didn't break my project. My assumption that "generated code equals production-ready code" did.

AI is an incredible force multiplier when used as a pair programmer. It's a liability when treated as a replacement for engineering judgment.

The week cost me 3 extra days of debugging, but I shipped a more robust system than I would have built alone — because the AI's mistakes taught me where my own blind spots were.

Use AI. But keep your hands on the wheel.


Have you had similar experiences with AI-generated code? I'd love to hear your war stories in the comments.

Top comments (0)