SAR

Posted on Jul 3

I Tried Building a Real App With AI Agents — The Good, The Bad, and The Hallucinated

#ai #webdev #programming #javascript

You know that feeling when you watch a demo video where someone types "build me a SaaS app" into an AI agent and it spits out a fully functional product in 30 seconds? Yeah, I fell for it too. For about a week.

Then I actually tried using AI coding agents on a real project — not a Todo app, not a counter component, but an actual multi-service app with auth, payments, and a database. And let me tell you, the gap between "demo" and "production" is the size of the Grand Canyon.

Here's what actually happened when I let AI agents drive my development for two weeks.


The Setup

I'm building a platform that connects freelancers with clients — pretty standard stuff. Next.js 16 frontend, Node.js backend, PostgreSQL, Redis for caching, Stripe for payments. Nothing crazy, but it's got enough moving parts that a single developer needs weeks to wire it all up.

My stack of agents was:

Claude Code (terminal agent — Anthropic's CLI)
GitHub Copilot Agent Mode (in VS Code)
Cursor Agent (the Composer/Agent mode)
OpenAI Codex CLI (the new kid on the block)

I ran each on the same set of tasks and compared the results. Not a scientific lab test — just real-world "can you build this feature?" vibes.

The Good — What Actually Worked

Boilerplate Generation Is Basically Solved

Every single agent handled CRUD generation like a champ. I asked each one to create a new "projects" feature — database schema, API routes, TypeScript types, and a basic UI. All four returned working code that compiled on the first try.

Claude Code was the fastest here, generating about 400 lines across 6 files in roughly 90 seconds. Cursor was a close second at 2 minutes. Copilot Agent took about 3 minutes but included detailed error handling. Codex CLI generated solid code but needed a second pass to match my project's existing patterns.

The thing is, boilerplate is where AI shines. The job is to read docs, understand patterns, and generate consistent code. That part is genuinely useful and saves me probably 3-4 hours per feature.

Unit Tests — Surprisingly Good

I expected AI agents to be terrible at tests because they don't know my project's test setup. But honestly? They crushed it.

Cursor figured out I was using Vitest with testing-library from a single file, then matched the pattern across every new test it wrote. The tests weren't perfect — some edge cases were missing — but they provided solid coverage for the happy path and basic error cases.

Copilot Agent was the best here because it could look at my existing test files and replicate the exact patterns. All mocks in __mocks__ directories, same assertion style, same describe/it structure. It felt like pair programming with a junior who actually pays attention to conventions.

Database Schema Design

This surprised me. I expected garbage, but all four agents generated reasonable PostgreSQL schemas with proper indexes, foreign keys, and even migration files.

Claude Code asked clarifying questions before writing the schema ("What's the relationship between projects and users?"), which caught me off guard. The other three just wrote what they thought was right, and honestly, they were close enough that I only had to tweak one foreign key constraint.

The Bad — Where Things Got Messy

Dependency Hell Is Real

Here's the thing nobody talks about in those demos: AI agents LOVE installing packages. And they don't clean up after themselves.

Codex CLI installed React Router in my Next.js project. Twice. Claude Code pulled in three different UUID libraries because it couldn't decide which one I was already using. Cursor kept trying to install Prisma even though I'm using Drizzle ORM.

I spent nearly two hours auditing dependencies after the first round of features. The package.json looked like a yard sale — random packages scattered everywhere, no clear pattern, multiple libraries doing the same thing.

The lesson? Pin your tech stack in a CLAUDE.md or .cursorrules file. Tell the agent exactly what ORM, what styling solution, what HTTP client you're using. Otherwise it'll guess, and its guesses tend toward "install everything just in case."
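A dependency audit like the one above can be partly automated. Here's a rough sketch, assuming you keep your own list of "packages that do the same job." The group names, package lists, and version strings are illustrative placeholders, and findOverlaps is a hypothetical helper, not part of any tool mentioned here.

```typescript
// Hypothetical groups of packages that solve the same problem.
// Adjust the names and members to match your own stack.
const OVERLAP_GROUPS: Record<string, string[]> = {
  orm: ["drizzle-orm", "prisma", "@prisma/client", "typeorm"],
  uuid: ["uuid", "nanoid", "cuid"],
  http: ["axios", "got", "ky", "node-fetch"],
};

type Dependencies = Record<string, string>;

// Returns every group in which more than one package is installed.
function findOverlaps(
  deps: Dependencies,
  groups: Record<string, string[]> = OVERLAP_GROUPS,
): Record<string, string[]> {
  const overlaps: Record<string, string[]> = {};
  for (const group of Object.keys(groups)) {
    const installed = groups[group].filter((name) => name in deps);
    if (installed.length > 1) {
      overlaps[group] = installed;
    }
  }
  return overlaps;
}

// Example input: the "dependencies" block of a package.json after a few
// agent sessions (version strings are placeholders).
const exampleDeps: Dependencies = {
  next: "x.y.z",
  "drizzle-orm": "x.y.z",
  prisma: "x.y.z",
  uuid: "x.y.z",
  nanoid: "x.y.z",
};

// Reports the "orm" and "uuid" groups, since each has two installed packages.
console.log(findOverlaps(exampleDeps));
```

Running something like this after each agent session (or in CI) flags duplicates while they're still cheap to remove.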

The Hallucination Problem

This one's obvious but worse than I expected.

Cursor generated a Stripe webhook handler that referenced a stripe.webhooks.constructEventFromPayload method — which doesn't exist. The actual method is stripe.webhooks.constructEvent. I caught it because I've done Stripe integration before, but a junior developer would absolutely ship that to production and wonder why webhooks were silently failing in staging.
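For reference, here's a minimal sketch of a handler built around the real call shape, constructEvent(payload, signatureHeader, secret), which throws when the signature doesn't verify. To keep it runnable on its own, WebhooksLike and WebhookEvent are stand-in types I made up for illustration; in a real project you'd pass stripe.webhooks from the stripe package, and the payload has to be the raw request body. Typing the handler against an interface also means a made-up method like constructEventFromPayload fails at compile time instead of at runtime.

```typescript
// Stand-in shapes so the sketch runs without the Stripe SDK (assumption:
// in real code, `webhooks` would be `stripe.webhooks`).
interface WebhookEvent {
  id: string;
  type: string;
}

interface WebhooksLike {
  // Mirrors the call shape of stripe.webhooks.constructEvent, which throws
  // when the signature does not match.
  constructEvent(payload: string, header: string, secret: string): WebhookEvent;
}

interface HandlerResult {
  status: number;
  body: string;
}

function handleStripeWebhook(
  webhooks: WebhooksLike,
  rawBody: string,
  signatureHeader: string | undefined,
  webhookSecret: string,
): HandlerResult {
  if (!signatureHeader) {
    return { status: 400, body: "Missing Stripe-Signature header" };
  }
  let event: WebhookEvent;
  try {
    event = webhooks.constructEvent(rawBody, signatureHeader, webhookSecret);
  } catch (err) {
    // Surface the failure instead of swallowing it.
    const message = err instanceof Error ? err.message : String(err);
    return { status: 400, body: "Webhook verification failed: " + message };
  }
  return { status: 200, body: "Received " + event.type };
}

// Hypothetical stub used only to exercise the handler locally.
const stubWebhooks: WebhooksLike = {
  constructEvent(payload, header, secret) {
    if (header !== "valid-signature" || secret === "" || payload === "") {
      throw new Error("signature mismatch");
    }
    return { id: "evt_example", type: "payment_intent.succeeded" };
  },
};

console.log(handleStripeWebhook(stubWebhooks, "{}", "forged", "whsec_example"));
```

The point of returning a 400 with the error message is that a bad signature (or a bad handler) shows up in logs right away rather than failing quietly.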

Codex CLI invented an ENTIRE Redis caching library. Not a real package — it just made one up and wrote code that imported it. The import was @acme/redis-cache and it was supposed to do request deduplication, but of course npm install failed and I spent 20 minutes trying to figure out why before realizing it was a hallucinated package.
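A cheap guard against this kind of thing is to compare what generated code imports against what package.json actually declares before running anything. The sketch below is regex-based and deliberately simple: findUndeclaredImports and packageNameOf are hypothetical helpers, treating "@/" as a local path alias is an assumption about the project setup, and Node built-ins imported without the "node:" prefix would be flagged as missing too.

```typescript
// Extracts the package name from an import specifier, or null for local
// paths, "@/" aliases (assumed to be a project alias), and "node:" built-ins.
function packageNameOf(specifier: string): string | null {
  if (
    specifier.startsWith(".") ||
    specifier.startsWith("/") ||
    specifier.startsWith("@/") ||
    specifier.startsWith("node:")
  ) {
    return null;
  }
  const parts = specifier.split("/");
  // Scoped packages keep two segments: "@scope/name".
  return specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
}

// Lists packages that the source imports but the declared list lacks.
function findUndeclaredImports(source: string, declared: string[]): string[] {
  const importPattern =
    /(?:from\s+|import\s+|import\(\s*|require\(\s*)["']([^"']+)["']/g;
  const missing: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    const name = packageNameOf(match[1]);
    if (name !== null && declared.indexOf(name) === -1 && missing.indexOf(name) === -1) {
      missing.push(name);
    }
  }
  return missing;
}

// Example: generated code that imports a package nobody installed.
const generatedSource = `
import { dedupe } from "@acme/redis-cache";
import Redis from "ioredis";
import { db } from "./db";
`;

// Prints [ '@acme/redis-cache' ] because only ioredis and next are declared.
console.log(findUndeclaredImports(generatedSource, ["ioredis", "next"]));
```

An undeclared import isn't proof of a hallucination, but it's a prompt to check the registry by hand before trusting the code.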

Copilot Agent was the most reliable here — it stayed closest to real APIs. Maybe because Microsoft has better guardrails, or maybe because it's more conservative about what it generates.

Context Windows Fill Up Fast

You know how demos always show a fresh conversation with a simple request? Real projects don't work that way.

By day 3, Claude Code's context window was already too small to hold my entire project structure. It started forgetting how I'd set up the authentication middleware and would suggest patterns that conflicted with existing code.

Cursor handled this better because it has MCP (Model Context Protocol) that can query my codebase without dumping everything into context. But even then, complex features that touched 10+ files would sometimes lose track of the overall architecture.

The workaround? Break features into smaller chunks. Instead of "build the entire messaging system," do "create the database schema for messages," then "write the API routes," then "build the real-time subscription." Each chunk fits in context.

The Ugly — Surprising Behavior

Agents Have Personalities

I know this sounds weird, but different agents genuinely have different "styles."

Claude Code was the most cautious. Before writing a Stripe integration, it stopped and asked if I'd set up the webhook secret in my environment variables. It even suggested I test with the Stripe CLI first. That kind of guardrail is genuinely helpful. You know what I mean?

Cursor was the most aggressive. It would write code, immediately refactor it, then refactor the refactored version. Sometimes I'd review a PR and find three different approaches to the same problem in the same file.

Codex CLI was the most creative but also the most unreliable. It wrote beautiful TypeScript generics that I'd never have thought of, then immediately hallucinated a function signature that didn't match any known API.

Copilot Agent was the most corporate — clean, predictable, boring. Nothing exciting, nothing wrong. If I needed "write a REST endpoint that follows my project's patterns," it delivered every time.

Error Messages Become a Dependency

Here's a meta problem I didn't expect: I started relying on the agents to read error messages for me. If a test failed, I'd paste the error into the agent instead of reading it myself. That's terrible for my own skill development.

I caught myself doing this on day 8 and forced a reset. Now I read the error first, try to fix it myself, and only ask the agent if I'm stuck for more than 10 minutes. The agents got better at debugging errors than I am, which is useful, but also scary.

What I'd Do Differently

If I had to run this experiment again, here's my playbook:

Write a PROJECT.md or CLAUDE.md upfront. List every library, every convention, every architectural decision. The agents respect these files and it cuts hallucinations by maybe 80%.
Never let an agent install packages unsupervised. Review every npm install before it runs. The agents don't know what you already have.
Use agents for blocks, not entire features. A single agent call should produce 50-200 lines, not 2000. Smaller chunks = fewer hallucinations = less debugging.
Read every line of generated code. I know this defeats the purpose of "speed," but the alternative is shipping a hallucinated Stripe method or a fake Redis library. Review everything for the first few weeks until you trust the pattern.
Keep a human in the loop for auth and payments. The agents got payment flows wrong more often than anything else — wrong error handling, missing idempotency, bad edge case handling. These are areas where "close enough" means "you lose real money."
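To make the "missing idempotency" point concrete: payment providers can deliver the same event more than once, so the handler has to apply each event exactly once. Here's a minimal in-memory sketch of that idea. IdempotentLedger, PaymentEvent, and the IDs are hypothetical names for illustration; a real system would more likely record processed event IDs in the database (for example behind a unique constraint) in the same transaction as the balance change.

```typescript
interface PaymentEvent {
  id: string; // unique event ID assigned by the payment provider
  accountId: string;
  amountCents: number;
}

class IdempotentLedger {
  private processed = new Set<string>();
  private balances = new Map<string, number>();

  // Applies the event once; returns false if this event ID was already seen.
  apply(event: PaymentEvent): boolean {
    if (this.processed.has(event.id)) {
      return false;
    }
    this.processed.add(event.id);
    const current = this.balances.get(event.accountId) ?? 0;
    this.balances.set(event.accountId, current + event.amountCents);
    return true;
  }

  balanceOf(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}

const ledger = new IdempotentLedger();
const payment: PaymentEvent = { id: "evt_123", accountId: "acct_1", amountCents: 5000 };

console.log(ledger.apply(payment)); // true: first delivery is applied
console.log(ledger.apply(payment)); // false: redelivery is ignored
console.log(ledger.balanceOf("acct_1")); // 5000, not 10000
```

This is exactly the kind of check that's easy to skip in generated code and expensive to discover later.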

Disclosure: Some of the links in this article are affiliate links. If you purchase through them, I may earn a commission at no extra cost to you. I only recommend products I genuinely find useful.

The Bottom Line

Here's the honest take: AI coding agents in 2026 are incredibly useful but they're not replacing developers anytime soon. What they're good for is:

Eliminating boilerplate — saving 3-4 hours per feature
Generating good first drafts — tests, schemas, basic implementations
Catching patterns — they're great at matching your existing code style
Exploratory coding — trying an approach you wouldn't think of

What they're NOT good for:

Dependency management — they'll install everything
Complex multi-file refactors — context windows are still too small
Production payment flows — the hallucinations get expensive
Replacing your judgment — you still need to review every line

I ended up keeping Cursor Agent and Copilot Agent in my daily workflow. Claude Code comes out for the hard stuff. Codex CLI is promising but not quite there for production work.

Total monthly spend? $30 (Cursor Pro $20 + Copilot Pro $10). For what I'm getting — roughly 2x my output on most days — that's the best deal in software right now.

Have you tried AI coding agents on real projects? I'm genuinely curious what your experience has been — drop a comment and let me know what I got wrong.
