The hype is fading, the hallucinations are dropping, and the robots finally feel like teammates instead of toddlers with keyboards.
☕️ Quick Sip Summary
- New models, fewer face-palms. Claude Sonnet 4 (via Cursor), GPT-4.5, and Gemini 2.5 crank hallucinations down to ~15%.
- Speed is real, trust is tricky. A controlled METR study showed a 55% speed boost, but Stack Overflow’s 2025 survey says only 29% of devs actually trust AI output.
- My reality check: AI now handles the boring 80%, but the final 20% is still very human. I’ve never merged a PR without running tests and giving the code a side-eye.
1. The Day My “Intern” Grew Up
Back in early ’24, AI coding tools felt... twitchy. I remember asking it to build a login page and getting something that not only skipped validation but practically whispered "hey, let’s hardcode a password for fun."
But I didn’t quit on it. I kept using it — more cautiously at first — learning where it helped and where it hallucinated. It was like watching a junior dev slowly grow up. The suggestions started making more sense. The bugs showed up less often. And somewhere between dozens of commits and a few thousand prompts, I realized: I was trusting it more than I thought.
Then came May 2025. Cursor integrated Claude Sonnet 4, and everything clicked. I typed: “Scaffold a Nuxt 3 page listing Stripe invoices with a Tailwind table.”
Two lattes later, I had a fully working page: clean props, sensible Tailwind, pagination built-in, no missing imports. It didn’t just look good — it ran.
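For flavor, the data-shaping half of a page like that looks roughly like this. This is my reconstruction, not the model's actual output, and the invoice shape and helper name are hypothetical (a simplified slice of what Stripe returns):

```typescript
// A simplified, hypothetical subset of Stripe's invoice object —
// just the fields the table actually renders.
interface InvoiceRow {
  id: string;
  customer: string;
  amountDue: number; // Stripe reports amounts in integer cents
  status: string;
}

// Turn one Stripe-style invoice into display-ready table cells.
function toTableRow(inv: InvoiceRow): {
  id: string;
  customer: string;
  amount: string;
  status: string;
} {
  return {
    id: inv.id,
    customer: inv.customer,
    // Integer cents → "$49.99"-style string for the Tailwind table.
    amount: `$${(inv.amountDue / 100).toFixed(2)}`,
    status: inv.status.toUpperCase(),
  };
}

const row = toTableRow({
  id: "in_123",
  customer: "Acme",
  amountDue: 4999,
  status: "open",
});
console.log(row.amount); // "$49.99"
```

The point isn't the ten lines themselves; it's that the generated version got the cents-to-dollars detail right without being told.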
Why the glow-up?
- The AI Hallucination Benchmark 2025 clocks GPT-4.5 and Sonnet 4 at ~15% hallucinations (down from over 50%).
- The METR study shows AI can make devs 55% faster on focused tasks.
That aligned with what I was feeling: the tool had matured — and maybe so had I.
2. Where AI Shines — And Where It Slips
After months with Sonnet 4, here’s what’s felt like magic — and what still gives me pause:
What works well:
- Boilerplate scaffolding (components, DTOs)
- Writing unit tests
- Repo Q&A like “Where do we parse JWTs?”
- Project-wide refactors
Where I don’t trust it (yet):
- Security-critical flows like auth and crypto
- Perf-sensitive logic
- Legacy spaghetti with zero documentation
- Tasks I haven’t mentally designed yet
If I don’t know exactly what I want, the model will happily hallucinate an entire fantasy architecture. Then I get to debug my own laziness at 2 a.m.
3. My Four-Step Prompt Ritual 🙏
Here’s how I talk to the model now:
- Set the scene. “You're a senior dev experienced in Nuxt and Stripe.”
- Describe the goal. “Implement server-side pagination for /api/invoices.”
- Set the stack. “Nuxt 3, Prisma, PostgreSQL, limit 50 rows, return totalCount.”
- Guide the scope. “Please outline the steps only.”
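Put together, the ritual reads as one prompt (an assembled example, not a transcript):

```
You're a senior dev experienced in Nuxt and Stripe.
Goal: implement server-side pagination for /api/invoices.
Stack: Nuxt 3, Prisma, PostgreSQL. Limit 50 rows per page, return totalCount.
Please outline the steps only — no code yet.
```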
From there, I review its plan, give feedback, and go step-by-step through implementation. It feels like pair programming — minus the headphone tugging.
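To ground that pagination example: once we get past the outline stage, the core of the plan tends to land on query math like this. A minimal sketch assuming the Prisma/PostgreSQL stack above — the name `buildPageQuery` is mine, not from any generated code:

```typescript
const MAX_PAGE_SIZE = 50; // the hard cap from the prompt: "limit 50 rows"

interface PageQuery {
  skip: number; // rows to skip (Prisma's offset)
  take: number; // rows to fetch
}

// Translate 1-based page params into a Prisma-style skip/take pair,
// clamping the requested size to the 50-row cap and floor-guarding bad input.
function buildPageQuery(page: number, pageSize: number): PageQuery {
  const take = Math.min(Math.max(1, Math.floor(pageSize)), MAX_PAGE_SIZE);
  const safePage = Math.max(1, Math.floor(page));
  return { skip: (safePage - 1) * take, take };
}

// The handler would then run two queries and return totalCount with the rows:
//   const [rows, totalCount] = await Promise.all([
//     prisma.invoice.findMany({ ...buildPageQuery(page, size) }),
//     prisma.invoice.count(),
//   ]);

console.log(buildPageQuery(3, 50)); // { skip: 100, take: 50 }
```

Keeping the clamp in one pure function is exactly the kind of step I want the model to propose before it writes any handler code — it's trivially testable without touching the database.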
4. Trust Issues: Everyone’s Using It, Nobody’s Sleeping Easy
- 80% of devs use AI tools (according to Stack Overflow 2025).
- Only 29% trust the output unedited.
Reddit is full of rants like “Management wants 20% of commits from Copilot.” One post even mentioned execs tracking prompt counts per day.
That’s not how I roll. I’d rather measure features shipped, not lines of AI-assisted code.
5. The Numbers Don’t Lie
One recent feature:
- Old-school dev time: ~3h 45m
  - 45m for scaffolding
  - 120m for core logic
  - 60m for tests
- With Cursor + Sonnet 4: ~1h 45m
  - 5m prompt for scaffolding
  - 90m for prompts + tweaks on core logic
  - 10m for tests
That’s two hours saved — enough for a gym session, or let’s be real, another coffee and a doomscroll.
6. Five Things I Wish I’d Known Sooner
- Design before you prompt. AI isn’t great at mind-reading.
- Break it up. Big tasks become small, accurate prompts.
- Test everything. No green CI, no merge.
- Sleep on major merges. AI optimism is real — and sneaky.
- Don’t ditch juniors. Pair them with AI, then make them explain every line.
7. So… Should You Trust the Robot?
Yes — but only like you’d trust a hyper-literal intern. Brilliant with grunt work. Hopeless with nuance. When I give it structure and oversight, it makes me faster. When I hand it the wheel, it usually drives into a wall of undefined variables.
Your Turn
Are you vibing with the new generation of AI dev tools? Or still fighting ghosts in your PRs?
Drop a story below — horror or happy ending. And if enough folks ask, I’ll share my prompt cheat sheet in the next post.
This article was originally published on Medium:
AI Coding Assistants No Longer Hallucinate — If You Know What You’re Doing