Andy Staudinger

Posted on Jun 8

Vibe Coding in 2026: What Actually Works (and What Will Burn You)

#webdev #programming #ai #beginners

I've been shipping client projects with Claude Code, Copilot, Cursor, and Aider since 2024. This is the honest version — no marketing demos. It's a condensed adaptation of my full German guide, Vibe Coding 2026.

In February 2025, Andrej Karpathy coined "vibe coding": fully giving in to the vibes and forgetting the code even exists. A year and a half later, the numbers are in. According to the Stack Overflow Developer Survey 2025, 84% of developers use or plan to use AI tools, but positive sentiment dropped from over 70% to 60%. The top frustration, named by 66% of developers: AI solutions that are "almost right, but not quite."

Both things are true at once. Stripe migrated 10,000 lines of Scala to Java in four days instead of an estimated ten engineering weeks. And OWASP just added vibe coding as an awareness item to its Top 10 — not as a new anti-pattern, but as a new risk class.

Here's what I've learned actually separates the two outcomes.

Three modes, one buzzword

People mean very different things by "vibe coding," and the risk profile differs wildly:

1. Auto-complete: the AI suggests lines, you decide. Standard in 2026, basically harmless.
2. Pair coder: the AI writes whole features on instruction, you review. Mainstream and manageable if you take reviews seriously.
3. True vibe coding: the AI builds on command, you check at the end. This is where almost all the incidents happen.

Most "AI coding is dangerous" discourse conflates mode 3 with modes 1–2. Most "10x productivity" marketing conflates modes 1–2 with mode 3.

Where it reliably delivers

In our projects, these gains are reproducible, not anecdotal:

Boilerplate disappears. CRUD endpoints, form validation, simple UI components: hours become minutes.
Test coverage becomes realistic. Pointed at tests specifically, LLMs write clean, readable test files. The activation energy for testing drops massively.
Refactorings nobody wanted to do actually happen. Renaming across 50 files, swapping a library, changing a data flow consistently.
Language barriers crumble. Rust, Go, Elixir become workable even if you don't write them daily.

For well-defined tasks we see 2–5x. For architecture decisions and deep debugging it's more like 1.1–1.3x. The METR survey from May 2026 (349 technical staff) found self-reported gains of 1.4–2x — with the explicit caveat that self-reports deserve skepticism.

Where it will burn you

Hallucinated packages → slopsquatting. LLMs reproducibly hallucinate package names that sound real. Attackers register exactly those names on npm and PyPI with malicious payloads (research by Lasso Security). If your agent suggests a package and you npm install without checking the registry and maintainer history, you may be importing the attack. This happened to my team once with a typo-adjacent npm library that shipped telemetry. Now: registry check before every install, no exceptions.
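To make "registry check before every install" concrete, here is a minimal sketch in Python. It assumes the public npm registry document at registry.npmjs.org, which exposes `time.created`, `versions`, and `maintainers` for a package. The function names and the thresholds (90 days, 3 versions) are my own illustrative assumptions, not a standard, and a clean result does not prove a package is safe.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_npm_metadata(name: str) -> dict:
    """Fetch the registry document for a package (network call).

    Raises urllib.error.HTTPError for unknown packages, which is itself
    a strong hint that the agent hallucinated the name.
    """
    url = "https://registry.npmjs.org/" + urllib.parse.quote(name, safe="@")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def registry_warnings(meta: dict, now: datetime,
                      min_age_days: int = 90,
                      min_versions: int = 3) -> list[str]:
    """Return human-readable red flags; thresholds are assumptions."""
    warnings = []

    created_raw = meta.get("time", {}).get("created")
    if created_raw is None:
        warnings.append("no creation date in registry metadata")
    else:
        # Registry timestamps end in "Z"; normalize for older Pythons.
        created = datetime.fromisoformat(created_raw.replace("Z", "+00:00"))
        if now - created < timedelta(days=min_age_days):
            warnings.append(f"package is younger than {min_age_days} days")

    if len(meta.get("versions", {})) < min_versions:
        warnings.append(f"fewer than {min_versions} published versions")

    if not meta.get("maintainers"):
        warnings.append("no maintainers listed")

    return warnings
```

In practice you would call `registry_warnings(fetch_npm_metadata(name), datetime.now(timezone.utc))` before running the install and stop for a manual look whenever the list is non-empty.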

Prompt injection against coding agents. Anthropic's own red-teaming numbers (published May 2026): a single injection attempt succeeds ~0.1% of the time, but after 100 adaptive attempts the success rate climbs to 5–6%. Coding agents are uniquely exposed because they typically hold file access, network access, and shell execution — Simon Willison's "lethal trifecta." A documented incident this February: a phishing setup got an agent to read ~/.aws/credentials, encode it, and POST it to an external endpoint — 24 out of 25 attempts succeeded.
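One way to shrink the file-access leg of that trifecta is to route every agent file read through a guard. The sketch below is an assumption about how such a guard could look, not how any particular agent implements it: `is_read_allowed`, the workspace convention, and the denied-name list are all hypothetical. It normalizes paths lexically and does not resolve symlinks, so it complements a real sandbox rather than replacing one.

```python
import posixpath
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical deny list: file names that should never reach the model.
DENIED_NAMES = (".env", ".env.*", "credentials", "*.pem", "id_rsa*")


def is_read_allowed(requested: str, workspace: str) -> bool:
    """Allow reads only inside the workspace and never for secret-like names."""
    # Reject home-directory shortcuts outright instead of expanding them.
    if requested.startswith("~"):
        return False

    root = PurePosixPath(posixpath.normpath(workspace))
    # join() keeps absolute requests absolute; normpath() collapses "..".
    target = PurePosixPath(
        posixpath.normpath(posixpath.join(workspace, requested))
    )

    inside_workspace = target == root or root in target.parents
    if not inside_workspace:
        return False

    return not any(fnmatchcase(target.name, pat) for pat in DENIED_NAMES)
```

With a workspace of `/work/repo`, a request for `src/app.py` passes, while `~/.aws/credentials`, `../../home/u/.aws/credentials`, and a project-local `.env` are all refused.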

Test cheating. You ask the AI to fix a failing test. It "fixes" the test instead of the code. Green pipeline, bug still in production. We now enforce a rule: in a bugfix commit, tests may only be extended, never weakened — checked by a pre-commit hook.

Spec drift. The AI starts on your requirement and quietly wanders off. An hour later you have a feature that doesn't solve what you asked. We once shipped a refactor where the old search function survived in three call sites — old and new code ran side by side in production for four weeks. Mandatory grep for old naming after every larger refactor.
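The mandatory grep is easy to script so it can run in CI as well. This is a minimal sketch under my own assumptions: `find_leftovers`, the suffix list, and the skipped directories are hypothetical, and `search_legacy` in the usage note stands in for whatever the old function was called.

```python
import re
from pathlib import Path

# Directories that usually hold vendored or generated code (assumption).
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}


def find_leftovers(root, old_name, suffixes=(".py", ".ts", ".tsx", ".js")):
    """Return (relative path, line number, line) for each whole-word hit."""
    root = Path(root)
    # Word boundaries avoid matching longer names that merely contain old_name.
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(rel), lineno, line.strip()))
    return hits
```

Calling `find_leftovers(".", "search_legacy")` after the refactor and failing the build on a non-empty result would have caught the three surviving call sites on day one.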

What mature orgs are doing

Uber capped AI tooling at $1,500/month per employee per tool (Bloomberg, June 2026). Token spend is the new cloud compute.
Stripe's 4-day migration wasn't raw vibes: every generated function ran against the old system's tests. Where tests were missing, Claude wrote them first, against the legacy code, then migrated. Tests as spec, migration as gap-filling.
SQLite added an AGENTS.md stating "SQLite does not accept agentic code" after being flooded with plausible-but-wrong AI bug reports. The Ladybird browser stopped accepting public PRs entirely — AI PRs look substantial without representing substantial effort, which breaks the trust model code review was built on.

The workflow that keeps quality up

A typical feature session for us runs 30–90 minutes:

5–10 min: write the spec. Acceptance criteria, data model, technical constraints. One screen max. "Build a login with email + password, bcrypt cost 12, HTTP-only cookie, 7-day session, rate limit 5 attempts / 15 min / IP" beats "build me a login" by hours of saved iteration.
5 min: curate context. Pin the relevant files and docs. Don't dump the repo — the lost-in-the-middle effect is real.
15–40 min: implement, tests first. The AI writes the test, you review it (fast), then it writes the implementation. If it later wants to change the test to fit its code, that's a block in review.
5–15 min: read the diff. Every changed line. "Done" with 47 touched files, 30 of them unrelated, is drift — not done.
5 min: green tests, lint, commit. Commit message written by a human, matching the actual diff.
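The diff-reading step gets easier when unrelated files are surfaced automatically. A minimal sketch, assuming the spec names the paths a feature is allowed to touch: the function name and the glob convention are hypothetical, and `fnmatchcase` lets `*` match across `/`, so `src/auth/*` also covers nested files.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_globs):
    """Return the changed files that match none of the allowed globs."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, glob) for glob in allowed_globs)
    ]


# Hypothetical session: the spec said "login only".
changed = [
    "src/auth/login.py",
    "tests/auth/test_login.py",
    "src/billing/invoice.py",
    "README.md",
]
drift = out_of_scope(changed, ["src/auth/*", "tests/auth/*"])
```

Here `drift` contains `src/billing/invoice.py` and `README.md`. The list of changed files can come from `git diff --name-only`, and anything that lands in `drift` is either reverted or justified before the commit.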

Plus the standing infrastructure: an AGENTS.md/CLAUDE.md in every repo (stack, conventions, what NOT to do — the single highest-leverage artifact we know), sandboxing for agents, auto-mode off for sensitive actions, and auth/payment code written mostly by hand. Auth is the one place where AI tools still produce outdated patterns and insecure defaults too often.

When to vibe, when not to

The rule of thumb that has served us well: the lower the consequence of a bug and the shorter the code's lifespan, the better vibe coding fits.

Prototypes, throwaway scripts: go wild.
MVPs: yes, with tests — then professional hardening before production.
Internal tools for 5 users: yes, with reviews.
B2B SaaS: partially, with supervision.
Auth, payments, regulated domains (medical, legal, finance), critical infrastructure: no. Classical engineering discipline, AI-assisted at most.

The bottom line

The question in 2026 isn't "AI: yes or no." It's "how cleanly." Vibe coding without engineering discipline is deferred technical debt with interest. Vibe coding with tests, reviews, sandboxing, and clear use-case boundaries is the most productive way to build software right now.

The full guide (in German) covers tool comparisons with current pricing, GDPR considerations for sending client code to cloud LLMs, the job-market impact, and a complete checklist: Vibe Coding 2026 — Chancen, Risiken, Praxis.

I'm Andy — full-stack freelancer building SaaS with Next.js, Flutter, and Supabase, based in Germany and Cyprus. Happy to answer questions in the comments.

Top comments (1)

Alex Shev • Jun 8

The part that matters is the feedback loop. Vibe coding works when the human keeps narrowing scope, running checks, and deciding what "done" means. Without that loop, it turns into very fast guessing.
