
Shiplight

Posted on • Originally published at shiplight.ai

Vibe Coding Is Fun Until Production Breaks

Vibe coding is exactly what it sounds like: you describe what you want, your AI coding agent writes the implementation, and you ship it. No wrestling with boilerplate, no context-switching into unfamiliar APIs, no debugging stack traces line by line. Just intent → code → deploy.

It is genuinely fast. Teams that have adopted AI-first development workflows report shipping features in hours that previously took days. The experience is intoxicating.

The problem shows up in production. Not always immediately, not always dramatically — but consistently. A checkout flow that worked in the demo breaks for users in a specific browser. An edge case in the new auth logic causes silent failures. A UI component that the agent refactored now behaves differently when the viewport changes. The AI wrote correct code for the happy path, but nobody verified the full surface area.

This is the vibe coding quality gap: the speed gain is real, but the verification step got left out.

What Vibe Coding Actually Skips

Traditional software development has a built-in quality loop. Developers write code, run tests, review diffs, and iterate before shipping. Each step adds friction — but that friction catches bugs.

Vibe coding compresses this loop dramatically. The agent writes the code, you review a high-level summary, and the diff goes out. The problem is that the review step scales poorly with the agent's output. A human can meaningfully review 50 lines of code. Reviewing 500 lines of agent-generated implementation across five files is a different task entirely.

What actually gets skipped in most vibe coding workflows:

  • End-to-end verification — does the feature actually work from a user's perspective?
  • Regression coverage — did the agent's changes break something it wasn't supposed to touch?
  • Edge case validation — what happens with empty states, network failures, or unexpected inputs?
  • Cross-browser consistency — did the agent's CSS choices work everywhere?

These are not hypothetical concerns. Research on AI-generated code quality consistently shows that AI-written code introduces bugs at higher rates than carefully reviewed human code — not because the models are bad, but because the verification loop is truncated.

The Speed Trap

Here is the dynamic that makes vibe coding quality gaps compound over time.

When you ship fast and something breaks, the natural response is to have the agent fix it. The agent patches the bug, you ship the patch, and you move on. This works fine for isolated issues. But over weeks and months, an unverified codebase accumulates a debt of untested edge cases. Each fix potentially introduces new issues. The agent has no memory of what it previously changed or why.

Without a persistent test suite, you have no ground truth. You cannot tell whether the latest agent commit made things better or worse in aggregate. You only find out when a user reports something.

This is not a problem with the AI coding agents themselves — they are doing exactly what they were designed to do. It is a workflow design problem. The quality layer was never added.

Adding QA to Your Vibe Coding Workflow

The good news is that vibe coding and comprehensive testing are not in conflict. The same agents that write your application code can be directed to write tests, run verifications, and maintain a quality gate — if you give them the right tools.

Step 1: Give your agent a browser

The most immediate gap in vibe coding workflows is live browser verification. Your agent can write a component, but it cannot see what that component looks like or how it behaves without a browser.

Shiplight's browser MCP server gives your AI coding agent eyes and hands in a real browser. During development, the agent can open your application, navigate through the new feature, and verify that what it built actually works — before the code leaves your machine.

This closes the most common vibe coding failure mode: code that passes linting and type checks but fails in practice.
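Wiring an MCP server into a coding agent usually means adding an entry to the agent's MCP configuration file. The sketch below shows the shape of such an entry; the server name and package are illustrative placeholders, not Shiplight's documented values — check the Shiplight docs for the real command:

```json
{
  "mcpServers": {
    "shiplight-browser": {
      "command": "npx",
      "args": ["-y", "@shiplight/mcp"]
    }
  }
}
```

Once registered, the agent can call the server's browser tools (navigate, click, inspect) the same way it calls any other MCP tool.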

Step 2: Capture verifications as regression tests

Every time your agent verifies a feature in the browser, that verification can become a permanent test. Shiplight converts browser interactions into YAML test files that live in your repo and run automatically in CI.

These are not brittle tests that break every time your UI changes. The tests are written against the intent of each step ("Click the submit button", "Verify the confirmation message appears"), not against specific DOM selectors. When your agent makes future changes, the tests adapt rather than fail on superficial differences.
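As an illustration, an intent-based test might look like the sketch below. The schema is hypothetical — step names and file layout are invented for this example, not Shiplight's documented format — but it shows the key idea: every step is a user-level intent, with no CSS selectors or element IDs anywhere in the file.

```yaml
# checkout.test.yaml — hypothetical intent-based test sketch
name: Checkout confirmation
steps:
  - navigate: /cart
  - click: "the Checkout button"
  - fill:
      field: "email address"
      value: "test@example.com"
  - click: "the Submit button"
  - verify: "the order confirmation message appears"
```

Because nothing in the file names a DOM node, the agent can rename classes or restructure markup without invalidating a single step.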

Step 3: Run tests on every agent commit

Once you have a test suite, wire it into your CI pipeline so every agent-generated commit gets verified before merge. Shiplight's GitHub Actions integration makes this a one-time setup.
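A minimal CI wiring looks like a standard GitHub Actions workflow triggered on pull requests. The sketch below is illustrative — the Shiplight action name and inputs are assumptions, not the documented integration:

```yaml
# .github/workflows/e2e.yml — sketch; action name and inputs are hypothetical
name: E2E tests
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Shiplight tests
        uses: shiplight/run-tests@v1   # placeholder action reference
        with:
          tests: tests/**/*.test.yaml
```

With `on: [pull_request]` and branch protection requiring the job to pass, a failing test blocks the merge automatically.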

The result: your agent can ship code at full vibe coding speed, and you get a regression gate that catches problems before they reach production.

The Intent-Cache-Heal Pattern for Vibe Coders

Traditional test automation breaks constantly because tests are tied to implementation details — specific CSS selectors, DOM structure, element IDs — that agents change freely. This is why most vibe coding teams do not bother with E2E tests: the maintenance burden exceeds the value.

The intent-cache-heal pattern solves this. Tests describe what the user is trying to accomplish, not how the UI is currently built. When your agent restructures a component, the test heals automatically because the intent has not changed — only the implementation.

This is the missing piece that makes comprehensive testing compatible with vibe coding's pace. You are not maintaining tests after every agent commit. The tests maintain themselves.
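The pattern described above can be sketched in a few lines. This is a toy model, not Shiplight's implementation: the class, the page representation (a dict of selector → accessible label), and the resolver are all invented for illustration. The structure is the point — try the cached selector first, and on a cache miss, re-resolve the element from the recorded intent and update the cache.

```python
def resolve_by_intent(page, intent):
    # Stand-in resolver: a real system would use an LLM or semantic
    # matcher to map the intent to an element on the live page.
    for selector, label in page.items():
        if intent.lower() in label.lower():
            return selector
    raise LookupError(f"no element matches intent: {intent!r}")

class IntentStep:
    """One test step: a user intent plus a cached selector."""

    def __init__(self, intent):
        self.intent = intent
        self.cached_selector = None  # fast path once resolved

    def locate(self, page):
        # 1. Cache hit: the selector still exists, use it directly.
        if self.cached_selector in page:
            return self.cached_selector
        # 2. Cache miss: the UI changed, so heal by re-resolving
        #    from the intent and caching the new selector.
        self.cached_selector = resolve_by_intent(page, self.intent)
        return self.cached_selector

# The UI before and after an agent refactor: selector changes,
# intent ("Submit order") stays the same, so the step heals.
step = IntentStep("Submit order")
page_v1 = {"#submit-btn": "Submit order"}
page_v2 = {"button.checkout-submit": "Submit order"}
print(step.locate(page_v1))  # resolves and caches "#submit-btn"
print(step.locate(page_v2))  # heals to "button.checkout-submit"
```

The cache is what keeps this cheap: on a stable UI the test never pays the resolution cost, and only a genuine structural change triggers the (slower) intent-matching path.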

What a Vibe Coding + QA Workflow Looks Like

A practical workflow looks like this:

  1. Describe the feature to your agent (Claude Code, Cursor, Codex, or any MCP-compatible agent)
  2. Agent implements the feature and opens it in a real browser via the Shiplight MCP server
  3. Agent verifies the feature works end-to-end and documents the verification as a YAML test
  4. CI runs the test suite on the pull request — any regressions block the merge
  5. Agent fixes flagged issues with the context from the test failure output
  6. Merge with confidence — the full feature surface is verified

The agent handles steps 2, 3, and 5; CI handles step 4. Your job is to define the intent and review the evidence. That is what vibe coding should feel like.

Frequently Asked Questions

What is vibe coding?

Vibe coding is a development style where developers use AI coding agents to write code by describing intent in natural language. The AI agent handles implementation while the developer focuses on what the product should do rather than how to build it.

Why does vibe coding produce bugs?

Vibe coding itself does not produce more bugs than traditional development — but the truncated review cycle means bugs are caught later. AI coding agents write for the specified requirements and may miss edge cases, cross-browser differences, or regressions in code they did not explicitly touch.

Can AI agents write their own tests?

Yes. With the right tooling, AI coding agents can generate tests automatically from their own verifications. Shiplight's MCP server lets agents verify features in a real browser and capture those verifications as self-healing YAML test files that live in your repo.

Does adding tests slow down vibe coding?

Not significantly, when tests are generated automatically by the agent rather than written by hand. The overhead is a one-time CI setup. After that, tests run in the background and only interrupt the workflow when a real regression is found.

How do self-healing tests work with frequently changing UIs?

Self-healing tests are written against the intent of each user action, not specific DOM selectors. When the UI changes, the test framework resolves the correct element by matching the described intent to the current page state. See What Is Self-Healing Test Automation for a full explanation.


References: Playwright Documentation, GitHub Actions documentation
