Gabriel Anhaia

Vibe Coding Just Failed Its First Real Audit


January 9, 2026. Sonar published the State of Code Developer Survey: 1,100 working developers, asked what AI-generated code looks like once it lands in their repo. The result that ate the news cycle was a pair of numbers: 96% of developers do not fully trust the functional accuracy of AI-generated code, yet only 48% always verify it before committing. The Register's headline put it bluntly: devs doubt AI-written code, but don't always check it.

Call that the verification gap. The most damning number in the survey is somewhere else.

The number that matters is this: 88% of developers cite negative impacts from AI-generated code, including code that "looks correct but isn't reliable" (53%) and code that is unnecessary or duplicative, per the Sonar report. Pair that with GitHub's Octoverse 2026 figure that 46% of all new code is AI-generated, and the math gets uncomfortable. About half of new code is being produced by a process that more than half of developers say has a "looks-correct-isn't-reliable" failure mode, and only about half of developers are consistently verifying it.

This is the audit result vibe coding failed.

What the survey is actually saying

Sonar's data, taken together with the JetBrains State of Developer Ecosystem 2026 and GitHub Octoverse numbers, tells a coherent story. 90% of developers regularly use at least one AI coding tool, and 92% of US developers do so daily, per JetBrains. From the same report, 63% have, at least once, spent more time debugging AI-generated code than they would have spent writing it themselves.

Two more from Sonar are worth pulling out:

  • Verification effort is rated "moderate" or "substantial" by 59% of developers. The bottleneck moved from writing to reviewing.
  • 88% cite negative downstream impacts (restated from the lead above, but it is the spine of the dataset).

The rough shape: AI lifts the floor on speed, lowers the floor on quality, and shifts the cost into a part of the workflow most teams do not measure. I'll use "vibe coding" here for the failure mode the survey is measuring (the practice of accepting an LLM's output and shipping it without engaging with the details).

AI code is competent at the surface and brittle at the edges. The edges are where production lives.

The four things vibe-coded code consistently misses

Read the audit findings closely (Sonar's, JetBrains', the Hashnode 2026 state-of-vibe-coding writeup) and the failures cluster around four omissions. This is my synthesis of the findings, not a categorization any single survey publishes. They show up across languages, across model versions, across prompts. The model ships the happy path; you ship the rest.

1. Error handling. The function does what was asked, then does nothing about the seven ways the work can fail. Network errors, malformed input, timeouts, partial writes, third-party 5xx: swallowed silently or raised fatally at a layer that cannot recover.

2. Idempotency. "Process this payment" works once, then double-charges when the user refreshes. "Send this email" sends three. The model does not know your endpoint will be retried. Nothing about a happy-path prompt forces it to think about being called twice.

3. Retries (with backoff and circuit-breaking). External calls go straight, no retry. Or retries exist but hammer the upstream with for _ in range(5): try: ... until quota dies. A real retry policy: exponential backoff, jitter, max-attempts, dead-letter. Almost none of that appears unless the prompt explicitly demands it.

4. Observability. No structured log, no trace span, no metric, no error-tagged exception. The function works fine in dev, runs blind in production, and when it breaks the on-call has only the stack trace to go on.

These four are not exotic. They are the difference between code that "works on my machine" and code that survives a Tuesday at 2pm. The model omits them for three reasons. The prompt did not ask. The training data is dominated by tutorial-grade snippets that also omit them. Nothing in the vibe-coding loop forces the omission to be visible before merge.

The 50 lines a model gives you

Here is the kind of code a vibe-coding session produces when you ask "write a function that refunds an order through the orders service and stores a refund record in our DB." Illustrative, not lifted from a real session, but the shape is the shape.

# Vibe-coded version. Looks fine on day one.
import os
import requests
import psycopg2

ORDERS_URL = os.environ["ORDERS_URL"]
DB_DSN = os.environ["DB_DSN"]


def refund_order(order_id: str, amount_cents: int, reason: str) -> dict:
    r = requests.get(f"{ORDERS_URL}/orders/{order_id}")
    order = r.json()

    if order["status"] != "paid":
        raise Exception("Order not refundable")

    r = requests.post(
        f"{ORDERS_URL}/orders/{order_id}/refund",
        json={"amount_cents": amount_cents, "reason": reason},
    )
    refund = r.json()

    conn = psycopg2.connect(DB_DSN)
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO refunds (order_id, amount_cents, reason, refund_id) "
        "VALUES (%s, %s, %s, %s)",
        (order_id, amount_cents, reason, refund["id"]),
    )
    conn.commit()
    cur.close()
    conn.close()

    return {
        "order_id": order_id,
        "refund_id": refund["id"],
        "amount_cents": amount_cents,
    }


def main():
    result = refund_order("ord_abc123", 1500, "customer request")
    print(result)


if __name__ == "__main__":
    main()

Read it like a reviewer. Where does this fall over?

  • The requests.get has no timeout. The orders service goes slow, this hangs forever.
  • r.json() is called without checking r.ok. A 500 from upstream blows up on order["status"].
  • The refund POST has no idempotency key. Retry the function and the customer gets refunded twice.
  • The DB write happens after the refund. If the DB is down, the customer is refunded but you have no record. Reconcile-by-hand territory.
  • One generic Exception. No logging. No trace context. When it breaks at 2am you get a stack trace and nothing else.

This is the audit finding, in code. It works once. It will break in production within a quarter.

The 30-line rewrite that survives Tuesday

Same function, written like the reviewer is awake. Same intent, same surface, the four omissions filled in.

# ORDERS_URL, DB_DSN imported from config
import logging
import httpx
import psycopg
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

log = logging.getLogger(__name__)
client = httpx.Client(timeout=httpx.Timeout(5.0, connect=2.0))


def _is_retryable(exc: BaseException) -> bool:
    # Retry transport errors and 5xx; a 4xx means the request itself is wrong,
    # so repeating it just burns attempts.
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code >= 500
    return isinstance(exc, httpx.TransportError)


@retry(
    retry=retry_if_exception(_is_retryable),
    stop=stop_after_attempt(4),
    wait=wait_exponential_jitter(initial=0.5, max=8),
    reraise=True,
)
def _post_refund(order_id: str, body: dict, idem_key: str) -> dict:
    r = client.post(
        f"{ORDERS_URL}/orders/{order_id}/refund",
        json=body,
        headers={"Idempotency-Key": idem_key},
    )
    r.raise_for_status()
    return r.json()

The retried leg is isolated. The orchestration sits below.

def refund_order(order_id: str, amount_cents: int, reason: str) -> dict:
    idem_key = f"refund:{order_id}:{amount_cents}"
    log.info("refund.start", extra={"order_id": order_id, "idem_key": idem_key})
    with psycopg.connect(DB_DSN, autocommit=False) as conn, conn.cursor() as cur:
        cur.execute("SELECT refund_id FROM refunds WHERE idem_key=%s", (idem_key,))
        if row := cur.fetchone():
            log.info("refund.replay", extra={"order_id": order_id})
            return {"order_id": order_id, "refund_id": row[0], "replayed": True}
        refund = _post_refund(order_id, {"amount_cents": amount_cents, "reason": reason}, idem_key)
        cur.execute(
            "INSERT INTO refunds (order_id, amount_cents, reason, refund_id, idem_key) "
            "VALUES (%s, %s, %s, %s, %s)",
            (order_id, amount_cents, reason, refund["id"], idem_key),
        )
        conn.commit()
    log.info("refund.ok", extra={"order_id": order_id, "refund_id": refund["id"]})
    return {"order_id": order_id, "refund_id": refund["id"], "amount_cents": amount_cents}

What changed, mapped to the four. Error handling is now real: timeouts on the HTTP client, raise_for_status() on every response, and the DB connection and cursor in context managers so failure rolls back and success commits. Idempotency is enforced by a deterministic idem_key from order_id + amount_cents; the DB is checked for an existing refund with that key before the upstream call, the upstream gets the key in a header so the orders service can dedupe too, and replays return the original refund without double-charging.
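One thing the rewrite assumes but does not show: the refunds table needs the idem_key column with a unique constraint, otherwise two concurrent calls can both miss the SELECT and both insert. A minimal sketch of that migration, with the column names taken from the INSERT above and everything else (types, index name) my own assumption:

# Hypothetical migration for the refunds table the rewrite assumes. Column
# names come from the INSERT above; types and the index name are guesses.
import os

import psycopg

DB_DSN = os.environ["DB_DSN"]

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS refunds (
    order_id     text        NOT NULL,
    amount_cents integer     NOT NULL,
    reason       text        NOT NULL,
    refund_id    text        NOT NULL,
    idem_key     text        NOT NULL,
    created_at   timestamptz NOT NULL DEFAULT now()
)
"""

# The unique index is what makes the SELECT-then-INSERT replay check safe
# under races: two concurrent calls can both miss the SELECT, but only one
# INSERT lands; the loser gets a unique violation instead of a duplicate row.
CREATE_IDEM_INDEX = """
CREATE UNIQUE INDEX IF NOT EXISTS refunds_idem_key_uq ON refunds (idem_key)
"""

with psycopg.connect(DB_DSN) as conn:
    conn.execute(CREATE_TABLE)
    conn.execute(CREATE_IDEM_INDEX)

The same key also covers the "refund succeeded, insert failed" gap from the first version: a retry replays through the same Idempotency-Key, the orders service returns the same refund, and the row lands on the second pass instead of a second charge.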

The remaining two of the four:

  • Retries. tenacity with exponential backoff, jitter, and a hard cap of 4 attempts. No infinite loops. No hot retry. Retries only happen on the upstream POST, which is the volatile leg, and only on transport errors and 5xx; a 4xx fails immediately because it will fail the same way every time.
  • Observability. Three structured log events with consistent fields: refund.start, refund.replay, refund.ok. A real on-call can grep, alert on missing refund.ok after refund.start, and reconstruct the path of a single order from logs alone (see the formatter sketch below).
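One caveat those log calls hide: with stdlib logging and the default formatter, the extra={} fields are attached to the record but never printed, so order_id and idem_key silently vanish from the output. A minimal JSON formatter sketch that surfaces them, assuming plain logging rather than structlog or python-json-logger; the field names mirror the extra keys used in refund_order:

import json
import logging


class JsonFormatter(logging.Formatter):
    # Fields promoted from the log record into the JSON line. These mirror the
    # extra={} keys used by refund_order; anything absent is simply omitted.
    FIELDS = ("order_id", "refund_id", "idem_key")

    def format(self, record: logging.LogRecord) -> str:
        line = {"event": record.getMessage(), "level": record.levelname}
        for field in self.FIELDS:
            if hasattr(record, field):
                line[field] = getattr(record, field)
        if record.exc_info:
            line["exc"] = self.formatException(record.exc_info)
        return json.dumps(line)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

With that wired in, "alert when refund.start has no matching refund.ok within a minute" is a query against structured fields instead of a regex over free text.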

About 30 lines in the orchestration body. The first version was about 50. This is not "more code." It is the same code, with the four things vibe coding leaves out.

The prompt change that gets you 80% there

Most vibe-coded output gets close to the 30-line version if you ask correctly. A prompt template that has worked across teams I have seen:

Write <function description>.

Constraints — every one of these must be addressed
explicitly in the code:
1. Error handling: every external call must have a
   timeout and a status check; failures must propagate
   with context, not as bare exceptions.
2. Idempotency: if this function may be retried by a
   caller, it must produce the same effect on retry.
   State the idempotency key.
3. Retries: any call that crosses a network must use
   exponential backoff with jitter and a max-attempts
   cap. Do not retry on 4xx.
4. Observability: emit structured log events at start,
   on replay, on success, and on each failure mode.
   Use the project's log fields.

Before the code, list which of the four constraints
apply and how you address each. Then write the code.

This shifts the model from "produce a function that works on the happy path" to "produce a function that survives all four production failure modes." It does not eliminate review. The survey's 96-vs-48 verification gap is still yours to close. But it gets the floor of the generated code closer to the floor of code your reviewer would write themselves.

The teams getting away with high AI-code volume in 2026 are not the ones that merely prompt better. They are the ones that prompt the four constraints, run a verifier on the output (lint, type-check, security scan), and treat AI output as a junior PR: useful, fast, never trusted on its first read.
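What "run a verifier" can look like in practice, as one small sketch: a gate script that refuses the merge if lint, types, or the security scan fail. The tool choices here (ruff, mypy, bandit) and the src path are assumptions, not anything the survey prescribes; swap in whatever your stack already runs.

# Hypothetical verification gate for AI-generated changes. Assumes ruff,
# mypy, and bandit are installed; "src" is a placeholder path.
import subprocess
import sys

CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["mypy", "."]),
    ("security", ["bandit", "-q", "-r", "src"]),
]


def main() -> int:
    failed = []
    for name, cmd in CHECKS:
        # Run each check to completion so one failure does not hide the others.
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failed.append(name)
    if failed:
        print(f"verification failed: {', '.join(failed)}", file=sys.stderr)
        return 1
    print("all checks passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())

None of this verifies the four constraints directly; it makes the mechanical failures cheap to catch so review time goes to idempotency and retries, where the scanner cannot help.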

These aggregate findings reflect averages across audited samples and surveyed developers; they do not characterize any single tool or vendor.

What the survey says about your career

In 2026, generating code with AI is table stakes. Everyone does that. The skill that pays is being the person who notices the four omissions in the AI's draft, who writes the prompt that demands them up front, and who maintains the harness (tests, scans, traces) that catches the cases where the prompt was not enough. That is what "senior" means now. The audit results are the evidence.

If this was useful

The four-constraint prompt above is the small version of patterns covered in the Prompt Engineering Pocket Guide: making model output reviewable, structured, and survivable. If you are running multiple LLM calls in a chain, or letting an agent execute the code it just wrote, the AI Agents Pocket Guide covers the autonomy / verification trade-off the survey is really measuring.
