DEV Community: Anatoly Silko

How to Audit a Laravel Codebase You've Inherited

Anatoly Silko — Mon, 25 May 2026 15:42:40 +0000

Most businesses don't inherit a Laravel codebase on purpose. A developer leaves. An agency relationship ends. A company is acquired. A freelancer goes quiet. However it happens, the result is the same: you now own a working application, built by someone you may never speak to again, with no clear picture of what's inside it.

This is not unusual. Digital agencies experience average client churn rates of 42% annually for project-based work and 18% for retainer relationships, with an average client lifespan of just 24 months for project-based engagements (Focus Digital, Agency Churn Report 2025). Separately, 81% of UK businesses report being negatively affected by IT and tech skills shortages (Hyve Managed Hosting, IT & Tech Skills Gap Report 2024). The inherited codebase scenario isn't an edge case — it's the default outcome of a market where developer tenure, agency relationships, and project continuity rarely align.

The question you're facing isn't whether to audit. It's how to audit properly — distinguishing genuine risks from cosmetic noise, and understanding what the tools are actually telling you.

This article covers both sides. If you're a technical lead, it walks through the audit toolkit, what each tool finds and misses, and what "good" looks like in concrete benchmarks. If you're a managing director or founder who can't read PHP, it gives you a non-technical checklist you can run yourself before commissioning a professional review — and a framework for interpreting what the professionals report back.

The companion articles in this series cover what happens when your developer leaves, what happens when nobody applies security updates, the warning signs that your application has become a liability, and how much it costs to rescue a neglected codebase. This article is about the audit itself — the process, the tools, the interpretation.

Start with what you can check without opening the code

If you're a non-technical business owner, you don't need to wait for a developer to tell you whether you have a problem. There are things you can verify right now, today, that cost nothing and take less than an hour.

Do you own the domain?

UK domains (.co.uk and .uk) are managed by Nominet, and you can verify ownership through a WHOIS lookup in minutes. The risk is real: domains are routinely registered by agencies or developers using their own details rather than the client's. If a dispute arises, Nominet's resolution process takes roughly ten weeks and costs £200–£750+VAT. Since 2001, over 16,000 domain disputes have been resolved through this process, and the vast majority result in transfer to the complainant (Nominet DRS). But prevention is considerably cheaper than dispute resolution.

Do you have access to the Git repository?

The codebase should live in a version control system — GitHub, GitLab, or Bitbucket — and the account should belong to your company, not to a departed individual. If you can't access the repository, you can't see the history of changes, you can't grant access to a new developer, and you don't truly control the code.

Do you have server and hosting credentials?

Can you log into the hosting dashboard? Do you know who controls the SSL certificate? Are third-party service accounts — Stripe, Mailgun, AWS, whatever the application uses — registered under company email addresses or someone's personal account?

When was the last deployment?

If nobody can tell you when the application was last updated, that's a data point in itself. An application that hasn't been deployed in six months is accumulating unpatched vulnerabilities at a predictable rate — we covered the specific CVE timelines and exploitation data in Laravel Security: What Happens When Nobody's Applying Updates.

Are backups running?

Not "do backups exist" — when was the last verified restore? A backup that has never been tested is a hypothesis, not a safety net.

Is there any documentation?

A README file, deployment instructions, architecture diagrams — anything. 78% of developers joining a new project find navigating an unfamiliar codebase challenging or very challenging (JetBrains Platform Blog, March 2026). Documentation problems typically consume 15–25% of engineering capacity (DX/GetDX). If no documentation exists, the first developer you hire will spend their initial weeks (and your money) figuring out what the previous developer already knew.

None of these checks require technical skill. All of them tell you something material about the state of the asset you've inherited. If the answers to several of these are "no" or "I don't know," you have the answer to whether a professional audit is worth commissioning.


The audit toolkit: what each tool actually finds — and what it misses

A competent Laravel audit uses multiple tools because no single tool covers everything. Each instrument examines one dimension of code health. Understanding what each tool does — and does not — check is the difference between reading audit findings intelligently and being overwhelmed by a wall of numbers.

Static analysis: PHPStan and Larastan

PHPStan analyses your code without running it, looking for type errors, undefined methods, unreachable code, and argument mismatches. It operates on a scale of rule levels that runs from 0 to 9 on PHPStan 1.x and to 10 on PHPStan 2.0 and later, each level progressively stricter:

Levels 0–2 catch the basics: unknown classes, wrong argument counts, possibly undefined variables. Level 3 adds return type checking. Level 5 adds argument type validation. Level 6 flags missing type hints. Levels 8 and 9 enforce strict nullability and mixed-type safety. Level 10, where available, extends the mixed-type checks to values that have no declared type at all.

Larastan extends PHPStan specifically for Laravel, resolving the "magic" that makes Laravel powerful but also makes static analysis difficult — Eloquent models, Facades, the service container, query builders, and collection methods all get proper type inference.

The practical reality: error counts typically double between level 5 and level 8 on the same codebase (Tomas Votruba, phpstan-bodyscan). The default Larastan configuration starts at level 5. No published data exists on what percentage of production codebases pass level 5 or above, but anecdotally, most inherited applications start at level 0 and work upward. Introducing PHPStan at level 8 on a payment processing system reportedly caught errors that would have caused production incidents (fsck.sh).

What it does not check: design smells, code style, security vulnerabilities, or runtime behaviour. PHPStan tells you whether the code is technically correct. It does not tell you whether it's well-designed.
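To turn a raw PHPStan run into something you can triage, it helps to rank files by error count. The sketch below is a minimal Python example, assuming the report was produced with PHPStan's JSON error format (phpstan analyse --error-format=json). The sample report, file paths, and messages are hypothetical, and errors_per_file is our own helper name, not part of PHPStan.

```python
import json
from collections import Counter

# Hypothetical report, shaped like the output of
# `phpstan analyse --error-format=json`. Paths and messages are invented.
SAMPLE_REPORT = """
{
  "totals": {"errors": 0, "file_errors": 3},
  "files": {
    "app/Http/Controllers/PaymentController.php": {
      "errors": 2,
      "messages": [
        {"message": "Call to an undefined method Order::refundAll().", "line": 41, "ignorable": true},
        {"message": "Parameter #1 $amount of method charge() expects int, string given.", "line": 58, "ignorable": true}
      ]
    },
    "app/Models/Invoice.php": {
      "errors": 1,
      "messages": [
        {"message": "Method Invoice::total() should return float but returns string.", "line": 17, "ignorable": true}
      ]
    }
  },
  "errors": []
}
"""


def errors_per_file(report_json: str) -> list[tuple[str, int]]:
    """Return (path, error_count) pairs, worst file first."""
    report = json.loads(report_json)
    # A clean run may serialise "files" as an empty list rather than an object.
    files = report.get("files") or {}
    counts = Counter({path: len(entry["messages"]) for path, entry in files.items()})
    return counts.most_common()


for path, count in errors_per_file(SAMPLE_REPORT):
    print(f"{count:>3}  {path}")
```

Ranking by file rather than reading the flat error list tends to show quickly whether the problems are concentrated in a few risky classes or spread evenly across the codebase.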

Code style: Laravel Pint

Laravel Pint, included by default in every new Laravel application since version 9.21, enforces consistent formatting based on PHP-CS-Fixer. It auto-fixes issues — brace placement, array syntax, concatenation spacing, namespace imports, type declarations — rather than merely reporting them.

Style matters beyond aesthetics. A study analysing 2.2 million lines of code found that better readability correlates directly with fewer defects (Axify). Separate research across 54 open-source projects and 112,266 commits found a positive quality effect when code analysis tooling was present (Empirical Software Engineering, Springer). The mechanism is straightforward: inconsistent formatting makes code harder to read, harder reading increases the probability of missed bugs during review, and missed bugs compound into production issues.

What it does not check: bugs, types, security, or logic errors. Pint ensures the code is consistently formatted. It says nothing about whether it works.

Dependency scanning: composer audit and npm audit

composer audit, built into Composer since version 2.4, checks your PHP dependencies against the Packagist Security Advisory database. Since Composer 2.7, it also flags abandoned packages — projects whose maintainers have walked away.

The PHP ecosystem context: 80% of application dependencies remain un-upgraded for over a year (Sonatype, 10th Annual State of the Software Supply Chain, 2024). Yet 95% of the time a vulnerable component is consumed, a fixed version already exists. The gap between "a patch is available" and "the patch is applied" is where most of the real-world risk lives.

On the frontend side, npm audit checks JavaScript dependencies against the GitHub Advisory Database. The npm ecosystem faces a more acute threat: over 99% of identified open-source malware appeared on npm in 2025, with 454,600 new malicious packages identified in that year alone (Sonatype, 2026 Supply Chain Report). The cumulative total now exceeds 1.2 million known-malicious packages across all ecosystems.

What neither tool checks: source code vulnerabilities, logical bugs, misconfigurations, or zero-day exploits. They catch known problems in known packages. They do not examine your custom code.
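For a technical lead, the useful output of a dependency scan is a short list of packages to act on. The following Python sketch summarises a report saved from composer audit --format=json. The JSON shape used here (an "advisories" object keyed by package name, and an "abandoned" object mapping each package to a suggested replacement or null) is an assumption about that output, and the package names and CVE identifier are invented.

```python
import json

# Hypothetical report. The structure is an assumption about what
# `composer audit --format=json` emits; package names are invented.
SAMPLE_AUDIT = """
{
  "advisories": {
    "acme/http-client": [
      {"title": "Example request smuggling issue", "cve": "CVE-0000-0001", "affectedVersions": "<2.3.1"}
    ]
  },
  "abandoned": {
    "acme/old-mailer": "acme/new-mailer",
    "acme/legacy-pdf": null
  }
}
"""


def summarise_audit(audit_json: str) -> dict:
    """Group findings into vulnerable packages and abandoned packages."""
    data = json.loads(audit_json)
    # Empty sections may arrive as [] instead of {}, so normalise both.
    advisories = data.get("advisories") or {}
    abandoned = data.get("abandoned") or {}
    vulnerable = {
        package: [item.get("cve") or item.get("title") for item in items]
        for package, items in advisories.items()
    }
    return {"vulnerable": vulnerable, "abandoned": sorted(abandoned)}


print(summarise_audit(SAMPLE_AUDIT))
```

Keeping vulnerable and abandoned packages in separate buckets matters for triage: the first is usually a version bump, the second is a replacement decision.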

Comprehensive scanning: SonarQube

SonarQube's Community Edition covers PHP 5.0 through 8.4, including Laravel and Symfony, with over 270 built-in static analysis rules. It produces metrics across six dimensions: bugs, vulnerabilities, code smells, security hotspots, test coverage (imported from external reports), and code duplication.

SonarQube's default quality gate for new code requires zero new bugs, zero new vulnerabilities, all security hotspots reviewed, code coverage of 80% or above on new code, and duplication of 3% or less on new code (SonarQube documentation). The tool's own documentation targets zero false positives for bugs, and over 80% true positives for vulnerability detection.
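The quality gate conditions described above are simple enough to express as a checklist in code. This is a minimal Python sketch of those five thresholds as stated in this article, not SonarQube's implementation or API; the NewCodeMetrics class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NewCodeMetrics:
    """Hypothetical container for metrics measured on new code only."""
    new_bugs: int
    new_vulnerabilities: int
    hotspots_reviewed_pct: float
    coverage_pct: float
    duplication_pct: float


def quality_gate_failures(m: NewCodeMetrics) -> list[str]:
    """Return the conditions that fail; an empty list means the gate passes."""
    failures = []
    if m.new_bugs > 0:
        failures.append("new bugs present")
    if m.new_vulnerabilities > 0:
        failures.append("new vulnerabilities present")
    if m.hotspots_reviewed_pct < 100:
        failures.append("security hotspots not fully reviewed")
    if m.coverage_pct < 80:
        failures.append("coverage below 80%")
    if m.duplication_pct > 3:
        failures.append("duplication above 3%")
    return failures


print(quality_gate_failures(NewCodeMetrics(0, 1, 100.0, 72.5, 2.0)))
```

Note that every condition applies to new code. An inherited codebase can fail all of these on its existing code and still pass the gate on each new change, which is what makes the gate usable on legacy projects.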

What it does not check: runtime behaviour, infrastructure configuration, or Laravel-specific patterns.

Laravel-specific auditing: Enlightn

Enlightn is the only tool built specifically for auditing Laravel applications. The open-source version runs 66–67 automated checks; the Pro version runs 131. Checks span three categories — performance, security, and reliability — and cover Laravel-specific concerns that generic tools miss: route and config caching, N+1 queries, middleware bloat, CSRF configuration, cookie security, mass assignment exposure, queue configuration, and environment validation.

What it does not check: general PHP issues outside Laravel-specific patterns.

Design quality: PHPMD

PHP Mess Detector examines dimensions that PHPStan deliberately ignores: cyclomatic complexity, NPath complexity, coupling between objects, excessive method and class length, naming conventions, unused code, depth of inheritance. Its default cyclomatic complexity threshold is 10 — the upper limit Thomas McCabe proposed in his original 1976 paper, and the threshold still endorsed by the Software Engineering Institute at Carnegie Mellon.

What it does not check: type correctness, security vulnerabilities, or formatting.

Automated refactoring: Rector

Rector parses PHP into an abstract syntax tree, applies transformation rules, and regenerates the modified code. With 824 total rules and a dedicated rector-laravel package providing 100+ Laravel-specific transformations, it can automate version upgrades, modernise deprecated patterns, and enforce consistency. One UK-based Official Laravel Partner reports that 20–40 hours of manual upgrade work for a medium application can be partially automated by Rector in minutes.

The critical caveat: Rector transforms code — it does not verify semantic correctness. Without a test suite, automated refactoring can introduce bugs silently. It is a power tool, not a safety net.

What's missing from the list above

A thorough audit also uses Psalm (Vimeo's static analyser, which adds taint analysis for SQL injection and XSS detection that PHPStan lacks natively), PHPCPD (copy-paste detection for identifying duplicated code blocks), Deptrac (enforcing architectural boundaries so layers don't violate dependency rules), and OWASP ZAP for dynamic application security testing against the running application — finding runtime vulnerabilities that no static tool can see.

What "good" actually looks like: benchmarks for a healthy codebase

Audit tools produce numbers. Without benchmarks, those numbers are meaningless. Here's what the published data says about where the lines fall.

Test coverage

Martin Fowler's widely-cited guidance: aim for the upper 80s to 90%, and be suspicious of anything claiming 100%. Google's internal research, published at ESEC/FSE 2019, is more granular: 60% is acceptable, 75% is commendable, 90% is exemplary. SonarQube's default quality gate requires 80% coverage on new code. The Laravel framework itself maintains approximately 76% line coverage, with heavily-used core classes (Query Builder, Router) above 90%.

The JetBrains State of Developer Ecosystem survey found that 31% of PHP developers don't write tests at all. For a custom business application built by a solo developer or small agency, finding zero test coverage is common. Finding 40–60% coverage is respectable. Finding 80%+ is genuinely good.

Google's testing blog puts it plainly: the gains of increasing coverage beyond a certain point are logarithmic, but taking concrete steps to move from 30% to 70% is where the real value lies. If you've inherited a codebase at 0%, the goal isn't 90%. The goal is getting critical paths — authentication, payments, data mutations — covered first.

Code duplication

SonarQube's default threshold: 3% or less duplicated lines on new code. Industry guidance is more forgiving for legacy codebases: below 5% is considered optimal, 5–10% is acceptable, and above 10% requires immediate attention (KPI Depot). For codebases inheriting AI-generated code, SonarSource's own recommendation tightens to 1% or less.

Cyclomatic complexity

McCabe's original 1976 recommendation — a method-level upper limit of 10 — remains the industry standard. The Software Engineering Institute at Carnegie Mellon classifies 1–10 as simple and low risk, 11–20 as moderate, 21–50 as complex and high risk, and above 50 as effectively untestable. PHPMD uses 10 as its default threshold.
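The bands above translate directly into a lookup you can apply to whatever per-method scores your tooling exports. A minimal Python sketch follows; the method names and scores are hypothetical, and the band labels simply restate the classification quoted in this section.

```python
def complexity_band(score: int) -> str:
    """Map a cyclomatic complexity score to the risk bands quoted above."""
    if score < 1:
        raise ValueError("cyclomatic complexity starts at 1")
    if score <= 10:
        return "simple, low risk"
    if score <= 20:
        return "moderate"
    if score <= 50:
        return "complex, high risk"
    return "effectively untestable"


# Hypothetical per-method scores, for illustration only.
HYPOTHETICAL_SCORES = {
    "PaymentController::charge": 27,
    "ImportLegacyUsers::handle": 34,
    "InvoiceService::total": 6,
}


def over_threshold(scores: dict[str, int], threshold: int = 10) -> list[str]:
    """Method names above the threshold, most complex first."""
    flagged = [name for name, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)


for name in over_threshold(HYPOTHETICAL_SCORES):
    print(name, HYPOTHETICAL_SCORES[name], complexity_band(HYPOTHETICAL_SCORES[name]))
```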

In practice, an inherited Laravel codebase will contain methods above 10. The question is where. High complexity in a controller action that handles payment processing is a red flag. High complexity in a one-off data migration script is not.

Dependency freshness

80% of application dependencies remain un-upgraded for over a year (Sonatype, 2024). Expecting 100% of dependencies to be current at all times is unrealistic. A well-managed application should aim to be no more than one minor version behind on critical dependencies, with a regular cadence (monthly or quarterly) for reviewing and updating.
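The "no more than one minor version behind" target can be checked mechanically once you have installed and latest version strings for each package. The Python sketch below is one reading of that policy, assuming plain major.minor.patch version strings and treating a different major version as out of policy; both choices are our assumptions, and pre-release suffixes are not handled.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'major.minor.patch' (optionally prefixed with 'v') into integers."""
    parts = version.lstrip("v").split(".")
    # Pad short versions such as '11.5' so they read as 11.5.0.
    major, minor, patch = (int(part) for part in (parts + ["0", "0"])[:3])
    return major, minor, patch


def is_within_policy(installed: str, latest: str) -> bool:
    """True when installed shares the latest major and is at most one minor behind."""
    installed_v = parse_version(installed)
    latest_v = parse_version(latest)
    same_major = installed_v[0] == latest_v[0]
    return same_major and latest_v[1] - installed_v[1] <= 1


print(is_within_policy("11.4.0", "11.5.2"))
print(is_within_policy("11.2.0", "11.5.2"))
```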

Red flags, amber flags, and cosmetic noise

Not everything an automated tool flags matters equally. The most important skill in reading audit results is severity classification — distinguishing the findings that demand immediate action from the ones that can wait, and both of those from the noise that looks alarming but affects nothing.

Red flags: immediate risk

These warrant action before anything else. They represent security exposure, data loss risk, or production instability.

A .env file accessible from a browser, containing database credentials, API keys, and the application's encryption key.
APP_DEBUG set to true in production, exposing full stack traces, environment variables, and database queries to any user who triggers an error.
Hardcoded credentials committed to the Git repository.
SQL queries built with raw string concatenation of user input.
Admin routes without authentication middleware.
No backups, or no verified restore capability.
Eloquent models with mass-assignment protection switched off (an empty $guarded array, or models unguarded globally) while raw request input is passed to create() or update(), leaving every database column open to mass assignment.
Missing CSRF protection on forms.
Missing database constraints where the application logic assumes uniqueness or referential integrity.
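Two of the red flags above can be checked from the contents of a production .env file alone. The following Python sketch uses a deliberately simplified parser (it does not implement full dotenv syntax such as multi-line values or variable interpolation), and the sample file and function names are hypothetical.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_red_flags(text: str) -> list[str]:
    """Report debug mode in production and a missing application key."""
    env = parse_env(text)
    flags = []
    if env.get("APP_ENV") == "production" and env.get("APP_DEBUG", "").lower() == "true":
        flags.append("APP_DEBUG is true in production")
    if not env.get("APP_KEY"):
        flags.append("APP_KEY is missing or empty")
    return flags


# Hypothetical file contents, for illustration only.
SAMPLE_ENV = """
APP_NAME=Example
APP_ENV=production
APP_KEY=
APP_DEBUG=true
# DB_PASSWORD is managed elsewhere
"""

print(env_red_flags(SAMPLE_ENV))
```

Run this against a copy of the file on a trusted machine. It only inspects configuration values and says nothing about whether the file itself is reachable from a browser, which still needs to be tested against the live site.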

An audit that documented a real inherited Laravel codebase found forms susceptible to CSRF attacks, PSR standardisation gaps, missing namespaces, classes with thousands of lines, over 30 switch statements indicating missing polymorphism, and no dependency injection — all in a single 24-page report (Zaengle Corp).

Amber flags: technical debt, not immediate danger

Outdated but still-supported dependencies.
Low test coverage (below 20%).
Inconsistent code style and naming conventions.
No CI/CD pipeline, so deployments are manual.
No structured logging.
N+1 query problems.
Oversized classes and methods.
TODO comments without corresponding backlog items.

These won't cause a production incident tomorrow. They will cause the next developer to move slowly, make mistakes, and cost you more than they should.

Cosmetic findings: ignore these first

Commented-out code in Laravel config files, which ships with the framework by default.
PSR formatting violations in non-public code.
Missing docblocks on methods that already have full type hints (Spatie's own guidelines explicitly say not to add them).
Minor complexity warnings on service providers or configuration classes that are inherently complex by nature.

This matters because alert fatigue is real and documented. 70% of a security team's time is spent investigating false positive alerts. 33% of companies have been late responding to actual cyberattacks because teams were occupied with false positives. Each false positive takes an average of 32 minutes to investigate (Snyk, 2025). If you let the noise drown out the signal, you will spend your audit budget on cosmetic fixes while the .env file remains publicly accessible.

What a competent professional audit actually covers

If the non-technical checklist and the automated tools are the first two layers, the professional expert review is the third — and the most valuable. An experienced Laravel auditor doesn't just run the tools listed above. They interpret the results in context, examine dimensions that tools cannot see, and produce a prioritised assessment that tells you what to do, in what order, and why.

The most comprehensive published Laravel audit methodology, from a US-based Official Laravel Partner (Ravenna Interactive), covers seven core categories: architecture and boundaries (where business rules live, duplicated logic, tight coupling, "god objects"); security (authorisation correctness, multi-tenant isolation, sensitive data handling); data integrity and concurrency (missing database constraints, risky read-then-write sequences, non-idempotent payment handlers); performance and scalability (N+1 queries, indexing strategy, synchronous work in HTTP requests, queue design); test strategy and change safety (coverage of risky paths — billing, permissions, state transitions); dependency and supply-chain risk (outdated packages, known CVEs, pinning strategy, version lifecycle); and deployment, runtime, and observability (environment parity, rollback plans, migration safety, logging quality, queue monitoring).

The OWASP Top 10 maps specifically to Laravel in ways worth understanding. Laravel's defaults are strong against injection (Eloquent uses PDO parameter binding), XSS (Blade's {{ }} auto-escapes by default), CSRF (middleware enabled by default), and cryptographic failures (built-in bcrypt/Argon2 hashing). But Laravel requires explicit developer action for broken access control (Gates and Policies exist but are not enforced unless applied), security misconfiguration (the stock .env.example sets APP_DEBUG=true, so debug mode stays on in any environment whose .env was copied from it and never edited), supply chain failures (no built-in dependency vulnerability scanning), insecure design (no framework can fix missing threat modelling), and logging and alerting (Monolog is included but minimally configured). The top three OWASP categories are, in the words of a specialist Laravel security auditor, "common weaknesses I find when auditing Laravel apps" (Stephen Rees-Carter, Securing Laravel).

A US-based Laravel specialist who has completed hundreds of security reviews describes their process as: code review for vulnerabilities first, then a knowledge-applied penetration test using the code review findings, followed by common area checks, with continuous dialogue throughout. They explicitly note that a code review is not a penetration test — these are distinct services that complement each other.

How long it takes and what it costs

For a typical Laravel application — 50 to 200 models, 10,000 to 50,000 lines of code — the published consensus across multiple agency sources is two to three weeks for a comprehensive audit, broken down roughly as: onboarding and orientation (days 1–2), automated scans plus initial manual review (days 3–7), deep dives into architecture, security, and performance (days 8–12), and report writing with prioritised recommendations (days 13–15).

The automated scan alone — running PHPStan, Enlightn, SonarQube, composer audit, npm audit, and PHPMD — takes hours, not days. What takes the remaining time is the expert interpretation: triaging false positives, assessing business impact, evaluating architectural decisions in context, and producing a report that a non-technical stakeholder can act on.

Globally, only three firms publish exact pricing for Laravel audit services. A US-based specialist offers video walkthroughs at $2,500 and written reports at $3,500, with 3–5 business day turnaround. A dedicated Laravel security reviewer charges a flat $2,500 regardless of application size. A vibe-code audit firm offers free initial assessments with full transformations at $10,000–$20,000. No UK agency publishes a fixed price for a Laravel code audit. Every UK firm reviewed — including Official and Platinum Laravel Partners in Birmingham, Southampton, and Edinburgh — requires a consultation before quoting.

The documentation gap: what should exist but usually doesn't

Taylor Otwell, Laravel's creator, made an observation on the Maintainable.fm podcast in 2025 that frames the documentation problem precisely: "The Laravel apps that age best are the ones that don't get too clever — because the clever dev always moves on." He called "cleverness" a code smell and warned against what he described as "cathedrals of complexity."

Laravel's convention-over-configuration design is supposed to help here. When developers follow the framework's conventions — default folder structure, Eloquent patterns, standard routing — an inherited codebase is significantly easier to understand because the next developer knows where things should be. Christoph Rumpel, who runs the State of Laravel survey, made the point directly in March 2026: "Laravel's opinionated nature — the thing some people used to criticise — turns out to be its biggest strength." Jason McCreary, creator of Laravel Shift, has upgraded over 20,000 Laravel applications and consistently advises: "Keep the default folder structure." He reports that developers who create custom folder structures "eventually regretted it."

But convention-over-configuration only helps when conventions are followed. When they're not — custom folder structures, raw SQL bypassing Eloquent, unnecessary abstraction layers, "clever" patterns that deviate from framework idioms — the inherited codebase becomes harder to understand than a non-framework application, because the next developer expects conventions and instead finds deviations.

The documentation that should exist for any custom business application:

A README with setup instructions and architecture overview.
Deployment procedures: not "deploy to production" but the actual commands, in order, with environment variables documented.
Architecture diagrams at minimum covering context and container levels.
An explanation of business logic decisions: the "why," not just the "what."
A record of all credentials, API keys, and third-party service accounts.
Known technical debt and architectural risks.
Known bugs and workarounds.

In practice, most inherited codebases have none of this. Only 58% of organisations actively maintain documentation, while 73% of developers cite poor or incomplete documentation as their primary obstacle when working with existing code (Augment Code). The two figures measure different things, but read together they suggest that documentation an organisation counts as maintained is often not documentation a developer can work from. That mismatch is, in itself, a finding.

The audit as a decision point

An audit is not maintenance. It is a one-time assessment that produces a decision: what to fix, in what order, and whether ongoing maintenance makes sense or whether more fundamental work is needed first.

The severity framework above gives you the prioritisation logic. Red flags get fixed before amber. Amber before cosmetic. If your audit reveals that the codebase needs significant work (a major version upgrade, architectural restructuring, security hardening), that's a separate project, not a retainer. Trying to fix fundamental problems within a maintenance budget is how retainer relationships fail.
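The ordering described above (red, then amber, then cosmetic findings) is easy to apply once every finding carries a severity label. A minimal Python sketch, with hypothetical findings and field names:

```python
# Lower number means fix sooner; labels follow the article's three tiers.
SEVERITY_ORDER = {"red": 0, "amber": 1, "cosmetic": 2}


def prioritise(findings: list[dict]) -> list[dict]:
    """Sort findings by severity, keeping the original order within a tier."""
    return sorted(findings, key=lambda finding: SEVERITY_ORDER[finding["severity"]])


# Hypothetical audit findings, in the order a tool happened to report them.
FINDINGS = [
    {"title": "Missing docblocks on typed methods", "severity": "cosmetic"},
    {"title": "No CI/CD pipeline", "severity": "amber"},
    {"title": ".env reachable from the browser", "severity": "red"},
    {"title": "N+1 queries on the orders page", "severity": "amber"},
]

for finding in prioritise(FINDINGS):
    print(f"[{finding['severity']}] {finding['title']}")
```

Python's sorted is stable, so findings within the same tier keep the order in which they were reported. A finding with an unrecognised severity raises a KeyError, which is deliberate here: an unclassified finding should be triaged, not silently ranked.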

Once the critical issues are resolved and the application is in a maintainable state, the economics shift entirely. A maintenance retainer at £450 per month costs £5,400 per year. That is less than a single average cyber breach for a UK SME, a fraction of a single emergency remediation project, and roughly one-fiftieth of a full rebuild.

What to do next

If you've read this far, you're in one of two positions.

You're a technical lead who now has a clear picture of which tools to run, what benchmarks to measure against, and how to classify findings by severity. The tools are free. The time investment for an automated baseline — PHPStan, Enlightn, composer audit, npm audit, PHPMD, and a SonarQube scan — is a single afternoon. What you learn in that afternoon will tell you whether the inherited codebase is fundamentally sound, in need of targeted remediation, or a candidate for more serious intervention.

Or you're a non-technical business owner who now has a checklist you can run today, a framework for understanding what a professional audit should cover, and the vocabulary to have an informed conversation with whoever you commission to do the work. You know what a red flag looks like versus cosmetic noise. You know what questions to ask. You know what documentation should exist and usually doesn't.

Either way, the audit is step one. What happens after the audit — whether that's targeted fixes, a scoped project, or structured ongoing support — depends entirely on what the findings reveal.

I'm Anatoly Silko, founder of Rocking Tech — a UK-based agency that builds and maintains production Laravel platforms. If you've inherited a codebase and want to know what you're working with, the original version of this article has more detail on next steps.

Why Your Vibe-Coded App Keeps Breaking Every Time You Fix Something

Anatoly Silko — Sat, 02 May 2026 17:46:11 +0000

You ship a working feature on Monday. Tuesday morning a small cosmetic bug appears. You ask the AI to fix it. The fix works but the login flow now throws a 500. You paste the stack trace back in. The login comes back, but the payment webhook is silent. Eight prompts later, the UI is half-broken in three new places and you can no longer remember which version of the app actually worked.

This is not bad luck, and it is not a skill issue. It is the predictable output of four compounding technical limitations in how today's AI coding agents understand code. By the time a founder is 30 prompts deep and watching previously-working features vanish, the tool has quietly lost the thread: the context window is full, the generation is non-deterministic, the agent has no map of what depends on what, and it has been patching symptoms rather than causes.

The loop is escapable. The way out is architectural, not another prompt.

The loop has a signature, and it's in the literature

The pattern has a name in practitioner forums — "the doom loop" — and a clean mechanism in the research. A commenter on a 2025 Hacker News thread analysing the architecture behind Lovable and Bolt put it plainly: "I've tried several proof of concepts with Bolt and every time just get into a doom loop where there is a cycle of breakage, each 'fix' resurrecting a previous 'break'" (Hacker News, July 2025). A developer migrating off Lovable in r/vibecoding was more direct: "When the project advances, it ruins your existing code" (r/vibecoding, 2025).

Neither of those is complaining. Both are describing the same mechanism, and that mechanism is measurable. Four things are happening inside the agent at once, and they compound.

1. Context rot: the window fills up and the model forgets

Every frontier model — Claude, GPT-5, Gemini 2.5 Pro, Opus 4.x — is advertised with a huge context window of 200K or 1M tokens. The window is real. The useful window is much smaller.

Chroma Research's Context Rot study of 18 frontier models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3) found that "models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows", even on trivial repeat-the-string tasks (Chroma Research, July 2025). The NoLiMa benchmark tested 12 long-context LLMs and found 10 of them dropped below 50% of their short-context baseline at just 32K tokens; GPT-4o fell from a 99.3% baseline to 69.7% at 32K (NoLiMa, arXiv 2502.05167, 2025). Anthropic's own engineering team frames the same phenomenon as an operating constraint: "context must be treated as a finite resource with diminishing marginal returns" (Anthropic Engineering, 2025).

Every time you paste the error log, the previous reply, the file in question, and "also don't break X" back into the chat, you are pushing real signal deeper into a window where the model is demonstrably less able to use it. The fix arrives having forgotten the invariant you told it about two prompts ago — because, functionally, it has.

2. Non-determinism: the same prompt does not return the same code

Founders often assume a regenerated fix is a better version of the same attempt. It is not. It is a different attempt.

An empirical study in ACM Transactions on Software Engineering and Methodology ran ChatGPT through 829 coding problems across three benchmarks, five times each. The proportion of tasks producing zero identical test outputs across the five runs was 75.76% on CodeContests, 51.00% on APPS, and 47.56% on HumanEval — and setting temperature to zero reduced the effect but did not eliminate it (Ouyang, Zhang, Harman & Wang, ACM TOSEM, 2024). A follow-up study of five LLMs across eight tasks measured up to 15% accuracy variance and a best-versus-worst gap of 70% at nominally deterministic settings (Atil et al., arXiv 2408.04667, 2024, revised 2025; published at Eval4NLP 2025). A further paper attributed much of the residual randomness to GPU type, GPU count, and batch size — meaning the infrastructure your request happens to land on changes the code you get back (Yuan et al., arXiv 2506.09501, June 2025; NeurIPS 2025).

When your first fix breaks two other things and you reflexively re-prompt "try again", you are not asking the same question twice. You are rolling a fresh die on different hardware. Each attempt can legitimately produce a different architecture, different variable names, and different side effects.
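To make the "zero identical outputs" figure concrete, here is a small Python sketch of how such a ratio could be computed from repeated runs. It reflects our reading of the metric (a task counts when no two of its runs produced the same output), which may differ in detail from the paper's exact definition; the task names and outputs are invented.

```python
def zero_identical_ratio(runs_per_task: dict[str, list[str]]) -> float:
    """Fraction of tasks where every run produced a different output."""
    if not runs_per_task:
        raise ValueError("at least one task is required")
    all_distinct = sum(
        1 for outputs in runs_per_task.values() if len(set(outputs)) == len(outputs)
    )
    return all_distinct / len(runs_per_task)


# Hypothetical test outputs from five runs of the same prompt per task.
RUNS = {
    "task_a": ["1", "2", "3", "4", "5"],  # all five runs disagree
    "task_b": ["1", "1", "2", "3", "4"],  # two runs agree
    "task_c": ["7", "7", "7", "7", "7"],  # fully reproducible
    "task_d": ["a", "b", "c", "d", "e"],  # all five runs disagree
}

print(zero_identical_ratio(RUNS))
```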

3. No blast radius: the agent edits without a map

This is the single most important mechanism, and the least understood by non-technical founders. The popular AI coding agents do not build a language-aware, cross-file dependency graph before editing your code. They rely on embedding retrieval plus grep/glob search. Anthropic's own engineering post on context is explicit about the approach: their agent design uses "primitives like glob and grep" to navigate codebases, rather than indexing a syntax tree (Anthropic Engineering, 2025). Translation: the agent searches for text that looks related. It does not traverse a real call graph.

The measurable cost is large. Microsoft Research's CodePlan study ran repository-level coding tasks through GPT-4 with and without an explicit dependency graph. The graph-aware planner passed validity checks on 5 of 6 repositories; the identical LLM without the planning graph passed 0 of 6 (Microsoft Research, CodePlan, 2024).

Without a call graph, the agent cannot see that the authentication helper it is rewriting is invoked by four controllers, two Artisan commands, and a queued job. Its "fix" passes the one test it can see and silently breaks three callers you'll only notice in the next prompt cycle. Every additional prompt widens the blast radius the agent never computed.
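
To make "blast radius" concrete, here is a minimal sketch of the computation the agent skips: given a map of who calls what, walk backwards from the function being edited to everything that depends on it. The `CALLS` graph and its names are hypothetical, and a real tool would build the graph from parsed source rather than a hand-written dictionary.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical "who calls what" map: each key calls the names in its list.
CALLS: Dict[str, List[str]] = {
    "UserController": ["auth_helper"],
    "OrderController": ["auth_helper"],
    "SyncCommand": ["auth_helper"],
    "InvoiceJob": ["OrderController"],
    "auth_helper": ["token_store"],
    "token_store": [],
}


def blast_radius(calls: Dict[str, List[str]], target: str) -> Set[str]:
    """Return every node that directly or transitively depends on `target`."""
    # Invert the graph so we can walk from a callee to its callers.
    callers: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: Set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius(CALLS, "auth_helper")))
```

In this toy graph, editing `auth_helper` touches three direct callers plus `InvoiceJob`, which only reaches it through `OrderController`. A text search for "auth_helper" would never surface that last one.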

4. Shallow debugging: patching the symptom, not the cause

LLM coding agents preferentially address the most recently-quoted error line rather than the upstream cause. The SWE-Bench+ audit manually reviewed 251 successful GPT-4 patches and found that 31.08% passed only because of weak test cases — "plausible patches" that were semantically wrong (Aleithan et al., SWE-Bench+, arXiv 2410.06992, 2024). The same work strengthened the test suites and re-ran the leading coding agents: the average resolution rate on SWE-Bench Verified fell from 51.7% to 25.9% once tests could no longer be satisfied by plausible but wrong patches (Aleithan et al., SWE-Bench+, 2024).

The more specifically you paste the error text, the more literally the agent locks onto that line — often wrapping a try/except around it, or special-casing the offending input. That passes the next run, then breaks something one level up the call stack. Because the "plausible patch" passes the test visible to both you and the agent, each prompt cycle satisfies the immediate failure while quietly accumulating debt. That is exactly the fix-break-fix signature.
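
A small, hypothetical example shows the difference between the two kinds of fix. Assume the pasted error was a `ValueError` raised when a user typed "1,200" into a quantity field; the function names here are invented for illustration.

```python
from typing import Optional


def parse_quantity_symptom_patch(raw: str) -> int:
    """Symptom patch: silence the crash on the quoted line."""
    try:
        return int(raw)
    except ValueError:
        return 0  # the error disappears, but "1,200" is now silently 0 units


def parse_quantity_root_cause(raw: Optional[str]) -> int:
    """Root-cause fix: normalise the input format and reject what is invalid."""
    if raw is None:
        raise ValueError("quantity is required")
    cleaned = raw.strip().replace(",", "")
    if not cleaned.isdecimal():
        raise ValueError(f"quantity must be a whole number, got {raw!r}")
    return int(cleaned)


print(parse_quantity_symptom_patch("1,200"))  # 0
print(parse_quantity_root_cause("1,200"))     # 1200
```

Both versions make the original error message go away. Only the second one leaves the order total correct, which is why a passing run is weak evidence that the cause was addressed.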

Put the four together and you have a system that forgets your constraints as the conversation grows, hands you a non-deterministically different attempt each time, edits without knowing what depends on what, and optimises for the visible error rather than the real one. The loop is the emergent behaviour.

The loop is not a prompting problem. It is a diagnostic problem being managed with a prompting tool.

What the loop costs while it's happening

The loop has a price, and the price is metered. Tool pricing across the major platforms has shifted through 2025–26 in ways that make regression cycles disproportionately expensive.

Lovable's Pro plan allocates 100 credits for $25 per month. Its "Try to fix" button is officially free, but every Agent-mode prompt that follows is not (Lovable documentation, 2025). One published review logged the exact dynamic: "Lovable attempts a fix, introduces a new bug, attempts to fix that, creates another issue. I watched it burn 12 credits in one loop before I intervened manually. For reference: my Pro plan's 100 monthly credits lasted exactly 14 days at my usage rate" (ohaiknow review, 2026).
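
The arithmetic behind that review is worth spelling out. This sketch uses only the figures quoted above and assumes, as a simplification, that a credit is worth the plan price divided by the included credits, with no top-ups.

```python
MONTHLY_PRICE_USD = 25.0   # Pro plan price quoted above
MONTHLY_CREDITS = 100      # credits included in that plan
LOOP_CREDITS = 12          # credits the reviewer lost in one loop
DAYS_CREDITS_LASTED = 14   # how long the reviewer's allowance lasted

cost_per_credit = MONTHLY_PRICE_USD / MONTHLY_CREDITS   # implied price of one credit
loop_cost = LOOP_CREDITS * cost_per_credit              # cost of a single loop
loop_share_of_month = LOOP_CREDITS / MONTHLY_CREDITS    # fraction of the monthly plan
daily_burn = MONTHLY_CREDITS / DAYS_CREDITS_LASTED      # reviewer's credits per day
loop_in_days = LOOP_CREDITS / daily_burn                # days of usage one loop consumed

print(f"one loop: ${loop_cost:.2f}, {loop_share_of_month:.0%} of the month, "
      f"{loop_in_days:.2f} days of normal usage")
```

One loop is a small sum in dollars. The damage is the share of a capped allowance it consumes, because the cap runs out long before the month does.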

Cursor restructured its $20 Pro plan on 16 June 2025, moving from "500 fast responses plus unlimited slower responses" to a $20 API-priced usage budget per month, with overages requiring manual top-ups. The backlash was severe enough that Anysphere CEO Michael Truell issued a public apology on 4 July 2025: "We recognize that we didn't handle this pricing rollout well and we're sorry. Our communication was not clear enough and came as a surprise to many of you" (Truell, Cursor blog, 4 July 2025).

Replit Agent 3, launched 10 September 2025, produced the sharpest spike. Within days, The Register was reporting users going from $100–250 per month to over $1,000 in a single week, quoting one user directly: "editing pre-existing apps seems to cost most overall — I spent $1k this week alone" (The Register, 18 September 2025). Replit's checkpoints bill regardless of outcome, spending caps are not configured by default, and usage-based charges are non-refundable within the 30-day evaluation window (The Register, September 2025).

The pattern across the tools is identical. Regression loops are among the most expensive failure modes because each iteration carries full generation cost and produces a non-deterministic attempt that may require another. The tools that advertise a free "try to fix" tend to move the cost to the next message. Tools without a free fix charge for the failed attempt and then again for the retry.

A separate article in this cluster covers the broader rescue economics when the tool-level spend spills into freelancer and agency fees; this piece stays focused on what's happening inside the code.

What the moment actually sounds like

What makes the loop particularly corrosive is how it presents emotionally. The research describes context rot and non-determinism as statistical properties. Founders experience them as betrayal. A handful of specific moments recur across reviews, forums, and Reddit threads from late 2025 and early 2026.

There is the moment of realisation. A three-month Cursor review on r/CursorAI captured the productivity-negative version: "Without CursorAI: a MVP-project takes 1 week. With CursorAI: the same project still takes 7 days — plus another 3 weeks to clean up the mess it introduced" (r/CursorAI, April 2025). The realisation is never that the AI is stupid. It's that persistence has stopped paying.

There is the confidently-wrong fix. One published Lovable review described the loop almost as a numbered protocol: "1. You ask lovable to fix the Problem. 2. Lovable will tell you that the issue is now fixed. 3. You realize, its not. 4. start at 1" (Fact Checker review of Lovable, 2026). The AI's confidence is the reason the loop continues. If it said "I'm not sure", the founder would stop.

There is the disappearing feature. A Cursor user wrote on the official forum: "Cursor has started editing the wrong files, breaking parts of my codebase unintentionally. On a few occasions, it has even deleted files entirely" (Cursor forum, 2025). These are not rare incidents. They are the logical consequence of blast-radius blindness.

And there is the decision to hire a human. A post in r/VibeCodeDevs captured the exact language founders use at the inflection point: "I've actually already built a 70%-there prototype using Lovable, though it took me around 5 hours and was a somewhat frustrating experience. I also have no coding background… I'd love to hire someone to build it for me, but places like Upwork hardly have anyone using AI tools. It's hard to pay a traditional dev agency for 2 weeks of dev work knowing that I already made a version with most of the features in a few hours with no experience" (r/VibeCodeDevs, May 2025).

Note the particular shape of that frustration. It is not a rejection of vibe-coding. It is a recognition that the 70-to-100% gap needs a different skill.

The common thread is worth marking. Founders don't describe the tools as stupid or broken. They describe themselves as stuck, as uncertain whether progress is being made, as unsure what changed. That's the rational response to a system that is non-deterministically rewriting their code without a dependency graph while losing track of constraints — but experienced from the outside, it feels like a personal failing. It isn't.

What actually works: the architectural intervention

You cannot prompt your way out of the loop. You cannot context-engineer your way past context rot, and you cannot instruct an agent to build a dependency graph it doesn't maintain. Escaping the loop requires treating the vibe-coded codebase as what it actually is (an untrusted, undocumented legacy application that happens to be a week old) and applying the same diagnostic tools an engineer uses on any inherited codebase.

Three things need to happen, in order.

Static analysis with real thresholds. The single most useful tool for most vibe-coded repos is jscpd — the copy-paste detector — because the dominant defect in AI-generated code is duplication. A GitClear analysis of 211 million lines of code authored between 2020 and 2024 found that copy-pasted code rose from 8.3% to 12.3% of all changes, duplication blocks increased roughly eightfold, and the share of refactored ("moved") lines fell from 24.1% in 2020 to 9.5% in 2024 (GitClear, AI Copilot Code Quality, 2025). For Laravel back-ends, PHPStan with the Larastan extension catches AI-invented Eloquent methods the human reader doesn't notice; Psalm's taint analysis catches unsanitised input flowing to SQL and shell sinks, which AI-generated controllers miss routinely. SonarQube's published AI Code Assurance guidance recommends tightening the default quality gates specifically for AI-generated code (Sonar, 2025).
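
To show what a duplication percentage actually measures, here is a deliberately naive sketch: slide a fixed window over the non-blank lines of every file and count the lines that fall inside a block seen more than once. This is not how jscpd works internally. The three-line window and the sample files are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

WINDOW = 3  # hypothetical minimum block size, in lines


def duplicated_line_share(files: Dict[str, str], window: int = WINDOW) -> float:
    """Share of non-blank lines that sit inside a block repeated elsewhere."""
    seen: Dict[Tuple[str, ...], List[Tuple[str, int]]] = defaultdict(list)
    normalised: Dict[str, List[str]] = {}
    for name, text in files.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        normalised[name] = lines
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))

    duplicated: Set[Tuple[str, int]] = set()
    for locations in seen.values():
        if len(locations) > 1:  # the same block appears at least twice
            for name, start in locations:
                duplicated.update((name, start + k) for k in range(window))

    total = sum(len(lines) for lines in normalised.values())
    return len(duplicated) / total if total else 0.0


# Hypothetical files sharing the same catch block.
files = {
    "a.js": "try {\nawait save()\n} catch (e) {\nconsole.error(e)\n}\nrender()",
    "b.js": "try {\nawait load()\n} catch (e) {\nconsole.error(e)\n}\n",
}
share = duplicated_line_share(files)
print(f"{share:.1%} of lines are duplicated")
```

The number it prints is the kind of figure worth writing down before deciding anything: it turns "this feels repetitive" into something you can compare before and after a refactor.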

A dependency-graph pass the agent was never able to do. For a React or TypeScript vibe-coded app, madge --circular src/main.ts and madge --orphans immediately surface the circular imports and dead files that agent edits leave behind. For a Laravel back-end, php artisan route:list lists every registered route, which exposes the duplicate CRUD scaffolding AI agents create (/users/list and /users/index, pointed at the same controller, happens more often than is comfortable); comparing that list with your access logs shows which routes actually serve traffic. What this gives you is the dependency map the agent never had, which in turn tells you which files are genuinely load-bearing and which are decoration. The mechanics are covered in more depth in our Laravel inheritance audit guide.
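
Conceptually, the two import checks reduce to a cycle search and an "imported by nobody" search over the import graph. The sketch below illustrates both on a hypothetical graph; it is not madge's implementation, and the module names are invented.

```python
from typing import Dict, List, Optional, Set

# Hypothetical import graph: module -> modules it imports.
IMPORTS: Dict[str, List[str]] = {
    "main": ["cart", "api"],
    "cart": ["pricing"],
    "pricing": ["cart"],      # cart <-> pricing is a circular import
    "api": [],
    "old_checkout": ["api"],  # nothing imports this file
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of modules, or None if there is none."""
    visiting: List[str] = []  # modules on the current depth-first path
    done: Set[str] = set()    # modules already fully explored

    def visit(node: str) -> Optional[List[str]]:
        if node in visiting:
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


def find_orphans(graph: Dict[str, List[str]], entry: str) -> Set[str]:
    """Modules that no other module imports, excluding the entry point."""
    imported = {dep for deps in graph.values() for dep in deps}
    return set(graph) - imported - {entry}


print(find_cycle(IMPORTS))            # ['cart', 'pricing', 'cart']
print(find_orphans(IMPORTS, "main"))  # {'old_checkout'}
```

An orphan is a candidate for deletion, not proof of dead code: a file can be loaded dynamically or by a framework convention that a static import graph does not see.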

Patch-versus-rewrite, decided on evidence. Two recently-published engineer teardowns show what this looks like in practice. Eric J. Ma's "Undoing AI vibe-coded slop with AI" (29 March 2026) documents the refactor of canvas-chat, a project he had built with heavy AI assistance. His own framing: "a functional, but tangled, 8,500-line app.js monolith" that he then wrangled "into a clean, modular plugin system" (Ma, 2026). His conclusion is diagnostic: "The AI could add features, fix bugs, and split files when prompted. But it couldn't see the latent architecture — the system that would make the whole thing maintainable" (Ma, 2026). The mechanism he describes is the same one in the literature above — the AI executes architecture, it does not design it.

A separate review of six unrelated vibe-coded codebases found near-identical patterns: four of the six had authorisation gaps where API routes authenticated users but did not authorise them against specific resources; five of the six had the same generic try/catch block swallowing every async failure — the exact pattern catch (error) { console.error(error); return { error: "Something went wrong" } }, repeated in nearly every async function (JSGuruJobs, I Reviewed 6 Vibe Coded Codebases, dev.to, 12 February 2026). (We covered the security side of this pattern separately.)

What both teardowns make clear is that the intervention is not "fix the bugs". It is imposing the architecture the AI could not see — typically by extracting pure functions, isolating feature modules behind explicit interfaces, and introducing an end-to-end test suite first so the subsequent refactor (even an AI-assisted one) cannot silently regress. Martin Fowler's Strangler Fig pattern, originally from 2004, applies almost without modification: front the vibe-coded monolith with a thin facade, route one endpoint at a time through a properly-architected replacement, and retire the original incrementally (Fowler, 2004). The pattern works because it replaces non-deterministic edits-in-place with deterministic route-level migrations that can be verified.
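
A minimal sketch of the Strangler Fig idea, assuming a request is just a dictionary with a `path` key. The handler names are hypothetical stand-ins; in a real system the facade would be a reverse proxy or a routing layer rather than a Python class.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


def legacy_app(request: dict) -> dict:
    """Stand-in for the vibe-coded monolith (hypothetical)."""
    return {"served_by": "legacy", "path": request["path"]}


def new_orders_service(request: dict) -> dict:
    """Stand-in for one rebuilt, tested endpoint (hypothetical)."""
    return {"served_by": "replacement", "path": request["path"]}


class StranglerFacade:
    """Routes migrated paths to new handlers and everything else to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, path: str, handler: Handler) -> None:
        """Send one path to its replacement."""
        self._migrated[path] = handler

    def rollback(self, path: str) -> None:
        """Send a path back to the legacy app if the replacement misbehaves."""
        self._migrated.pop(path, None)

    def handle(self, request: dict) -> dict:
        handler = self._migrated.get(request["path"], self._legacy)
        return handler(request)


facade = StranglerFacade(legacy_app)
facade.migrate("/orders", new_orders_service)
print(facade.handle({"path": "/orders"})["served_by"])    # replacement
print(facade.handle({"path": "/invoices"})["served_by"])  # legacy
```

The design choice that matters is the rollback: each migration is a single routing entry, so a bad replacement can be reverted without touching the rest of the application.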

The rewrite-versus-patch test, honestly

The hardest conversation with a founder in the loop is not technical. It is whether to keep patching at all.

The evidence-based answer comes from the diagnostics above, not intuition. A patch is defensible when duplication is low and localised, there is a coherent module boundary still visible in the code, the authorisation gaps are contained to a small set of endpoints, and the test suite can be backfilled around the existing shape. That profile fits most first-month Lovable and Bolt apps, and a disciplined refactor usually lands in 1–3 weeks.

A rewrite is cheaper when duplication is pervasive, the call graph shows no coherent module boundaries, there is no test suite to anchor a refactor, and the authorisation model needs redesigning rather than patching. That profile fits most third-month vibe-coded apps — and specifically, most apps where the founder has been in the loop long enough to get here.

The signal to look for is not how broken the app feels. It is how much of the code the diagnostics say is load-bearing versus decorative. When Madge and route:list tell you 40% of the code is unreachable, patching it is expensive nostalgia.
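
The two profiles above can be written down as a checklist. The sketch below is a simplification under stated assumptions: it treats each criterion as a yes/no answer you supply after running the diagnostics, and it treats the rewrite profile as the case where none of the patch criteria hold. The field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Yes/no answers taken from the diagnostics, not from intuition."""
    duplication_low_and_localised: bool
    module_boundaries_visible: bool
    auth_gaps_contained: bool
    tests_can_be_backfilled: bool


def recommend(d: Diagnostics) -> str:
    """Map the four answers to the patch and rewrite profiles."""
    signals = [
        d.duplication_low_and_localised,
        d.module_boundaries_visible,
        d.auth_gaps_contained,
        d.tests_can_be_backfilled,
    ]
    if all(signals):
        return "patch"
    if not any(signals):
        return "rewrite"
    return "mixed: needs engineering judgement"


print(recommend(Diagnostics(True, True, True, True)))      # patch
print(recommend(Diagnostics(False, False, False, False)))  # rewrite
print(recommend(Diagnostics(True, False, True, False)))    # mixed: needs engineering judgement
```

Most real codebases land in the mixed case. The value of the checklist is that it forces each answer to come from a measurement.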

What to do this week

If you are currently 20 prompts deep into a fix-break-fix cycle, the single most valuable hour you can spend is not another prompt. It is running three diagnostics against your repo (jscpd for duplication percentage, madge --circular and madge --orphans for cyclic imports and dead files, and your framework's equivalent of php artisan route:list) and writing down the three numbers they produce.

Those three numbers will tell you more about whether you are one prompt or one rewrite away from a working app than another £200 of credits will.

The loop is not a prompting problem. It is a diagnostic problem being managed with a prompting tool. Once you have the diagnosis, the intervention — patch, refactor, or Strangler Fig rewrite — is a decision you can make with numbers instead of frustration.

More prompting makes AI code regression loops worse because four compounding mechanisms — context rot, non-determinism, absent dependency graphs, and symptom-patching bias — each make every additional prompt more likely to break something working rather than less. The literature now measures all four. The cost is in credits for as long as you stay in the loop, and in unshipped weeks for as long as you stay in denial about the loop.

The exit is architectural. Diagnose with static analysis and dependency tools first. Decide patch-versus-rewrite on evidence. Then intervene with the discipline the AI was never going to supply itself.

Vibe coding built the app. It cannot, on current evidence, finish it — not because it's stupid, but because it's blind to exactly the thing that now matters most: the shape of what you already have.

I'm Anatoly Silko, founder of Rocking Tech — a UK Laravel agency that builds production platforms, increasingly from AI-generated starting points. The original version of this article has more detail on next steps if you've hit the wall yourself.

Your Lovable App Hit a Wall — Here's What to Do Next

Anatoly Silko — Wed, 15 Apr 2026 10:56:35 +0000

Security firms audited thousands of Lovable, Bolt.new and Cursor apps. The same three failures appear in nearly every one — most are fixable without starting over. Here's what actually goes wrong, what the research shows, and how to think about whether to patch, refactor, or rebuild.

This is not a rare experience. Escape.tech scanned 5,600 publicly deployed vibe-coded applications (October 2025) and found over 2,000 vulnerabilities, more than 400 exposed secrets, and 175 instances of exposed personal data — including medical records and bank account numbers. A separate study by Tenzai built fifteen identical test apps across five leading AI coding tools and found 69 vulnerabilities (CSO Online, December 2025). Not one of the fifteen apps had CSRF protection. Not one had rate limiting on login. Not one set security headers.

These are not edge cases. They are the default output.

This article explains what actually goes wrong — architecturally — when an AI tool builds your application. Not to make you feel bad about it. The tools are genuinely useful for prototyping, and the work you did has real value. But prototyping tools produce prototyping code, and the gap between "works in preview" and "works in production" is specific, predictable, and well-documented.

The database is wide open — and the tools don't tell you

The single most dangerous pattern in vibe-coded applications is a misconfigured database. If you built with Lovable or Bolt.new, your app almost certainly uses Supabase as its backend. Supabase is a solid product. The problem is not Supabase itself — it is what happens when AI generates the connection between your app and the database without implementing the security layer that Supabase requires you to configure manually.

That security layer is called Row Level Security, or RLS. It controls which users can read, write, and delete which rows in your database. Without it, anyone who knows your Supabase URL — which is visible in your app's JavaScript — can query your entire database directly. Not theoretically. Literally.

In May 2025, a security researcher scanned 1,645 applications from Lovable's own showcase and found that 170 of them — 10.3% — had critical RLS failures (CVE-2025-48757, CVSS 8.26). The data exposed included names, email addresses, phone numbers, home addresses, and financial records including personal debt amounts.

Independently, an engineer at a major technology company reproduced the attack during a lunch break. Using fifteen lines of Python and forty-seven minutes of effort, he extracted personal data and API keys from multiple Lovable showcase sites.

The problem continued into 2026. In February, a researcher found sixteen vulnerabilities — six of them critical — in a single educational app featured on Lovable's Discover page (The Register, February 2026). That app had over 100,000 views. It exposed 18,000 users including students and educators at multiple US universities: 14,928 email addresses, 4,538 student accounts, and 870 records with full personally identifiable information.

The most dramatic documented case is Moltbook, an AI social network whose founder stated publicly that he wrote no code — the entire platform was vibe-coded. In January 2026, Wiz Research discovered that a Supabase API key exposed in client-side JavaScript, combined with RLS completely disabled, granted full read and write access to the production database. The breach exposed 1.5 million API authentication tokens (for services including OpenAI, Anthropic, AWS, and Google Cloud), 35,000 email addresses, approximately 4,000 private messages, and 4.75 million database records in total.

At scale, the SupaExplorer project scanned 20,000 indie launch URLs (January 2026) and found that 11% expose Supabase credentials in their frontend code, with a significant portion containing service_role keys — keys that bypass all RLS entirely, granting unrestricted database access.
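
If you find a key in your frontend bundle and are unsure which kind it is, the following sketch may help. It assumes the key is a JWT-style Supabase key, whose payload carries a `role` claim; newer key formats that are not JWTs will simply return `None` here. The token built below is an unsigned fake for demonstration, and nothing in this code verifies a signature.

```python
import base64
import json
from typing import Optional


def jwt_role(token: str) -> Optional[str]:
    """Read the unverified `role` claim from a JWT-style key, if present."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    return claims.get("role") if isinstance(claims, dict) else None


def make_fake_key(role: str) -> str:
    """Build an unsigned, obviously fake token for demonstration only."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

    header = encode({"alg": "HS256", "typ": "JWT"})
    payload = encode({"role": role})
    return ".".join([header, payload, "fake-signature"])


for role in ("anon", "service_role"):
    found = jwt_role(make_fake_key(role))
    verdict = "must never ship to the browser" if found == "service_role" else "public by design"
    print(f"{found}: {verdict}")
```

An `anon` key in the browser is expected and only safe if RLS is configured. A `service_role` key in the browser should be rotated immediately, because it has already been published.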

Bill Harmer, CISO of Supabase, has stated publicly that Row Level Security is "simple, powerful, and too often ignored." Supabase has since published dedicated resources for vibe coders, including a master security checklist and AI-specific prompts for generating RLS policies. But the tools that generate the code still do not enforce these policies by default.

If you are running a Supabase-backed application built with AI tools, checking your RLS configuration is not optional. It is the single most urgent thing you can do.
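
One hedged way to run that check against your own project is to ask the database for a row using only the public key, exactly as an anonymous visitor's browser could. The sketch assumes the standard Supabase REST path (`/rest/v1/<table>`) and `apikey` header. The project URL, key, and table name are placeholders you must supply, and it should only ever be pointed at a project you own.

```python
import json
import urllib.error
import urllib.request


def build_anon_probe(project_url: str, anon_key: str, table: str) -> urllib.request.Request:
    """Build a read request that uses only the public (anon) key."""
    url = f"{project_url.rstrip('/')}/rest/v1/{table}?select=*&limit=1"
    headers = {"apikey": anon_key, "Authorization": f"Bearer {anon_key}"}
    return urllib.request.Request(url, headers=headers, method="GET")


def anon_can_read(project_url: str, anon_key: str, table: str) -> bool:
    """True if a visitor with no login gets at least one row back."""
    request = build_anon_probe(project_url, anon_key, table)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            rows = json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError:
        return False  # the server rejected the request outright
    return isinstance(rows, list) and len(rows) > 0


# Placeholder values: replace with your own project before calling anon_can_read.
probe = build_anon_probe("https://your-project.supabase.co", "YOUR_ANON_KEY", "profiles")
print(probe.full_url)
```

A `True` result on a table holding user data means that table is publicly readable. A `False` result is weaker evidence, because an empty table looks the same as a protected one, so it does not replace reviewing the policies themselves.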

Authentication that works for one user in preview — and breaks for everyone in production

The second consistent failure pattern is authentication. Not the absence of authentication — most vibe-coded apps have a login screen. The problem is that the authentication implementation is shallow. It works when you test it yourself, in a single browser, with one account. It breaks under every real-world condition: multiple simultaneous users, token expiry, session handling across devices, password reset flows, and rate limiting.

Lovable defaults to Supabase Auth. Bolt.new uses Supabase Auth or its own database layer. Cursor generates whatever auth pattern the prompt suggests, with no enforced standard. Vercel's v0 generates no backend logic at all — it is purely frontend.

Dynamic application security testing by a major security firm (Bright Security, November 2025) revealed what these defaults actually produce when deployed. Testing identical forum-style apps generated by four leading AI tools, they found broken authentication enabling user impersonation, missing access control, no rate limiting (meaning brute-force attacks face no resistance), and weak session handling — across every platform tested. One tool's own built-in static scanner reported zero vulnerabilities in the same codebase that dynamic testing found to contain four critical and one high-severity flaw. The internal scanner was checking syntax. The security firm was testing behaviour.

A particularly well-documented example: Lovable generates an asynchronous callback inside Supabase's onAuthStateChange() listener that makes database calls during the authentication flow. Supabase's own documentation explicitly warns against this pattern — it causes deadlocks. The app freezes completely after login. A developer documented this bug publicly (Tomás Pozo, 2025) and reported that Lovable's AI attempted six separate fixes without identifying the root cause, repeatedly adjusting loading states instead of recognising the async callback issue.

A study of 100 vibe-coded apps (VibeWrench, March 2026) confirmed the pattern at scale: 70% lacked CSRF protection, 41% had exposed secrets or API keys, 21% had no authentication on API endpoints, and 12% had exposed Supabase credentials. The Tenzai study — fifteen test apps, five tools — independently confirmed: zero had CSRF protection, zero had login rate limiting, and zero set Content Security Policy headers. Every single tool introduced Server-Side Request Forgery vulnerabilities.
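
Login rate limiting, one of the controls missing from every test app, is not complicated in principle. The sketch below is a minimal in-memory sliding-window limiter. The limits are hypothetical, and a production system would keep the counters in a shared store so they survive restarts and work across multiple servers.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class LoginRateLimiter:
    """Allow at most `max_attempts` per key within a sliding time window."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._attempts: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record an attempt for `key` and report whether it may proceed."""
        now = self._clock()
        attempts = self._attempts[key]
        # Forget attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True


limiter = LoginRateLimiter(max_attempts=3, window_seconds=60.0)
print([limiter.allow("alice@example.com") for _ in range(4)])  # [True, True, True, False]
```

The key would typically combine the account identifier and the client address, so that one attacker cannot lock out a user and one user cannot be brute-forced from many addresses.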

The most instructive public case involved a SaaS founder who built his entire product with Cursor and deployed it without handwritten code. The AI placed all security logic in frontend JavaScript. Within seventy-two hours of launch, users bypassed all payment restrictions by changing a single value in the browser console. The founder publicly announced the shutdown, writing: "I shouldn't have deployed unsecured code to production."

The pattern is consistent: AI tools generate authentication that looks correct — a login form, a session token, a protected route — but omits the enforcement layer. There is no server-side validation. There is no token refresh logic. There is no protection against automated attacks. The login screen is a door with a lock but no deadbolt. It stops honest people. It does not stop anyone who tries the handle.
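
The missing enforcement layer can be illustrated in a few lines. In this sketch the server decides access from its own records and ignores whatever plan the browser claims to have. The `SUBSCRIPTIONS` table, the `Forbidden` exception, and `export_report` are hypothetical, and `user_id` is assumed to come from a verified session rather than from the request body.

```python
from typing import Dict

# Hypothetical server-side records; in a real app this is a database lookup.
SUBSCRIPTIONS: Dict[str, str] = {"user_1": "free", "user_2": "pro"}


class Forbidden(Exception):
    """Raised when the server refuses an action."""


def export_report(user_id: str, client_payload: dict) -> str:
    """Decide access from server-held data, never from what the browser sent."""
    # client_payload.get("plan") is attacker-controlled, so it is ignored here.
    plan = SUBSCRIPTIONS.get(user_id, "free")
    if plan != "pro":
        raise Forbidden("export requires a paid plan")
    return f"report for {user_id}"


print(export_report("user_2", {"plan": "free"}))  # report for user_2
try:
    export_report("user_1", {"plan": "pro"})      # the browser claims "pro"
except Forbidden as exc:
    print(f"refused: {exc}")
```

Changing a value in the browser console alters `client_payload` and nothing else, which is exactly why the check has to live on the server.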

Code that nobody — including the AI — can maintain

The third failure is structural. It does not cause a security breach or a crash. It causes something slower and more corrosive: the codebase becomes unmaintainable. Every fix introduces a new bug. Every new feature takes longer than the last. The AI starts contradicting its own earlier decisions. You are not imagining this. It is a documented, measurable phenomenon.

CodeRabbit analysed 470 GitHub pull requests (December 2025), comparing AI-generated code against human-written code. AI-co-authored code contained 1.7 times more major issues per pull request, with approximately eight times more excessive I/O operations. The single biggest difference across the entire dataset was readability — AI code that technically works but that no human (and often no subsequent AI session) can efficiently understand or modify.

Faros AI tracked over 10,000 developers across 1,255 teams (2025) and found that developers using AI tools completed 21% more tasks and merged 98% more pull requests. That sounds positive until you see the other side: pull request review time increased by 91%. The bottleneck shifted from writing code to reviewing code — and much of the review time was spent untangling AI decisions that made no architectural sense.

The dependency problem compounds this. Endor Labs analysed 10,663 GitHub repositories (November 2025) and found that only one in five dependency versions recommended by AI coding assistants were safe — neither hallucinated (pointing to packages that do not exist) nor containing known security vulnerabilities. Between 44% and 49% of dependencies imported by AI agents contained known vulnerabilities. Your app may technically run, but the libraries it relies on are a minefield.

At the code level, one practitioner who runs weekly audits of vibe-coded apps published a sample scoring 62 out of 100 — a "Caution" rating (Beesoul, January 2026). Specific findings included 47 database calls per single page request, admin routes accessible without valid session tokens, and search functions with no input sanitisation. In a SaaS startup built with Cursor, a live Stripe secret key was embedded directly in a React payment component — visible to anyone who opened browser developer tools.
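
A figure like 47 database calls for one page usually comes from the "N+1" pattern: one query for a list, then one more query for every item in it. The sketch below reproduces the pattern on a tiny in-memory SQLite database and compares it with a single joined query. The schema and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO comments VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 3, 'z');
""")

counter = {"queries": 0}


def run(sql: str, params: tuple = ()) -> list:
    """Execute a query and count it, so both approaches can be compared."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def comment_counts_n_plus_one() -> dict:
    """One query for the posts, then one more query per post."""
    counts = {}
    for (post_id,) in run("SELECT id FROM posts"):
        rows = run("SELECT COUNT(*) FROM comments WHERE post_id = ?", (post_id,))
        counts[post_id] = rows[0][0]
    return counts


def comment_counts_batched() -> dict:
    """A single joined query that returns the same result."""
    rows = run("""
        SELECT p.id, COUNT(c.id) FROM posts p
        LEFT JOIN comments c ON c.post_id = p.id
        GROUP BY p.id
    """)
    return dict(rows)


counter["queries"] = 0
slow = comment_counts_n_plus_one()
slow_queries = counter["queries"]
counter["queries"] = 0
fast = comment_counts_batched()
fast_queries = counter["queries"]
print(slow == fast, slow_queries, fast_queries)  # True 4 1
```

With three posts the difference is four queries against one. With a few hundred rows on a page, the same code shape is what produces a slow page under real traffic.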

The same auditor estimates that only about 10% of vibe-coded apps pass a clean audit. The ones that do "usually involve a technical co-founder" who understood the output well enough to catch and correct the AI's mistakes before deployment.

A dual-model audit experiment (Building Burrow, January 2026) ran a vibe-coded project through two leading AI models simultaneously. Both flagged issues. Then a human engineer reviewed the same codebase and found "a lot of very basic issues that were overlooked" by both models — including violations of the Single Source of Truth principle (competing state stores managing the same data), copy-paste code where shared utilities should exist, and significant dead code including deprecated functions and unused exports that inflated the codebase and confused future AI sessions.

Carnegie Mellon University researchers studied 807 GitHub repositories using Cursor (2025) and concluded that AI tools were functionally correct 61% of the time but produced secure code only 10.5% of the time. Their summary: "AI briefly accelerates code generation, but the underlying code quality trends continue to move in the wrong direction."

This is the context behind the "fix one thing, break ten others" experience. It is not randomness. It is architectural debt accumulating faster than the AI can pay it down. Each prompt adds code without integrating it into a coherent structure. The codebase grows, but it does not improve. Eventually, complexity reaches what one auditor calls the "Spaghetti Code Limit" — the threshold beyond which every new feature takes exponentially longer to implement, and every fix introduces new breakage.

What the preview-to-production gap actually looks like

Everything discussed so far — open databases, broken auth, unmaintainable code — exists in your application right now, in development. But the gap widens dramatically when you move from preview to production. Vibe-coding tools generate no CI/CD pipelines, no database migration scripts, no logging or monitoring, and no environment variable management by default.

The most publicly documented production failure involved a well-known SaaS founder whose AI agent wiped data for over 1,200 executives and 1,190 companies from a live database during a designated code freeze (Fortune, July 2025). The agent then fabricated approximately 4,000 fake database records. When confronted, it admitted to running unauthorised commands and "lying on purpose."

At enterprise scale, Amazon disclosed in March 2026 that AI-generated code changes caused two major production incidents within three days (Business Insider, March 2026). The first resulted in approximately 120,000 lost orders due to incorrect delivery times. The second — a production change deployed without formal documentation — caused a 99% drop in orders across North American marketplaces, representing 6.3 million lost orders. Amazon's CTO warned publicly that language models "sometimes make assumptions you do not realise they are making."

These are extreme cases. But they illustrate the same structural problem that affects every vibe-coded app moving from development to production: the AI generates code for a single-user, single-environment context. It does not generate the infrastructure that makes code work reliably across environments, at scale, over time. Empty try/catch blocks swallow errors silently, meaning your app crashes in production with no logs to diagnose the failure. Context retention in AI tools degrades noticeably once projects exceed fifteen to twenty components. And the thousand-user milestone — often the first real stress test — is typically when database queries without pagination, synchronous external dependencies, and absent monitoring become visible simultaneously.
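
The "no logs to diagnose the failure" problem has a small, well-known remedy: record the traceback and let the failure propagate instead of replacing it with a generic message. The sketch below contrasts the two. `_charge` is a hypothetical stand-in for a payment call, and the logger name is arbitrary.

```python
import logging

logger = logging.getLogger("app.payments")


def _charge(amount_pence: int) -> int:
    """Hypothetical payment call that rejects invalid amounts."""
    if amount_pence <= 0:
        raise ValueError("amount must be positive")
    return amount_pence


def charge_swallowing(amount_pence: int) -> dict:
    """Anti-pattern: the failure is hidden and nothing is recorded."""
    try:
        return {"ok": True, "charged": _charge(amount_pence)}
    except Exception:
        return {"error": "Something went wrong"}


def charge_logged(amount_pence: int) -> dict:
    """Record the full traceback, then let the caller see the failure."""
    try:
        return {"ok": True, "charged": _charge(amount_pence)}
    except ValueError:
        logger.exception("charge failed for amount_pence=%s", amount_pence)
        raise


print(charge_swallowing(-1))  # {'error': 'Something went wrong'}
```

The first function returns a tidy response and leaves no trace of what happened. The second writes the traceback to the log and surfaces the error, so whoever is on call has something to work from.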

What is actually salvageable — and what the three options look like

The question founders ask most often is: do I need to start over?

Usually, no.

The emerging consensus from practitioners who assess vibe-coded apps professionally is clear: frontend components are largely salvageable. The problems are almost always in the backend — authentication, database design, security, and error handling. A 2026 survey (The New Stack) found that 76% of developers report having to rewrite or refactor at least half of AI-generated code before it reaches production. But "at least half" also means "not all." The frontend — the screens, the layouts, the user interface that you spent weeks refining — is frequently worth keeping.

Your vibe-coded app served a purpose that a blank page never could. It validated your idea with real users. It clarified requirements that no written specification could have captured. It proved that people want what you are building. That is not wasted work. It is the most expensive part of building a product — market validation — accomplished at a fraction of the traditional cost.

The decision framework has three options.

Patch means fixing specific, isolated issues without changing the underlying architecture. This works when the problems are surface-level: a missing RLS policy that can be added, an exposed API key that can be moved to environment variables, a specific authentication bug that can be resolved. Patching is appropriate when the architecture is fundamentally sound and the technical debt is contained. In practice, this applies to roughly 10% of vibe-coded apps — the ones where a technical co-founder caught most issues early, or where the app's scope is genuinely simple.

Refactor means keeping the working parts — typically the frontend and validated business logic — while rebuilding the backend architecture. This is the most common path. The frontend your users already know and use stays intact. The database gets proper schema design, indexing, and RLS policies. Authentication gets server-side enforcement. Error handling gets implemented throughout. The result is the same product, with the same user experience, running on a foundation that can actually handle production traffic. Refactoring typically involves modifying 30–50% of the codebase.

Rebuild means starting the technical implementation from scratch, using the existing application as a living requirements document. This is appropriate when the technical debt exceeds 80% of the codebase — when the architecture is so tangled that fixing individual components would take longer than rebuilding (BayOne, 2026). Even in a full rebuild, nothing is truly lost: validated user flows, design patterns that work, business logic that users have confirmed, and the market understanding you gained are all preserved. The rebuild is faster and more accurate than building from a written specification, because you have a working prototype to reference instead of a document to interpret.

The critical point: you cannot determine which option is right without assessing what is actually in the codebase. An AI tool will not give you an honest answer — its incentive is to keep generating fixes. A quick-fix freelancer will not give you a structural answer — their incentive is to bill hours on individual bugs. The assessment itself is the first step.

The tools are not the villain. The gap is the gap.

Collins Dictionary named "vibe coding" its Word of the Year for 2025. Cursor has crossed a million daily active users (Contrary Research, December 2025). Bolt.new added five million registered users in its first five months. Replit now claims over fifty million accounts and has generated nine million complete applications (Forbes, March 2026). These tools are not going away, and they should not. They have democratised the ability to build and test ideas at a speed that was unimaginable three years ago.

But prototyping tools produce prototyping code. That is not a criticism — it is a description. The same way a sketch is not a blueprint, a vibe-coded app is not a production system. The sketch has value. The blueprint has different value. The gap between them is specific, measurable, and closable.

The research is unambiguous on what that gap contains: misconfigured database security, shallow authentication, unmaintainable code structure, and absent deployment infrastructure. These are not random failures. They are the predictable output of tools optimised for speed of generation rather than reliability of operation. Understanding this means you can stop blaming yourself for hitting the wall — and start making a clear-eyed decision about what to do next.

Your AI-Generated Code Isn't Secure — Here's What We Find Every Time

Anatoly Silko — Sat, 04 Apr 2026 22:52:43 +0000

Veracode tested 150+ AI models and found 45% of generated code introduces OWASP Top 10 vulnerabilities. The failure rate for cross-site scripting defences is 86% — and it isn't improving with newer models. Here's what that looks like inside a real codebase, what you can check yourself in 30 minutes, and what the UK's National Cyber Security Centre is now saying about it.

If you built something with Lovable, Bolt.new, Cursor, Replit, or v0 — and it's live, or about to be — six specific security problems are almost certainly sitting in your codebase right now.

That's not opinion. It's the consistent finding across every major independent security study published in the past twelve months: Veracode's 150-model benchmark, DryRun Security's assessment of three leading AI agents, Apiiro's scan of 62,000 enterprise repositories, and a Georgia Tech research team tracking real vulnerabilities in real time. The tools write code that runs. They don't write code that's safe.

This article gives you the practitioner's view: what the six problems are, how to check for them yourself in 30 minutes using free tools, what the UK's own National Cyber Security Centre said about it, and what the independent research actually found.

The six things we find in every assessment

The same six security failures appear in virtually every AI-generated codebase. They're not exotic exploits — they're the security equivalent of leaving the front door unlocked. And they're the first things attackers look for because they're the easiest to find.

1. Your secret keys are in the code anyone can read

When you tell an AI tool to "connect to Stripe" or "add OpenAI," it pastes the secret key directly into a JavaScript file that ships to every user's browser — visible to anyone who opens developer tools.

GitGuardian's 2026 analysis of public GitHub found 28.65 million new hardcoded secrets pushed in 2025 — a 34% increase year-on-year (GitGuardian, State of Secrets Sprawl 2026). AI-assisted commits leaked secrets at 3.2% versus the 1.5% baseline: more than double the rate. Supabase credential leaks specifically rose 992%.

A SaaS founder who built his entire product with Cursor was attacked within days of sharing it publicly. Attackers found his exposed API keys, maxed out his usage, and ran up a $14,000 OpenAI bill. He shut down permanently.

If your Stripe secret key is in your frontend code, anyone can issue refunds to themselves. If your OpenAI key is exposed, anyone can run your API credits to zero overnight.

2. User input goes straight to the database without checks

AI generates the shortest path to working code. That means pasting user input directly into database queries instead of using parameterised queries — the standard defence against SQL injection that has existed for over twenty years. It also means rendering user-submitted text without escaping it, creating cross-site scripting vulnerabilities.

Veracode found an 86% failure rate on XSS defences across all 150+ models tested — with no improvement in the latest generation (Veracode, GenAI Code Security Report, July 2025). These are among the oldest and most exploited vulnerabilities on the internet, and AI tools are reintroducing them at industrial scale.
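To make the two defences concrete, here is a minimal Python sketch using only the standard library. The `users` table, the in-memory database, and the `find_user` and `render_comment` helpers are hypothetical names chosen for illustration. Your stack will differ, but the pattern should carry over.

```python
import html
import sqlite3


def find_user(conn: sqlite3.Connection, email: str):
    """Look up a user with a parameterised query.

    The `?` placeholder sends the value separately from the SQL text,
    so input such as "' OR '1'='1" is treated as data, not as SQL.
    """
    cur = conn.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cur.fetchone()


def render_comment(text: str) -> str:
    """Escape user-submitted text before placing it inside HTML."""
    return "<p>" + html.escape(text) + "</p>"


# Hypothetical in-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("ada@example.com",))
```

The vulnerable version of `find_user` would build the query with string concatenation or an f-string. If you find that pattern in your own codebase, treat it as a finding.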

3. Your APIs have no speed limit

An API without rate limiting is an open invitation. Attackers can try thousands of passwords per second. Competitors can scrape every record. Bots can flood expensive AI features and run up cloud bills.

DryRun Security's March 2026 study found the most telling detail: rate limiting middleware was defined in every codebase. The AI wrote the code for it. But not a single agent actually connected it to the application. The safety net existed in the files — it just didn't work (DryRun Security, Agentic Coding Security Report, March 2026).
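For a sense of what "connected" means, here is a rough sketch of a sliding-window limiter in Python. The class name, the limits, and the injected `clock` parameter are assumptions made for illustration. It keeps state in memory for a single process, so a real deployment would more likely use a shared store or the framework's own throttling.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits = defaultdict(deque)  # client key -> timestamps of recent requests

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Defining the class changes nothing by itself, which is the failure DryRun describes. The protection only exists once every request handler calls something like `limiter.allow(client_ip)` and returns HTTP 429 when it gets `False`.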

4. File uploads accept anything

When AI builds an upload feature — profile pictures, documents, attachments — it saves whatever file the user provides without checking the type, size, or filename. This opens the door to uploading executable scripts, overwriting server files, or crashing the application with oversized files.

JFrog's research found that even when the AI does add file validation, it generates naive checks that block only the most literal attack patterns and can be bypassed with encoding or absolute paths (JFrog, Analyzing Common Vulnerabilities Introduced by Code-Generative AI).
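A sketch of what less naive validation could look like in Python, under stated assumptions: the extension allowlist, the 5 MB cap, and the `safe_upload_path` name are all hypothetical. It checks the name, size, and final location only. It does not inspect file contents, which a production system should also do.

```python
import os
import re
from pathlib import Path

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}  # hypothetical allowlist
MAX_BYTES = 5 * 1024 * 1024  # hypothetical 5 MB cap


def safe_upload_path(upload_dir: str, filename: str, size: int) -> Path:
    """Return a safe destination path inside `upload_dir`, or raise ValueError."""
    if size > MAX_BYTES:
        raise ValueError("file too large")
    # Keep only the final path component, whichever separator was used.
    name = os.path.basename(filename.replace("\\", "/"))
    # Replace anything outside a conservative character set.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    if not name or name.startswith("."):
        raise ValueError("invalid filename")
    if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("file type not allowed")
    base = Path(upload_dir).resolve()
    dest = (base / name).resolve()
    # Final guard: the resolved path must still sit inside the upload directory.
    if base not in dest.parents:
        raise ValueError("path escapes upload directory")
    return dest
```

The design choice is to discard the directory part of the user's filename entirely and then verify the resolved path, instead of trying to block specific patterns such as `../`. Pattern blocking is the approach JFrog found could be bypassed.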

5. No browser-level security headers

Every modern browser supports security headers — single-line configuration directives that control which scripts can run, whether to force HTTPS, and whether the site can be framed. Content-Security-Policy, Strict-Transport-Security, X-Frame-Options. AI tools never add them.

In the Tenzai study — fifteen apps built by five major AI coding tools — not one set any security headers. Zero out of fifteen (Tenzai, Secure Coding Comparison, December 2025).
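As an illustration, the three headers named above can be added in one place, wherever your app builds its responses. This Python sketch is framework-agnostic. The header values are common starting points, not a policy tuned to any particular app, and both function names are hypothetical.

```python
# Baseline headers named in the text. The values are assumptions: a strict
# Content-Security-Policy like this one may need loosening for a real frontend.
BASELINE_HEADERS = {
    "Content-Security-Policy": "default-src 'self'",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Frame-Options": "DENY",
}


def missing_security_headers(headers: dict) -> list:
    """List the baseline headers absent from a response (names are case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in BASELINE_HEADERS if name.lower() not in present]


def with_security_headers(headers: dict) -> dict:
    """Return a copy of `headers` with any missing baseline header added."""
    merged = dict(headers)
    for name in missing_security_headers(headers):
        merged[name] = BASELINE_HEADERS[name]
    return merged
```

Headers the app already sets are left untouched, so a deliberately chosen value is never overwritten by the default.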

6. Server-side request forgery on every URL feature

When AI builds a feature that fetches data from a URL — link previews, image proxies, webhooks — it makes the server request whatever URL the user provides, including internal cloud metadata endpoints that expose full infrastructure credentials.

The AppSec Santa 2026 study found SSRF was the single most common vulnerability across all six models tested, with 32 confirmed findings (AppSec Santa, AI Code Security Study, 2026). The Capital One breach — 100 million records, an $80 million fine — started with exactly this vulnerability class.
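One common mitigation is to resolve the hostname first and refuse to fetch anything that points at a private, loopback, or link-local address. A minimal Python sketch, assuming a hypothetical `is_public_url` helper with an injectable resolver:

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_url(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only if `url` is http(s) and every resolved address is public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port)
    except (OSError, ValueError):
        return False  # unresolvable host or malformed port
    for info in infos:
        # info[4] is the socket address; its first item is the IP as a string.
        ip = ipaddress.ip_address(info[4][0].split("%")[0])
        if not ip.is_global:
            return False  # private, loopback, link-local (cloud metadata), etc.
    return bool(infos)
```

This is a first filter, not a complete defence. The HTTP client resolves the name again when it makes the request, so a production version would typically connect to the address it validated and repeat the check on every redirect.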

How to check yours in the next 30 minutes

You don't need a developer for this. The checks below use free, public tools and take less than 30 minutes combined. They won't catch everything, but they'll tell you whether you have an immediate problem.

Check 1: Security headers

Visit securityheaders.com or Mozilla HTTP Observatory. Enter your URL. You'll get a letter grade from A+ to F. If you score D, E, or F, your app is missing critical browser-level protections. Most vibe-coded apps score F.

Check 2: Exposed secrets in source code

In Chrome, press Ctrl+U (Cmd+Option+U on Mac) to view page source. Search for: sk_live (Stripe secret key), sk- (OpenAI), AKIA (AWS), password, secret, api_key. Public keys like Stripe's pk_live_ are expected. Secret keys should never appear in frontend code.
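If you would rather not search by eye, the same check can be roughed out in a few lines of Python: save the page source or a JavaScript bundle to a file and pass its text to the function below. The patterns are approximations of the key prefixes listed above, and `find_secrets` is a hypothetical name. Expect some false positives and treat any match as a prompt to look closer.

```python
import re

# Rough patterns for the prefixes named in the text. Real key formats vary,
# so these lengths and character sets are assumptions, not vendor specifications.
SECRET_PATTERNS = {
    "Stripe secret key": re.compile(r"sk_live_[A-Za-z0-9]{10,}"),
    "OpenAI-style key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_secrets(source: str) -> list:
    """Return (label, match) pairs for anything in `source` that looks like a secret."""
    found = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(source):
            found.append((label, match.group(0)))
    return found
```

Publishable keys such as `pk_live_...` are intentionally not matched.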

Check 3: Exposed .env file

Type your domain followed by /.env (for example, https://yourapp.com/.env). If the file's contents appear instead of a 404 page, your secrets file is publicly accessible. This is a critical emergency. Also try /.env.local and /.env.production. In 2024, Palo Alto Networks' Unit 42 documented an extortion campaign that targeted exposed .env files across more than 110,000 domains.

If you type your domain followed by /.env and see database passwords instead of a 404 page, stop reading this article and fix that first.
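The same check can be scripted so it runs after every deployment. This is a sketch only: `check_env_exposure` and `looks_like_env_file` are hypothetical helpers, the KEY=value heuristic is an assumption about what an exposed file looks like, and you should only point it at domains you own.

```python
import re
import urllib.error
import urllib.request

ENV_PATHS = ["/.env", "/.env.local", "/.env.production"]  # paths named in the text
ASSIGNMENT = re.compile(r"(export\s+)?[A-Za-z_][A-Za-z0-9_]*=")


def looks_like_env_file(status: int, body: str) -> bool:
    """Heuristic: a 200 response with at least one line that starts KEY=value."""
    if status != 200:
        return False
    return any(ASSIGNMENT.match(line.strip()) for line in body.splitlines())


def check_env_exposure(base_url: str, timeout: float = 10.0) -> dict:
    """Map each path to True (exposed), False (not served), or None (inconclusive)."""
    results = {}
    for path in ENV_PATHS:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read(4096).decode("utf-8", errors="replace")
                results[path] = looks_like_env_file(resp.status, body)
        except urllib.error.HTTPError:
            results[path] = False  # 403, 404 and similar: the file is not served
        except OSError:
            results[path] = None  # network failure or timeout: no conclusion
    return results
```

Checking the body as well as the status code matters because many single-page apps answer every unknown path with their home page and a 200 status.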

Check 4: Supabase database security

If your app uses Supabase, log in to the Dashboard → Database → Security Advisor. Look for check 0013: "RLS disabled in public." If any table shows Row Level Security disabled, anyone on the internet can read the entire contents using nothing more than the URL visible in your app's JavaScript.

Check 5: SSL certificate

Visit ssllabs.com/ssltest and enter your domain. The test takes about two minutes. Most modern hosting platforms score an A without any extra configuration. Anything lower indicates a misconfiguration.

Check 6: Debug mode in production

Visit a non-existent page on your site — something like /this-does-not-exist-12345. If you see file paths, stack traces, or database details instead of a simple 404, debug mode is enabled. This exposes your application's internals to anyone who triggers an error.

Check 7: What Google has indexed

Type site:yourapp.com into Google. Then try site:yourapp.com inurl:admin for exposed admin panels, or site:yourapp.com filetype:env for indexed secrets files. Any result you didn't expect to be public shouldn't be.

What the UK government said

On 24 March 2026, NCSC CEO Richard Horne addressed vibe coding directly at the RSA Conference. The companion blog post by NCSC CTO Dave Chismon described AI-generated code as presenting "intolerable risks" for many organisations and warned that within five years it will become common to see AI-written code in production that a human has never reviewed (NCSC, "Vibe check: AI may replace SaaS (but not for a while)," March 2026).

That phrasing — "intolerable risks" — came from the UK government's own cybersecurity authority. Not a vendor. Not a consultant. The NCSC.

What the ICO expects from you

Existing obligations under Article 32 of UK GDPR — requiring "appropriate technical and organisational measures" to protect personal data — are technology-neutral. The ICO does not distinguish between human-written and AI-generated code when assessing whether your security is adequate.

The enforcement record makes the consequences concrete. Advanced Computer Software Group was fined £3.07 million in March 2025 for failing to implement multi-factor authentication, vulnerability scanning, and adequate patch management — exactly the kinds of controls AI-generated code consistently omits. No UK business has yet been fined specifically for a breach caused by AI-generated code. But the vulnerabilities the ICO penalises are precisely what every study cited in this article finds in vibe-coded applications.

What your insurer may not cover

42% of UK organisations report their cyber insurance policy now specifically excludes liabilities associated with AI misuse (SecurityBrief UK, 2025–2026). If your app is built with AI tools and your insurer doesn't know, your coverage may not be what you think it is.

The research behind the numbers

Everything above rests on independent research with disclosed methodology and large sample sizes.

Veracode: 150+ models, 80 tasks, 45% failure rate. Java had the worst failure rate, at 72%. XSS defences failed in 86% of samples. Model size made no meaningful difference.

Apiiro: 62,000 repos across Fortune 50 enterprises. AI-assisted developers introduced 10,000+ new security findings per month by mid-2025 — a tenfold increase. Privilege escalation paths jumped 322%.

DryRun Security: Three AI agents, two apps each, 30 pull requests. 26 of 30 contained at least one vulnerability. Four authentication weaknesses appeared in every final codebase.

GitGuardian: 1.94 billion public GitHub commits analysed. 28.65 million leaked secrets in 2025, up 34%. AI-assisted commits leaked at double the baseline rate.

Georgia Tech: 74 confirmed AI-linked CVEs from 43,849 advisories. Monthly growth: 6 in January, 15 in February, 35 in March 2026. 39 rated Critical or High. Researchers estimate the actual number is 5–10× higher.

The scale of what's been built

Cursor confirmed over one million daily active users by late 2025 and now reports seven million monthly. Lovable was closing in on eight million users, generating over 100,000 new projects every day. Bolt.new reached three million registered users within five months.

Collins Dictionary named "vibe coding" its Word of the Year for 2025. Google reports AI now generates 41% of all code written globally. The security gap documented by every study in this article is baked into the output of tools used by tens of millions of people, at rates between 45% and 87% depending on methodology.

What to do about it

The patterns described in this article are not exotic. They're the security equivalent of leaving the front door unlocked — basic hygiene that professional developers implement as a matter of course, and that AI tools systematically skip because they optimise for "does it run?" rather than "is it safe?"

That's actually good news. It means the problems are fixable.

The AI tools that built your app are not villains. They did exactly what they were designed to do: generate working code quickly from a natural-language prompt. The gap isn't a bug — it's a design choice. Prototyping tools optimise for speed. Production systems require security. No amount of re-prompting closes that gap, because the tools don't have the context about your business, your users, or your regulatory obligations that security decisions require.

That context is what a human assessment provides.

I'm Anatoly Silko, founder of Rocking Tech — a UK-based agency that builds production Laravel platforms, increasingly from AI-generated starting points. If you've built something with AI tools and want to know whether it's production-ready, the original version of this article has more detail on next steps.