<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Hasan</title>
    <description>The latest articles on DEV Community by Muhammad Hasan (@muhammad_hasan).</description>
    <link>https://dev.to/muhammad_hasan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3915589%2Fa03b075b-a838-4446-b4f5-9149f5d45e26.png</url>
      <title>DEV Community: Muhammad Hasan</title>
      <link>https://dev.to/muhammad_hasan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muhammad_hasan"/>
    <language>en</language>
    <item>
      <title>We built 24 apps with AI. Three platforms. 561 vulnerabilities.</title>
      <dc:creator>Muhammad Hasan</dc:creator>
      <pubDate>Fri, 29 May 2026 09:37:12 +0000</pubDate>
      <link>https://dev.to/muhammad_hasan/we-built-24-apps-with-ai-three-platforms-561-vulnerabilities-gp7</link>
      <guid>https://dev.to/muhammad_hasan/we-built-24-apps-with-ai-three-platforms-561-vulnerabilities-gp7</guid>
      <description>&lt;h1&gt;
  
  
  The experiment
&lt;/h1&gt;

&lt;p&gt;Most of what's now being built on top of AI gets called vibe coding. Type what you want, hit enter, watch a working app appear thirty seconds later. Lovable, Replit, Manus, Bolt, V0: every team we know is using one of them or trying to. We've been using them at Kolega too, partly because they're genuinely useful and partly because we wanted to know what was actually in the output.&lt;/p&gt;

&lt;p&gt;So we ran the experiment properly.&lt;/p&gt;

&lt;p&gt;Eight app categories. Three platforms. Same brief on each platform, every time. Password manager. CRM. Property management. LMS. Healthcare clinic. Loan origination. Legal case management. HR. That's twenty-four codebases in total. We pushed every one to GitHub and pointed Kolega's scanner at it.&lt;/p&gt;

&lt;p&gt;One thing we did differently to most "AI security" posts: we changed nothing. Default settings on every platform. Default templates. Default backends. No "make this secure," no "add input validation," no "review the auth flow." We did what a builder does when they sit down to ship something on a Tuesday afternoon. That's the only fair test, because it's the only test that matches what's actually shipping to production every day.&lt;/p&gt;

&lt;p&gt;Here's what came back.&lt;/p&gt;

&lt;h1&gt;
  
  
  The matrix
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7qogm6r8xtwg691ofkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7qogm6r8xtwg691ofkg.png" alt=" " width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Every app we built. Every finding. Default settings on every platform, no manual hardening before scanning.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Five hundred and sixty-one vulnerabilities. Three hundred of them critical or high. Across twenty-four apps that, on every platform's own marketing site, were called "production-ready."&lt;/p&gt;

&lt;p&gt;Zero of them came with fixes.&lt;/p&gt;

&lt;h1&gt;
  
  
  What the data actually says
&lt;/h1&gt;

&lt;p&gt;Three things stand out, and the headline is the least interesting of them.&lt;/p&gt;

&lt;p&gt;The headline is "AI builds insecure apps," which everyone already suspected. The quieter findings are the ones that change how you should think about this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Almost nothing came back clean.&lt;/strong&gt; Out of twenty-four builds, exactly one scanned with zero findings. A Lovable LMS. The other twenty-three shipped with somewhere between six and forty-six findings each. If the question is "will my vibe-coded app have vulnerabilities," the answer is "yes, with 96% certainty, based on this sample." Clean output from a generator is the exception. Not the rule. Not close to the rule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total findings hide the real story.&lt;/strong&gt; Replit and Manus look similar on totals: 26.1 and 29.9 vulnerabilities per app on average. Lovable averages 14.1, which makes Lovable look like the clear winner. But on criticals, the ones that actually breach you, the ranking holds while the gaps change. Lovable averages 2 criticals per app. Replit averages 3. Manus averages 5. Across eight builds each, that's sixteen, twenty-four, and forty critical vulnerabilities respectively. Replit and Manus "look similar" on totals, yet Manus ships roughly 67% more breach-grade defects than Replit (forty against twenty-four). If you only look at totals, you'd never know.&lt;/p&gt;
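&lt;p&gt;The arithmetic behind those totals is easy to check. A minimal sketch, using only the per-app critical averages quoted above and the eight builds per platform; the function names are ours, for illustration only:&lt;/p&gt;

```python
# Per-app critical averages as quoted in the text; eight builds per platform.
BUILDS_PER_PLATFORM = 8
avg_criticals = {"Lovable": 2, "Replit": 3, "Manus": 5}


def total_criticals(avg, builds=BUILDS_PER_PLATFORM):
    """Total critical findings implied by a per-app average."""
    return avg * builds


def percent_more(base, other):
    """How many percent larger `other` is than `base`."""
    return (other - base) / base * 100


totals = {name: total_criticals(avg) for name, avg in avg_criticals.items()}
print(totals)

# Relative gap between the two platforms that look alike on total findings.
print(round(percent_more(totals["Replit"], totals["Manus"])))
```

&lt;p&gt;On criticals, Manus's total comes out about two-thirds higher than Replit's, even though the two sit within a few findings of each other on overall averages.&lt;/p&gt;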

&lt;p&gt;&lt;strong&gt;No platform wins everything.&lt;/strong&gt; Lovable wins six of eight categories outright. Replit takes the password manager (twelve findings, zero critical, against Manus's forty-three with nine critical on the same brief). Manus wins healthcare (twelve findings, against Lovable's thirty-nine on the same brief). The lazy version of this story is "use Lovable, avoid Manus." The honest version is that each platform has categories it's stronger at, categories it's catastrophic at, and you don't know which until you scan it. The same brief on the same day, generated by three different models, can produce a fifteen-finding app and a forty-five-finding app.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why this happens, and it's not because the models are bad
&lt;/h1&gt;

&lt;p&gt;It's tempting to look at 561 vulnerabilities and conclude that AI builders are dangerous, and you should go write your own code. That's the wrong lesson. The right lesson is that these tools were built with different goals than the ones we keep judging them by.&lt;/p&gt;

&lt;p&gt;Default settings on every one of these platforms are optimised for "something working in front of the user." That means permissive CORS so the preview pane renders. Generous database access so the demo doesn't 500. Hardcoded API keys so the build doesn't require an env-var lecture before the user sees output. Every one of those defaults makes the "hit enter, watch it work" experience smooth. Every one of them is also a vulnerability when you ship it.&lt;/p&gt;

&lt;p&gt;The training data doesn't help. The corpus these models learn from is overwhelmingly tutorials, Stack Overflow answers, and example code from documentation. Tutorials skip auth checks for clarity. Stack Overflow answers fix the specific bug being asked about and ignore everything else. Example code from documentation is built to show you the feature, not secure the feature. The training set is the world's largest collection of "this works in a tutorial" code, and that's exactly the code that comes out.&lt;/p&gt;

&lt;p&gt;And none of the platforms we tested runs a meaningful security pass before handing you the output. Lovable does have a Security Checker, which is one reason its numbers came back better. Replit and Manus don't, in any visible way. None of them has anything that reads data flow across handlers, which is the only way to catch the breach-grade bugs that actually matter.&lt;/p&gt;

&lt;p&gt;So what you get is generators that ship working code with the security posture of a sample app. Which is fine if you're prototyping. Less fine if you're shipping the thing you're going to sell to customers.&lt;/p&gt;

&lt;h1&gt;
  
  
  We adopted new tools for writing code. We didn't adopt new tools for maintaining it.
&lt;/h1&gt;

&lt;p&gt;This is the part that matters more than any single platform comparison.&lt;/p&gt;

&lt;p&gt;The thing about vibe coding isn't that the code is bad. The code is fine. The code works. The code probably looks better than what most of us would have written, in the time we would have written it. The thing about vibe coding is that it shifts where the work goes. Less time typing. More time owning what you didn't type.&lt;/p&gt;

&lt;p&gt;That shift has caught most teams flat-footed. The toolchain for writing code has changed completely in the last two years. The toolchain for maintaining code has barely budged. Most teams are still using the same scanners, the same review checklists, the same QA processes they used in 2022, applied to a codebase that's now sixty percent generated and growing.&lt;/p&gt;

&lt;p&gt;It doesn't work. Old SAST tools were tuned for the bugs humans write. They flag the missing input check, the SQL injection, the unsafe deserialization, the SSRF. They don't flag the BOLA bug we wrote about last month, where the auth check is present but the query doesn't filter by the authenticated user. They don't flag the hardcoded Supabase key with full database access embedded in a frontend bundle. They don't flag the missing rate limit on the password reset endpoint that lets a researcher with a free account read everyone's source code. The bugs AI generates aren't the same shape as the bugs humans generate, and the tools designed to catch the latter aren't catching the former.&lt;/p&gt;
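&lt;p&gt;To make the BOLA shape concrete, here's a minimal sketch under our own assumptions: an in-memory SQLite table called &lt;code&gt;documents&lt;/code&gt; and two hypothetical handlers. It isn't taken from any of the scanned apps. Both handlers check that the caller is logged in; only the second scopes the query to that caller.&lt;/p&gt;

```python
import sqlite3


def make_db():
    """Build a throwaway database with one document per user (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, owner_id INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (id, owner_id, body) VALUES (?, ?, ?)",
        [(1, 100, "alice's notes"), (2, 200, "bob's notes")],
    )
    return conn


def get_document_vulnerable(conn, current_user_id, doc_id):
    # The auth check is present: the caller must be logged in...
    if current_user_id is None:
        raise PermissionError("login required")
    # ...but the lookup is by id alone, so any logged-in user can read any row.
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


def get_document_scoped(conn, current_user_id, doc_id):
    if current_user_id is None:
        raise PermissionError("login required")
    # Same check, but the query is also filtered by the authenticated user.
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ? AND owner_id = ?",
        (doc_id, current_user_id),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
print(get_document_vulnerable(conn, 100, 2))  # user 100 reads user 200's document
print(get_document_scoped(conn, 100, 2))      # no row comes back
```

&lt;p&gt;Nothing in the first handler looks wrong line by line, which is why a syntax-level rule has little to flag. The difference only shows up when you follow the user id from the session into the query.&lt;/p&gt;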

&lt;p&gt;That mismatch has a name. We've used it before. We called it Control Drift: the space between what your team can ship and what your team can govern. Vibe coding makes that gap wider every week, and the tools most teams rely on aren't closing it.&lt;/p&gt;

&lt;p&gt;The fix is the same one our last post pointed at, applied at a different layer. Read the data flow, not the syntax. Run that read on every PR, not once a quarter. Catch the bug at the place it's written, not the place a security researcher finds it after forty-eight days.&lt;/p&gt;

&lt;h1&gt;
  
  
  What you should actually do
&lt;/h1&gt;

&lt;p&gt;Three things, in order.&lt;/p&gt;

&lt;p&gt;Scan whatever you ship. Not your repo's main branch once a month. Every PR, every commit. The cost of catching a bug at the PR is roughly zero. The cost of catching it after a researcher posts a thread on X is whatever your company is worth.&lt;/p&gt;

&lt;p&gt;Stop trusting platform defaults. If your stack inherits CORS settings, database permissions, or auth scopes from a generator's template, treat them as starting points and tighten them. The default exists because it makes the demo work. Your production isn't a demo.&lt;/p&gt;
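&lt;p&gt;For CORS specifically, tightening usually means swapping a wildcard for an explicit allowlist. A framework-free sketch follows; the origins and the &lt;code&gt;cors_headers&lt;/code&gt; helper are hypothetical, and your stack's real CORS middleware will have its own configuration for the same idea.&lt;/p&gt;

```python
# Hypothetical production origins; replace with the domains you actually serve.
ALLOWED_ORIGINS = frozenset(
    {"https://app.example.com", "https://admin.example.com"}
)


def cors_headers(request_origin):
    """Return CORS response headers for an allowed origin, or none at all."""
    if request_origin in ALLOWED_ORIGINS:
        # Echo the specific origin rather than "*", and tell caches that the
        # response varies by the Origin request header.
        return {"Access-Control-Allow-Origin": request_origin, "Vary": "Origin"}
    return {}


print(cors_headers("https://app.example.com"))
print(cors_headers("https://preview.some-builder.example"))
```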

&lt;p&gt;Treat AI output the way you treat a junior dev's first PR. The code might be great. It also might have shipped a BOLA, a missing rate limit, and an exposed admin endpoint, and the model that wrote it has no idea which. Review the structure, not just the function.&lt;/p&gt;

&lt;h1&gt;
  
  
  Where this ends
&lt;/h1&gt;

&lt;p&gt;We're in the build-fast era. That's not changing, and we're not asking it to. Building fast is solved. Twenty-four working apps in a week, across three different platforms: that wouldn't have been possible at all in 2023. We built and tested all of them in under a month. The acceleration is real and useful and we're not giving it back.&lt;/p&gt;

&lt;p&gt;Maintaining what you build at that pace, though, is where we still haven't caught up. The tooling hasn't shifted. The reviews haven't shifted. The mental model that "the model wrote it so it's probably fine" is doing a lot of quiet damage to a lot of codebases, and the bill comes due when somebody scans the repo and finds out what's in there.&lt;/p&gt;

&lt;p&gt;That's Control Drift, again. And it's going to keep happening to AI-native teams until the industry stops pretending the scanner that worked in 2022 is going to catch the bugs that ship in 2026.&lt;/p&gt;

&lt;p&gt;We built Kolega for the maintenance half of the problem. Not because nobody else can write a scanner, but because the scanners that exist were designed for a kind of code most of us don't really write anymore. The 561 vulnerabilities in this study didn't come from "AI is dangerous." They came from "we sped up the writing and forgot to upgrade the reviewing."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you want to know what's actually in the AI-generated code you've shipped, scan it. We do that. It takes about three minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://app.kolega.dev/sign-up" rel="noopener noreferrer"&gt;kolega.dev&lt;/a&gt; — semantic analysis on every PR.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>security</category>
      <category>manus</category>
    </item>
    <item>
      <title>What "merge-ready" actually requires when an AI writes the security fix</title>
      <dc:creator>Muhammad Hasan</dc:creator>
      <pubDate>Wed, 06 May 2026 09:31:53 +0000</pubDate>
      <link>https://dev.to/muhammad_hasan/what-merge-ready-actually-requires-when-an-ai-writes-the-security-fix-40e1</link>
      <guid>https://dev.to/muhammad_hasan/what-merge-ready-actually-requires-when-an-ai-writes-the-security-fix-40e1</guid>
      <description>&lt;p&gt;Most code-scanning tools stop at "we found a vulnerability." That's the easy part.&lt;br&gt;
The hard part — the part nobody talks about until they try to ship it — is everything that happens between "vulnerability detected" and "PR a maintainer will actually merge." Tests passing. Style matching. The fix actually fixing the thing. The fix not breaking anything else. A PR description a maintainer can verify in 30 seconds.&lt;br&gt;
We work on this problem at Kolega, and we want to walk through what's in that gap honestly — including the parts we got wrong.&lt;br&gt;
The five-stage pipeline&lt;br&gt;
Every auto-generated fix in our system goes through five stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect with context: find the vuln, but also understand the code around it&lt;/li&gt;
&lt;li&gt;Generate a candidate fix: LLM-assisted, but heavily constrained&lt;/li&gt;
&lt;li&gt;Validate correctness: does it compile, do tests pass, is the vuln actually gone&lt;/li&gt;
&lt;li&gt;Match the project's style: formatting, naming, patterns&lt;/li&gt;
&lt;li&gt;Decide whether to ship it at all: some fixes shouldn't be automated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any stage fails, the PR doesn't get opened. We'd rather open zero PRs than one bad one. Maintainers will block your bot after a week of bad PRs; they won't block it for opening none.&lt;/p&gt;

&lt;h1&gt;Stage 1: Detection with context&lt;/h1&gt;

&lt;p&gt;Static analysis tools will tell you "user input flows into &lt;code&gt;eval()&lt;/code&gt; on line 47." That's true. It's also basically useless on its own, because it doesn't tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;code&gt;eval()&lt;/code&gt; is being used for&lt;/li&gt;
&lt;li&gt;Whether the input is already sanitised upstream&lt;/li&gt;
&lt;li&gt;What the function's contract is (does it need to handle strings, or always JSON?)&lt;/li&gt;
&lt;li&gt;Whether replacing &lt;code&gt;eval()&lt;/code&gt; with &lt;code&gt;JSON.parse()&lt;/code&gt; would break legitimate callers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this context, an LLM asked to "fix this" will generate something that compiles, looks reasonable, and is wrong.&lt;/p&gt;

&lt;p&gt;Our detection layer pulls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The vulnerable function and its callers (one or two hops out)&lt;/li&gt;
&lt;li&gt;Type information where available&lt;/li&gt;
&lt;li&gt;Existing tests that exercise the function&lt;/li&gt;
&lt;li&gt;Recent git history for the file (recently-touched code is more fragile)&lt;/li&gt;
&lt;li&gt;The project's dependencies and their versions&lt;/li&gt;
&lt;/ul&gt;
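&lt;p&gt;One way to picture that bundle is as a single record with a slot per item in the list above. This is an illustrative sketch, not our actual schema: the &lt;code&gt;FixContext&lt;/code&gt; name, its fields, and the sample values are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FixContext:
    """Hypothetical container for what detection hands to generation."""

    vulnerable_function: str                              # source of the flagged function
    callers: list = field(default_factory=list)           # one or two hops out
    type_hints: dict = field(default_factory=dict)        # type info where available
    covering_tests: list = field(default_factory=list)    # tests exercising the function
    recent_commits: list = field(default_factory=list)    # recent git history for the file
    dependencies: dict = field(default_factory=dict)      # package name to version

    def missing_pieces(self):
        """Names of the context fields that came back empty."""
        return [name for name, value in vars(self).items() if not value]


ctx = FixContext(
    vulnerable_function="def parse(payload): return eval(payload)",
    callers=["handlers.load_config"],
    dependencies={"requests": "2.31.0"},
)
print(ctx.missing_pieces())
```

&lt;p&gt;Keeping the empty slots visible is the useful part: a fix generated with no covering tests and no type information deserves more caution than one generated with both.&lt;/p&gt;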

&lt;p&gt;This context is what gets passed to generation. Detection isn't "where's the bug"; it's "here's everything someone reviewing a fix would need."&lt;/p&gt;

&lt;h1&gt;Stage 2: Generating the fix&lt;/h1&gt;

&lt;p&gt;We use LLMs for fix generation. We do not let them generate freely.&lt;/p&gt;

&lt;p&gt;The constraints we apply:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope locking.&lt;/strong&gt; The model is only allowed to modify a small, specified region of the file. If a fix would require changes outside that region, we surface it for human review instead of auto-generating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern catalogues.&lt;/strong&gt; For common vulnerability classes (SQL injection, prototype pollution, hardcoded secrets, missing auth checks) we have known-good fix patterns. The model picks and adapts a pattern rather than inventing one. This dramatically reduces hallucinated "fixes" that don't actually fix anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explanation alongside code.&lt;/strong&gt; The model has to produce a structured explanation of why the change works, in a format we can validate against the original CVE/CWE. Forcing the model to articulate its reasoning catches a lot of confidently-wrong outputs.&lt;/p&gt;

&lt;p&gt;The thing we learned the hard way: if the model can't explain its fix in terms of the vulnerability class, the fix is usually wrong. "I added a check" isn't an explanation. "I added a check that ensures &lt;code&gt;__proto__&lt;/code&gt; cannot be assigned via this code path, closing the prototype pollution vector identified in CWE-1321" is.&lt;/p&gt;

&lt;h1&gt;Stage 3: Validation&lt;/h1&gt;

&lt;p&gt;Generation produces a candidate. Validation decides if it's actually mergeable. We run three layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: does it compile / parse?&lt;/strong&gt; Sounds trivial; isn't, especially in dynamic languages where syntactic correctness doesn't catch broken imports or undefined references.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: do existing tests pass?&lt;/strong&gt; The fix has to leave the existing test suite green. This catches a huge class of "fix introduces regression" failures. If the project has no tests, we treat that as a signal to be more conservative, not less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: is the original vulnerability actually gone?&lt;/strong&gt; We re-run the detection step against the fixed code. If the same finding still fires, the "fix" didn't fix it. This sounds obvious, but it's a step a lot of pipelines skip, and it's the difference between security theatre and an actual fix.&lt;/p&gt;

&lt;p&gt;If a candidate fails any layer, we either regenerate (passing the failure back as additional context) or escalate to human review. We cap regenerations at three. After that, the problem isn't a bad model output; it's a fix that requires judgement we don't have.&lt;/p&gt;

&lt;h1&gt;Stage 4: Matching project style&lt;/h1&gt;

&lt;p&gt;A correctly-functioning fix that's formatted wrong, named wrong, or imported wrong will get closed without comment. Maintainers can smell bot PRs in two seconds.&lt;/p&gt;

&lt;p&gt;Things we match:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indentation, quote style, semicolons (run the project's formatter before opening the PR)&lt;/li&gt;
&lt;li&gt;Naming conventions (camelCase vs snake_case, prefix patterns)&lt;/li&gt;
&lt;li&gt;Import style (relative vs absolute, grouping)&lt;/li&gt;
&lt;li&gt;Comment style (do they write JSDoc? Do they write any comments at all?)&lt;/li&gt;
&lt;li&gt;Commit message format (Conventional Commits? Specific prefixes?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is unglamorous work, and it's where a lot of automated tools fail. You don't get a second chance at first-PR impression with a maintainer.&lt;/p&gt;

&lt;h1&gt;Stage 5: Knowing when not to ship&lt;/h1&gt;

&lt;p&gt;Some fixes shouldn't be automated. Examples from our own ruleset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Auth and authz logic.&lt;/strong&gt; A fix that changes who can access what needs human eyes. Always.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cryptographic primitives.&lt;/strong&gt; Swapping algorithms or key sizes can have downstream consequences a pipeline can't see.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code touched in the last 7 days.&lt;/strong&gt; Active development means context we don't have.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repos with no tests.&lt;/strong&gt; We can't validate the fix doesn't break anything, so we surface findings without auto-PRs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Findings below a confidence threshold.&lt;/strong&gt; Better to flag and let a human triage.&lt;/li&gt;
&lt;/ul&gt;
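&lt;p&gt;Rules like the ones above reduce to a small gate that either returns a reason to stay quiet or lets the fix proceed. A hedged sketch: the &lt;code&gt;Finding&lt;/code&gt; fields, the category labels, and the 0.8 confidence threshold are invented for illustration and are not our production values.&lt;/p&gt;

```python
from dataclasses import dataclass

# All names and thresholds here are illustrative assumptions.
HUMAN_ONLY_CATEGORIES = frozenset({"auth", "authz", "crypto"})
RECENT_EDIT_DAYS = 7      # "code touched in the last 7 days"
MIN_CONFIDENCE = 0.8      # assumed value; no specific threshold is stated above


@dataclass
class Finding:
    category: str              # e.g. "sql-injection", "auth", "crypto"
    days_since_last_edit: int  # age of the most recent change to the flagged code
    repo_has_tests: bool
    confidence: float          # detector confidence, 0.0 to 1.0


def skip_reason(finding):
    """Return why a finding should not be auto-fixed, or None if it may proceed."""
    if finding.category in HUMAN_ONLY_CATEGORIES:
        return "auth or crypto change: needs human review"
    if RECENT_EDIT_DAYS >= finding.days_since_last_edit:
        return "code under active development"
    if not finding.repo_has_tests:
        return "no tests to validate against"
    if MIN_CONFIDENCE > finding.confidence:
        return "confidence below threshold"
    return None


print(skip_reason(Finding("sql-injection", 30, True, 0.95)))
print(skip_reason(Finding("auth", 30, True, 0.95)))
```

&lt;p&gt;Returning the reason as a string, rather than a bare boolean, is what makes the "we found this but didn't auto-fix it because..." message possible.&lt;/p&gt;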

&lt;p&gt;The single biggest credibility lever for a tool like this is how often it shuts up when it doesn't know. Every false positive PR costs trust. Every "we found this but didn't auto-fix it because [specific reason]" message builds trust.&lt;/p&gt;

&lt;h1&gt;What we got wrong&lt;/h1&gt;

&lt;p&gt;A few things, in case it's useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We over-trusted model self-validation.&lt;/strong&gt; We asked the model "is this fix correct?" and weighted its answer. It said yes too often. Switching to external validation (run the tests, re-run the scanner) was the single biggest quality jump we made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We let the model write PR descriptions freely.&lt;/strong&gt; They were verbose and sometimes inaccurate. Now descriptions are templated, with the model filling specific slots: vulnerability class, file/function affected, fix pattern applied, validation results. Boring, but verifiable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We didn't track merge rate by pattern.&lt;/strong&gt; Once we did, we found two of our pattern catalogues were producing fixes that maintainers rejected for style reasons we hadn't noticed. Boring data work, big quality gain.&lt;/p&gt;

&lt;h1&gt;If you're building something similar&lt;/h1&gt;

&lt;p&gt;Three things, if you're working on any kind of automated code-modification pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Detection without context is a trap.&lt;/strong&gt; Spend more time on context than on detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Constrain your model.&lt;/strong&gt; Free-form generation is fine for prototypes. Production needs scope locks, pattern catalogues, and structured outputs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External validation beats self-validation.&lt;/strong&gt; The model can't reliably grade its own work. Run the tests. Re-run the scanner. Don't ask it.&lt;/li&gt;
&lt;/ol&gt;
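&lt;p&gt;The external-validation loop from Stage 3 is small enough to sketch. What follows is a simplified illustration, not our pipeline: &lt;code&gt;generate&lt;/code&gt; and the three check callables are stand-ins you would wire to a real model call, build, test runner, and scanner. We also read "cap regenerations at three" as one initial attempt plus up to three retries, which is an assumption.&lt;/p&gt;

```python
MAX_REGENERATIONS = 3  # the cap described in Stage 3


def validate(candidate, checks):
    """Run the layers in order; return the first failing layer's name, or None."""
    for name, check in checks:
        if not check(candidate):
            return name
    return None


def fix_or_escalate(generate, checks, max_regenerations=MAX_REGENERATIONS):
    """Try to produce a candidate that passes every external check."""
    feedback = None
    # One initial attempt plus up to `max_regenerations` retries (assumed reading).
    for _ in range(1 + max_regenerations):
        candidate = generate(feedback)
        failed_layer = validate(candidate, checks)
        if failed_layer is None:
            return ("open_pr", candidate)
        feedback = failed_layer  # pass the failure back as additional context
    return ("escalate", None)


# Stub layers: each reads a flag off a fake candidate instead of running real tools.
checks = [
    ("parses", lambda c: c["parses"]),
    ("tests_pass", lambda c: c["tests_pass"]),
    ("finding_gone", lambda c: c["finding_gone"]),
]

attempts = iter([
    {"parses": True, "tests_pass": False, "finding_gone": True},
    {"parses": True, "tests_pass": True, "finding_gone": True},
])
seen_feedback = []


def fake_generate(feedback):
    seen_feedback.append(feedback)
    return next(attempts)


decision, _ = fix_or_escalate(fake_generate, checks)
print(decision, seen_feedback)
```

&lt;p&gt;Note that the model is never asked whether its fix is correct. Every verdict comes from something outside it, and the only thing fed back is which layer failed.&lt;/p&gt;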

&lt;p&gt;Happy to go deeper on any of these in a follow-up — drop a comment if there's a stage you'd want more detail on.&lt;/p&gt;

</description>
      <category>security</category>
      <category>devops</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
