DEV Community: Muhammad Hasan

GitHub Advanced Security vs Kolega: why "it is already in our repo" is not the same as "we are covered"

Muhammad Hasan — Fri, 12 Jun 2026 17:00:00 +0000

GHAS is the one people end up on by default rather than by choice. It is right there in the repo, you flip it on, CodeQL starts scanning, job done. That convenience is the whole pitch, and it is also the trap.

Where GHAS is good

Let me be fair, because GHAS is not a weak tool. CodeQL is properly good. It does real semantic analysis, actual dataflow and taint tracking, not just pattern matching. Of all the scanners in these comparisons it is the one with the most serious engine under it. If you are all in on GitHub and you want something native that does more than grep for patterns, it is a reasonable thing to have turned on.

Where it falls down in practice

CodeQL is only as good as the queries written for it, and writing good custom queries is genuinely hard, so most teams just run the default pack and never touch it again.

So you get strong analysis pointed at a generic question, which means it is great at the vuln classes GitHub wrote queries for and quiet on everything else, especially anything specific to your own business logic. It also basically assumes you live entirely inside GitHub, and the moment you are across GitLab or Azure or a mix, the "it is already there" advantage evaporates.

The bigger thing

This is the same point that runs through all of these. GHAS finds and hands you a list. You still own the triage, you still write the fix, you still open the PR. The convenience is in the scanning being there, not in the work being done.

GHAS: scan -> here is your list -> the rest is your afternoon
Kolega: scan -> generate fix -> test in sandbox -> open PR -> you review and merge

We scan, generate the fix, test it in a sandbox, and open the PR for you to review. Different job.

The receipts

RealVuln is our open benchmark: 676 real vulnerabilities across 26 production repositories, plus 120 false positive traps to catch tools that flag everything to inflate recall.

RealVuln
- 676 real vulnerabilities
- 26 production repos
- 120 false positive traps
- fully open source

We benchmarked against the serious engines, including the frontier models, not just the easy targets, and you can run your own setup against it and check. The point of making it open is that nobody has to believe the marketing.

So which one

This is not "GHAS bad." It is the strongest default on this list. It is just that having it switched on because it came free with the repo is not the same as actually being covered, and "we have GHAS enabled" tends to be where security thinking stops rather than starts. Worth knowing the difference before you tell a customer you are secure.

Full breakdown and the benchmark: https://kolega.dev/compare/github-advanced-security/

Semgrep vs Kolega: a great floor, but a floor is not a finish line

Muhammad Hasan — Thu, 11 Jun 2026 17:00:00 +0000

Semgrep is the one we get compared to most, and honestly the one we have the most time for, so let me be fair before I get to the but.

Where Semgrep is good

Semgrep is great. It is free, it is fast, the custom rule engine is genuinely good, and "drop it in CI in an afternoon" is a real thing you can do. If you are not running anything yet, run Semgrep today. It is the sensible first move and we would tell you that even though we would rather you used us. No notes on it as a starting point.

Where it stops

Here is the but. Semgrep does exactly what it says: it matches patterns. You give it a rule, it finds things that look like the rule. That is perfect for known signatures, enforcing your own conventions, and catching the obvious stuff.

It is structurally incapable of finding things that are not a pattern, and the vulns that actually end up in incident writeups usually are not patterns:

Business logic flaws
Auth that breaks across multiple files
Second order injection
An operator precedence bug that quietly turns a permission check into a no-op (a real example we found in a secrets manager, of all things)

No rule describes those, because they are not patterns. They are the code not meaning what the author thought it meant. You cannot write a Semgrep rule for "this is subtly wrong."
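To make the operator precedence case concrete, here is a minimal Python sketch. It is a hypothetical illustration, not the actual bug from the secrets manager: the function names, the `token_valid` flag and the scope strings are all made up for the example. The point is only that `and` binds tighter than `or`, so a check can read correctly and still skip the part that matters.

```python
def can_read_buggy(token_valid: bool, scope: str) -> bool:
    # Intended: a valid token AND one of the two scopes.
    # Python parses this as (token_valid and scope == "write") or scope == "read",
    # because `and` binds tighter than `or`.
    return token_valid and scope == "write" or scope == "read"


def can_read_fixed(token_valid: bool, scope: str) -> bool:
    # Parentheses restore the intended grouping.
    return token_valid and (scope == "write" or scope == "read")


if __name__ == "__main__":
    # An invalid token with scope "read" passes the buggy check.
    print(can_read_buggy(False, "read"))
    print(can_read_fixed(False, "read"))
```

For any request with scope "read", the token check in the buggy version has no effect at all, which is the "no-op" shape described above.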

The rare case where we do not have to hand wave

Semgrep is literally on our benchmark. RealVuln is an open benchmark: 676 real vulnerabilities across 26 production repositories, plus 120 false positive traps to catch tools that flag everything to inflate recall.

RealVuln
- 676 real vulnerabilities
- 26 production repos
- 120 false positive traps
- Semgrep score: ~17%
- fully open source

Semgrep sits near the bottom, around 17 percent. Not because it is a bad tool, but because pattern matching has a ceiling and that is the ceiling. Run it yourself against the benchmark, the repo is right there. We published it specifically so nobody has to take our word for it.

So which one

This is not "Semgrep bad." Semgrep is the floor everyone should have. We are the layer that catches what rules cannot see. Best case, you run both: Semgrep for the fast pattern sweep, us for the semantic stuff underneath. We are not trying to delete Semgrep from your stack, just the assumption that it is enough on its own.

Full breakdown and the benchmark: https://kolega.dev/compare/semgrep/

Aikido vs Kolega: the all-in-one platform is wide, but wide is not deep

Muhammad Hasan — Wed, 10 Jun 2026 17:00:00 +0000

Aikido comes up a lot because it is the consolidation play. One dashboard, every scanner, fair price. So it is worth being straight about where that model is genuinely strong and where it quietly falls short.

Where Aikido is good

Credit where it is due, Aikido covers an absurd amount of surface area. SAST, SCA, IaC, container scanning, secrets, DAST, cloud posture, a runtime firewall, AI pentests, all in one place. If your problem is "we have six point tools and a mess of dashboards," Aikido genuinely solves that, and it is reasonably priced and developers like it. As a consolidation tool it is a good product, and we are not pretending to do half of what it does.

It is also worth saying plainly: Aikido does AutoTriage and AutoFix, and it opens pull requests. So this is not the lazy "they only find, we fix" comparison. They fix too.

Where the model has a ceiling

That breadth comes from bundling a stack of scanners under the hood. The actual SAST detection is largely open source engines doing the finding. Aikido's real value is the layer on top: triage to cut the noise, autofix to open the PR.

Which is great, for the vulns the underlying scanner actually found.

You cannot triage a bug you never detected.
You cannot autofix a bug you never detected.
The clever workflow runs after detection, not instead of it.

And detection is exactly where pattern based engines hit their limit. Business logic flaws, auth that breaks across files, second order injection, race conditions. The stuff that does not match a signature. Polishing the workflow around a scanner does not change what the scanner is able to see in the first place.

The receipts

RealVuln is our open benchmark: 676 real vulnerabilities across 26 production repositories, plus 120 false positive traps to catch tools that flag everything to inflate recall.

RealVuln
- 676 real vulnerabilities
- 26 production repos
- 120 false positive traps
- fully open source

The pattern based engines that power most all-in-one SAST sit at the bottom of that leaderboard. Aikido is not on it by name, but it runs the same open source engines for SAST, so you can do the maths. And because the whole thing is open source, you do not have to trust ours, you can run your own setup against it.

So which one

This is not "Aikido bad." It is a genuine difference in shape. Aikido is the widest net, covering code, cloud, containers and runtime in one platform. We are the deepest net on the one part that matters most, finding the code vulnerability before any of the workflow cleverness gets a chance to run.

Pick based on which problem you actually have. If it is "too many tools," Aikido. If it is "our scanner keeps missing the real ones," that is us.

Full breakdown and the benchmark: https://kolega.dev/compare/aikido/

Snyk vs Kolega: why pattern matching has a ceiling, and what sits above it

Muhammad Hasan — Tue, 09 Jun 2026 17:00:00 +0000

Snyk is the tool you get compared to when you build anything in this space, because it is the incumbent everyone already knows. So it is worth being straight about where it is genuinely good and where it stops.

Where Snyk is good

Credit first. Snyk is a really good pattern matcher. It is fast, the IDE plugin is nice, the dependency and SCA story is strong, and developers generally like using it. If your need is known CVEs in your dependencies and the obvious signature level stuff in your own code, it does that well and it does it quickly.

Where it stops

The catch is in the name of the category. Pattern matching finds things that match a pattern. That is perfect for known signatures and textbook issues, and structurally blind to anything that is not one.

The vulns that actually end up in incident writeups usually are not patterns:

Business logic flaws
Auth that breaks across multiple files
Second order injection
Race conditions

None of those match a rule, because they are not a shape in the code. They are the code not meaning what the author thought it meant. No signature describes "this is subtly wrong."

We do not have to argue this part

Instead of asking you to trust the claim, we published the test. RealVuln is an open benchmark: 676 real vulnerabilities across 26 production repositories, plus 120 false positive traps (code that looks exploitable but is not, to catch tools that just flag everything to inflate recall).

RealVuln
- 676 real vulnerabilities
- 26 production repos
- 120 false positive traps
- fully open source

The pattern based engines cluster at the bottom. You can run Snyk against the same benchmark yourself and check our numbers. We would honestly prefer you did, because the point of making it open source is that nobody has to take the vendor's word for it.

The other half nobody talks about

Detection is only half the job. Snyk finds and hands you a list. You still own the triage, you still write the fix, you still open the PR. That afternoon of work is yours.

We scan, generate the fix, test it in a sandbox, and open the PR for you to review and merge. Different job entirely.

So which one

This is not "Snyk bad." For dependency and SCA breadth and the in editor experience, Snyk genuinely wins right now, and if that is your whole need it is a fine tool. But if your problem is "we find 200 things and fix 15," that gap is the thing we built for.

Full breakdown and the benchmark: https://kolega.dev/compare/snyk/

We benchmarked 24 SAST tools on ~700 real vulnerabilities. The 3 best known ones came last

Muhammad Hasan — Tue, 09 Jun 2026 10:02:32 +0000

We ran 24 scanners against 26 real Python apps, ~700 labelled vulnerabilities, and scored them on how many they actually caught.

Disclosure: we built the benchmark and our own scanner is in it, which is exactly why the whole thing is open source. Rerun it yourself.

Top of the board by recall (% of real vulns found)

Kolega Enterprise - 95%
GPT-5.5 (agentic) - 58%
GLM-5.1 (agentic) - 56%
DeepSeek V4 Flash (agentic) - 55%
Claude Opus 4.8 (agentic) - 52%
...and the bottom of the board:
Semgrep - 19%
Snyk - 17%
SonarQube - 6%

TLDR

The SAST tools most teams actually run (Semgrep, Snyk, SonarQube) each found under 1 in 5 real vulnerabilities. SonarQube found about 1 in 16. A general-purpose LLM with zero security training, just dropped into an agent loop, found roughly 3x more than the dedicated scanners.

Why? Pattern matchers only catch what matches a known signature, and most real bugs (broken access control, auth that breaks across files, logic flaws) are not a pattern. They are the code not meaning what the author thought it meant.

One result that stuck out: Grok 4.20 had the best precision of anything tested (93%, it basically never cried wolf) but only 26% recall. So you can be extremely precise and still miss three quarters of the bugs. A clean report does not mean secure code.
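For readers who want the arithmetic behind "precise but still missing most bugs", here is a small sketch of how precision and recall are computed. The counts are illustrative, chosen only so they reproduce roughly 93% precision and 26% recall against 676 labelled vulnerabilities. They are not the benchmark's raw data.

```python
def precision(true_positives: int, false_positives: int) -> float:
    # Of everything the tool flagged, how much was a real vulnerability?
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    # Of all real vulnerabilities, how many did the tool find?
    return true_positives / (true_positives + false_negatives)


# Illustrative counts (assumed, not taken from the published results).
TOTAL_VULNS = 676
found = 176
false_alarms = 13
missed = TOTAL_VULNS - found

if __name__ == "__main__":
    print(f"precision: {precision(found, false_alarms):.2f}")
    print(f"recall:    {recall(found, missed):.2f}")
```

Precision only looks at what the tool reported, so a short, accurate report scores well even when most of the real bugs never appear in it.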

Full leaderboard, methodology and the raw data: https://realvuln.com

What actually happens to your code when Kolega.dev reads your repo

Muhammad Hasan — Mon, 08 Jun 2026 17:00:00 +0000

If you are even slightly security minded, handing any tool read access to your entire private codebase should make you a bit twitchy. It should make you more twitchy when the tool has "AI" in the pitch, because the unspoken fear is simple: great, so my proprietary code becomes someone's training data.

That is a healthy instinct, and any code scanner worth using should be able to answer it without hand waving. So here is the straight version of how we handle it at Kolega, no marketing.

We do not store your code

Every scan runs in a fresh, isolated container. Each repo is cloned in, the scan runs in one to three minutes, and then the container and everything in it is destroyed.

Connect via OAuth (read only)
Repo cloned into a fresh isolated container
Semantic scan runs (1 to 3 min)
Findings extracted, sensitive data masked
Container destroyed, code wiped

What we keep is the findings: severity, file path, line number, fix suggestion. Not the source. The practical upshot matters more than it sounds. If we got breached tomorrow, your code is not in the blast radius, because it is not sitting on our infrastructure to steal. That is a design choice, not a pinky promise. You cannot leak what you do not store.

The specifics people actually ask about

OAuth is read only by default, and we do not sit on long lived access tokens.
We do not train models on your code. Not now, not quietly later. It is used for the scan you asked for and nothing else.
If even that is too much, enterprise can run a self hosted runner entirely inside your own VPC. The engine scans on your hardware, results stay where you put them, and nothing about your code reaches us at all.
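The "keep the findings, not the source" lifecycle described earlier can be sketched in a few lines of Python. This is a toy under loud assumptions, not Kolega's implementation: a temporary directory stands in for the isolated container, a single regex stands in for the semantic scan, and `Finding`, `toy_scan` and `scan_and_discard` are hypothetical names. The masking step is not shown, because the findings here carry no source text at all.

```python
import re
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Stand-in detector: flags hardcoded secret-looking assignments.
SECRET_ASSIGNMENT = re.compile(r"(?i)\b(api_key|secret|token)\s*=\s*[\"'][^\"']+[\"']")


@dataclass(frozen=True)
class Finding:
    # Only metadata is kept: no source lines, no secret values.
    severity: str
    file_path: str
    line_number: int
    fix_suggestion: str


def toy_scan(root: Path) -> list[Finding]:
    findings = []
    for path in sorted(root.rglob("*.py")):
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if SECRET_ASSIGNMENT.search(line):
                findings.append(Finding(
                    "high",
                    str(path.relative_to(root)),
                    number,
                    "Load this value from the environment instead of hardcoding it.",
                ))
    return findings


def scan_and_discard(files: dict[str, str]) -> tuple[list[Finding], Path]:
    # The working copy lives only inside the `with` block and is deleted on exit.
    with tempfile.TemporaryDirectory() as workdir:
        root = Path(workdir)
        for name, content in files.items():
            (root / name).write_text(content)
        findings = toy_scan(root)
    return findings, root
```

After `scan_and_discard` returns, the path it hands back no longer exists on disk, and the findings list is the only thing that survived the scan.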

The part that is not finished yet

Being straight about it: SOC 2 Type II and ISO 27001 are in progress, not done. We run the operational controls those frameworks require today, but the certificates are not on the wall yet, and I would rather say that than badge something we have not earned. If you are in procurement and need the current security overview to fill out a questionnaire, a human will send it back same day.

Full breakdown of the scan lifecycle is here: https://kolega.dev/trust/

SonarQube vs Kolega: why a code quality tool keeps getting sold as a security tool

Muhammad Hasan — Mon, 08 Jun 2026 10:42:31 +0000

SonarQube comes up in these comparisons a lot, which is a bit odd when you remember what it actually is. It is a code quality tool. A really good one. It just wandered into the security aisle at some point and never left.

Where Sonar is good

Credit first. If you want to track code smells, complexity, duplication, maintainability, and test coverage trends over time, Sonar is excellent and has been for years. Teams that care about keeping a big codebase clean get real value out of it. That is its home turf and it is genuinely strong there.

Where the security framing falls apart

The problem is the security framing. SonarQube's vuln detection is bolted onto a quality engine, and it shows. It is pattern and rule based like the rest, so it inherits the same ceiling, but it is also tuned for "is this code tidy" rather than "can someone exploit this."

So you get a pile of maintainability findings dressed up next to a handful of shallow security ones, and the actual exploitable stuff sails straight through:

Logic flaws
Auth that breaks across multiple files
Injection that only shows up second order

It was never built to find those. Nobody should be surprised it does not.

We do not have to argue it

RealVuln is our open benchmark: 676 real vulnerabilities across 26 production repositories, plus 120 false positive traps built in to catch tools that flag everything to inflate recall.

RealVuln
- 676 real vulnerabilities
- 26 production repos
- 120 false positive traps
- Sonar score: ~6 to 7%
- fully open source

Sonar lands at the bottom, around 6 to 7 percent. That is not us cherry picking. The whole thing is open source and you can run Sonar against it yourself. We published it so the numbers do the talking instead of the marketing.

So which one

This is not "Sonar bad." Sonar is a good tool aimed at a different job. Keep it for code quality if that is what your team uses it for. Just do not let "we run SonarQube" be the thing you tell your customers when they ask if your code is secure, because those are two different questions and Sonar only answers one of them.

Full breakdown and the benchmark: https://kolega.dev/compare/sonarqube/

We built 24 apps with AI. Three platforms. 561 vulnerabilities.

Muhammad Hasan — Fri, 29 May 2026 09:37:12 +0000

The experiment

Most of what's now being built on top of AI gets called vibe coding. Type what you want, hit enter, watch a working app appear thirty seconds later. Lovable, Replit, Manus, Bolt, V0, every team we know is using one of them or trying to. We've been using them at Kolega too, partly because they're genuinely useful and partly because we wanted to know what was actually in the output.

So we ran the experiment properly.

Eight app categories. Three platforms. Same brief on each platform, every time. Password manager. CRM. Property management. LMS. Healthcare clinic. Loan origination. Legal case management. HR. That's twenty-four codebases in total. We pushed every one to GitHub and pointed Kolega's scanner at it.

One thing we did differently to most "AI security" posts: we changed nothing. Default settings on every platform. Default templates. Default backends. No "make this secure," no "add input validation," no "review the auth flow." We did what a builder does when they sit down to ship something on a Tuesday afternoon. That's the only fair test, because it's the only test that matches what's actually shipping to production every day.

Here's what came back.

The matrix

Every app we built. Every finding. Default settings on every platform, no manual hardening before scanning.

Five hundred and sixty-one vulnerabilities. Three hundred of them critical or high. Across twenty-four apps that, on every platform's own marketing site, were called "production-ready."

Zero of them came with fixes.

What the data actually says

Three things stand out, and the headline is the least interesting of them.

The headline is "AI builds insecure apps," which everyone already suspected. The quieter findings are the ones that change how you should think about this.

Almost nothing came back clean. Out of twenty-four builds, exactly one scanned with zero findings. A Lovable LMS. The other twenty-three shipped with somewhere between six and forty-six findings each. If the question is "will my vibe-coded app have vulnerabilities," the answer is "yes, with 96% certainty, based on this sample." Clean output from a generator is the exception. Not the rule. Not close to the rule.

Total findings hide the real story. Replit and Manus look similar on totals: 26.1 and 29.9 vulnerabilities per app on average. Lovable averages 14.1, which makes Lovable look like the clear winner. But on criticals, the ones that actually breach you, the picture changes. Lovable averages 2 criticals per app. Replit averages 3. Manus averages 5. Across eight builds each, that's sixteen, twenty-four, and forty critical vulnerabilities respectively. Replit and Manus "look similar" on totals, yet Manus ships roughly two-thirds more breach-grade defects than Replit (forty against twenty-four). If you only look at totals, you'd never know.

No platform wins everything. Lovable wins six of eight categories outright. Replit takes the password manager (twelve findings, zero critical, against Manus's forty-three with nine critical on the same brief). Manus wins healthcare (twelve findings, against Lovable's thirty-nine on the same brief). The lazy version of this story is "use Lovable, avoid Manus." The honest version is that each platform has categories it's stronger at, categories it's catastrophic at, and you don't know which until you scan it. The same brief on the same day, generated by three different models, can produce a fifteen-finding app and a forty-five-finding app.

Why this happens, and it's not because the models are bad

It's tempting to look at 561 vulnerabilities and conclude that AI builders are dangerous, and you should go write your own code. That's the wrong lesson. The right lesson is that these tools were built with different goals than the ones we keep judging them by.

Default settings on every one of these platforms are optimised for "something working in front of the user." That means permissive CORS so the preview pane renders. Generous database access so the demo doesn't 500. Hardcoded API keys so the build doesn't require an env-var lecture before the user sees output. Every one of those defaults makes the "hit enter, watch it work" experience smooth. Every one of them is also a vulnerability when you ship it.
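Here is a rough Python sketch of what tightening two of those defaults can look like. It is framework-agnostic and entirely illustrative: the origin, the `PAYMENTS_API_KEY` variable name and the helper functions are assumptions for the example, not output from any of the platforms tested.

```python
import os
from collections.abc import Mapping

# Hypothetical allowlist; a real one would list your own front-end origins.
ALLOWED_ORIGINS = frozenset({"https://app.example.com"})


def cors_headers_permissive(origin: str) -> dict[str, str]:
    # The demo-friendly default: every origin is allowed.
    return {"Access-Control-Allow-Origin": "*"}


def cors_headers_strict(origin: str, allowed: frozenset[str] = ALLOWED_ORIGINS) -> dict[str, str]:
    # Only echo back origins that are explicitly allowlisted.
    if origin in allowed:
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    return {}


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    # Read the key from the environment and fail loudly if it is missing,
    # instead of shipping a hardcoded value in the source.
    key = env.get("PAYMENTS_API_KEY")
    if not key:
        raise RuntimeError("PAYMENTS_API_KEY is not set")
    return key
```

The strict versions are a little more annoying in a preview pane, which is exactly why generators do not pick them by default.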

The training data doesn't help. The corpus these models learn from is overwhelmingly tutorials, Stack Overflow answers, and example code from documentation. Tutorials skip auth checks for clarity. Stack Overflow answers fix the specific bug being asked about and ignore everything else. Example code from documentation is built to show you the feature, not secure the feature. The training set is the world's largest collection of "this works in a tutorial" code, and that's exactly the code that comes out.

And none of the platforms we tested run a meaningful security pass before they hand you the output. Lovable has a Security Checker, which is one reason their numbers came back better. Replit and Manus don't, in any visible way. None of them have anything that reads data flow across handlers, which is the only way to catch the breach-grade bugs that actually matter.

So what you get is generators that ship working code with the security posture of a sample app. Which is fine if you're prototyping. Less fine if you're shipping the thing you're going to sell to customers.

We adopted new tools for writing code. We didn't adopt new tools for maintaining it.

This is the part that matters more than any single platform comparison.

The thing about vibe coding isn't that the code is bad. The code is fine. The code works. The code probably looks better than what most of us would have written, in the time we would have written it. The thing about vibe coding is that it shifts where the work goes. Less time typing. More time owning what you didn't type.

That shift has caught most teams flat-footed. The toolchain for writing code has changed completely in the last two years. The toolchain for maintaining code has barely budged. Most teams are still using the same scanners, the same review checklists, the same QA processes they used in 2022, applied to a codebase that's now sixty percent generated and growing.

It doesn't work. Old SAST tools were tuned for the bugs humans write. They flag the missing input check, the SQL injection, the unsafe deserialization, the SSRF. They don't flag the BOLA bug we wrote about last month, where the auth check is present but the query doesn't filter by the authenticated user. They don't flag the hardcoded Supabase key with full database access embedded in a frontend bundle. They don't flag the missing rate limit on the password reset endpoint that lets a researcher with a free account read everyone's source code. The bugs AI generates aren't the same shape as the bugs humans generate, and the tools designed to catch the latter aren't catching the former.
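The BOLA shape is easier to see in code. Below is a minimal sketch using an in-memory SQLite database. The table, the column names and both handler functions are hypothetical, written only to show an auth check that is present while the query still ignores who is asking.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, owner_id INTEGER, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [(1, 10, 500), (2, 20, 900)],
    )
    return conn


def get_invoice_bola(conn, current_user_id, invoice_id):
    # The auth check is here: anonymous callers are rejected.
    if current_user_id is None:
        raise PermissionError("login required")
    # But the query never mentions the caller, so any logged-in user
    # can read any invoice by guessing its id.
    return conn.execute(
        "SELECT id, owner_id, amount FROM invoices WHERE id = ?",
        (invoice_id,),
    ).fetchone()


def get_invoice_scoped(conn, current_user_id, invoice_id):
    if current_user_id is None:
        raise PermissionError("login required")
    # Scoping the query to the authenticated user closes the hole.
    return conn.execute(
        "SELECT id, owner_id, amount FROM invoices WHERE id = ? AND owner_id = ?",
        (invoice_id, current_user_id),
    ).fetchone()
```

Both functions use parameterised queries and both check for a logged-in user, so neither contains the classic patterns. The difference only shows up when you follow the user id from the check into the query.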

That mismatch has a name. We've used it before. We called it Control Drift: the space between what your team can ship and what your team can govern. Vibe coding makes that gap wider every week, and the tools most teams rely on aren't closing it.

The fix is the same one our last post pointed at, applied at a different layer. Read the data flow, not the syntax. Run that read on every PR, not once a quarter. Catch the bug at the place it's written, not the place a security researcher finds it after forty-eight days.

What you should actually do

Three things, in order.

Scan whatever you ship. Not your repo's main branch once a month. Every PR, every commit. The cost of catching a bug at the PR is roughly zero. The cost of catching it after a researcher posts a thread on X is whatever your company is worth.

Stop trusting platform defaults. If your stack inherits CORS settings, database permissions, or auth scopes from a generator's template, treat them as starting points and tighten them. The default exists because it makes the demo work. Your production isn't a demo.

Treat AI output the way you treat a junior dev's first PR. The code might be great. It also might have shipped a BOLA, a missing rate limit, and an exposed admin endpoint, and the model that wrote it has no idea which. Review the structure, not just the function.

Where this ends

We're in the build-fast era. That's not changing, and we're not asking it to. Building fast is solved. Twenty-four working apps in a week across three different platforms would not have been possible at all in 2023. We built and tested all of them in under a month. The acceleration is real and useful and we're not giving it back.

Maintaining what you build at that pace, though, is where we still haven't caught up. The tooling hasn't shifted. The reviews haven't shifted. The mental model that "the model wrote it so it's probably fine" is doing a lot of quiet damage to a lot of codebases, and the bill comes due when somebody scans the repo and finds out what's in there.

That's Control Drift, again. And it's going to keep happening to AI-native teams until the industry stops pretending the scanner that worked in 2022 is going to catch the bugs that ship in 2026.

We built Kolega for the maintenance half of the problem. Not because nobody else can write a scanner, but because the scanners that exist were designed for a kind of code most of us don't really write anymore. The 561 vulnerabilities in this study didn't come from "AI is dangerous." They came from "we sped up the writing and forgot to upgrade the reviewing."

If you want to know what's actually in the AI-generated code you've shipped, scan it. We do that. It takes about three minutes.

kolega.dev — semantic analysis on every PR.

What "merge-ready" actually requires when an AI writes the security fix

Muhammad Hasan — Wed, 06 May 2026 09:31:53 +0000

Most code-scanning tools stop at "we found a vulnerability." That's the easy part.
The hard part — the part nobody talks about until they try to ship it — is everything that happens between "vulnerability detected" and "PR a maintainer will actually merge." Tests passing. Style matching. The fix actually fixing the thing. The fix not breaking anything else. A PR description a maintainer can verify in 30 seconds.
We work on this problem at Kolega, and we want to walk through what's in that gap honestly — including the parts we got wrong.
The five-stage pipeline
Every auto-generated fix in our system goes through five stages:

Detect with context — find the vuln, but also understand the code around it
Generate a candidate fix — LLM-assisted, but heavily constrained
Validate correctness — does it compile, do tests pass, is the vuln actually gone
Match the project's style — formatting, naming, patterns
Decide whether to ship it at all — some fixes shouldn't be automated

If any stage fails, the PR doesn't get opened. We'd rather open zero PRs than one bad one. Maintainers will block a bot after a week of bad PRs; they will not block one for staying quiet.
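The "any stage fails, no PR" rule can be sketched as a simple fail-closed runner. This is a minimal illustration, not the production pipeline: `run_pipeline`, `should_open_pr` and the shape of the shared `state` dict are assumptions made for the example.

```python
from typing import Callable, Optional

# A stage inspects (and may update) shared state and reports pass/fail.
Stage = Callable[[dict], bool]


def run_pipeline(stages: list[tuple[str, Stage]], state: dict) -> Optional[str]:
    """Run stages in order; return the name of the first failure, or None."""
    for name, stage in stages:
        if not stage(state):
            return name
    return None


def should_open_pr(stages: list[tuple[str, Stage]], state: dict) -> bool:
    # Fail closed: a PR is only opened when every stage passed.
    return run_pipeline(stages, state) is None
```

Returning the name of the failed stage, rather than a bare boolean, makes it possible to tell a human why no PR was opened.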
Stage 1: Detection with context
Static analysis tools will tell you "user input flows into eval() on line 47." That's true. It's also basically useless on its own, because it doesn't tell you:

What eval() is being used for
Whether the input is already sanitised upstream
What the function's contract is (does it need to handle strings, or always JSON?)
Whether replacing eval() with JSON.parse() would break legitimate callers

Without this context, an LLM asked to "fix this" will generate something that compiles, looks reasonable, and is wrong.
Our detection layer pulls:

The vulnerable function and its callers (one or two hops out)
Type information where available
Existing tests that exercise the function
Recent git history for the file (recently-touched code is more fragile)
The project's dependencies and their versions

This context is what gets passed to generation. Detection isn't "where's the bug" — it's "here's everything someone reviewing a fix would need."
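One way to picture that bundle of context is as a single record handed from detection to generation. The dataclass below is a hypothetical sketch: the field names mirror the list above, but the structure and the `missing_sections` helper are assumptions, not the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class FixContext:
    vulnerable_function: str                                      # source of the flagged function
    callers: list[str] = field(default_factory=list)              # one or two hops out
    type_hints: dict[str, str] = field(default_factory=dict)      # where available
    covering_tests: list[str] = field(default_factory=list)       # tests that exercise the function
    recent_commits: list[str] = field(default_factory=list)       # recent git history for the file
    dependencies: dict[str, str] = field(default_factory=dict)    # package name -> version

    def missing_sections(self) -> list[str]:
        """Names of the context sections that came back empty."""
        sections = {
            "callers": self.callers,
            "type_hints": self.type_hints,
            "covering_tests": self.covering_tests,
            "recent_commits": self.recent_commits,
            "dependencies": self.dependencies,
        }
        return [name for name, value in sections.items() if not value]
```

Knowing which sections are empty matters as much as the ones that are filled: a fix generated with no covering tests and no callers deserves less trust.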
Stage 2: Generating the fix
We use LLMs for fix generation. We do not let them generate freely.
The constraints we apply:
Scope locking. The model is only allowed to modify a small, specified region of the file. If a fix would require changes outside that region, we surface it for human review instead of auto-generating.
Pattern catalogues. For common vulnerability classes — SQL injection, prototype pollution, hardcoded secrets, missing auth checks — we have known-good fix patterns. The model picks and adapts a pattern rather than inventing one. This dramatically reduces hallucinated "fixes" that don't actually fix anything.
Explanation alongside code. The model has to produce a structured explanation of why the change works, in a format we can validate against the original CVE/CWE. Forcing the model to articulate its reasoning catches a lot of confidently-wrong outputs.
The thing we learned the hard way: if the model can't explain its fix in terms of the vulnerability class, the fix is usually wrong. "I added a check" isn't an explanation. "I added a check that ensures __proto__ cannot be assigned via this code path, closing the prototype pollution vector identified in CWE-1321" is.
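Scope locking, the first constraint above, is simple to check mechanically. Here is a minimal sketch using Python's standard `difflib`; the function name and the convention that the allowed region is given as 1-indexed, inclusive line numbers in the original file are assumptions for the example.

```python
from difflib import SequenceMatcher


def within_scope(original: list[str], patched: list[str], start: int, end: int) -> bool:
    """True if every change touches only original lines start..end (1-indexed, inclusive)."""
    matcher = SequenceMatcher(a=original, b=patched, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # i1:i2 is the half-open, 0-indexed span of original lines affected.
        # Pure insertions have i1 == i2 and must sit inside the region too.
        if i1 < start - 1 or i2 > end:
            return False
    return True
```

A candidate that fails this check would not be rejected as wrong, only routed to a human, since the fix may legitimately need changes outside the locked region.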
Stage 3: Validation
Generation produces a candidate. Validation decides if it's actually mergeable. We run three layers:
Layer 1 — does it compile / parse? Sounds trivial; isn't, especially in dynamic languages where syntactic correctness doesn't catch broken imports or undefined references.
Layer 2 — do existing tests pass? The fix has to leave the existing test suite green. This catches a huge class of "fix introduces regression" failures. If the project has no tests, we treat that as a signal to be more conservative, not less.
Layer 3 — is the original vulnerability actually gone? We re-run the detection step against the fixed code. If the same finding still fires, the "fix" didn't fix it. This sounds obvious, but it's a step a lot of pipelines skip — and it's the difference between security theatre and an actual fix.
If a candidate fails any layer, we either regenerate (passing the failure back as additional context) or escalate to human review. We cap regenerations at three. After that, the problem isn't a bad model output — it's a fix that requires judgement we don't have.
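The three layers and the regeneration cap fit in a short loop. The sketch below is an assumption-heavy illustration: it uses `ast.parse` as the parse check (so it only applies to Python candidates), treats the test run and the re-scan as callables supplied by the caller, and reads "cap regenerations at three" as one initial attempt plus up to three retries.

```python
import ast
from typing import Callable, Optional

MAX_REGENERATIONS = 3


def validate(candidate: str,
             tests_pass: Callable[[str], bool],
             finding_still_fires: Callable[[str], bool]) -> Optional[str]:
    """Return None if the candidate passes all three layers, else the failure reason."""
    try:
        ast.parse(candidate)                      # Layer 1: does it parse?
    except SyntaxError as exc:
        return f"parse error: {exc.msg}"
    if not tests_pass(candidate):                 # Layer 2: existing tests stay green
        return "existing tests failed"
    if finding_still_fires(candidate):            # Layer 3: re-run detection
        return "original finding still fires"
    return None


def fix_with_retries(generate: Callable[[Optional[str]], str],
                     tests_pass: Callable[[str], bool],
                     finding_still_fires: Callable[[str], bool]) -> Optional[str]:
    """Return a validated candidate, or None to signal escalation to human review."""
    failure: Optional[str] = None
    for _ in range(1 + MAX_REGENERATIONS):
        # The previous failure reason is passed back as extra context.
        candidate = generate(failure)
        failure = validate(candidate, tests_pass, finding_still_fires)
        if failure is None:
            return candidate
    return None
```

Note that nothing in the loop asks the generator whether its own output is correct. Every verdict comes from something external to it.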
Stage 4: Matching project style
A correctly-functioning fix that's formatted wrong, named wrong, or imported wrong will get closed without comment. Maintainers can smell bot PRs in two seconds.
Things we match:

Indentation, quote style, semicolons (run the project's formatter before opening the PR)
Naming conventions (camelCase vs snake_case, prefix patterns)
Import style (relative vs absolute, grouping)
Comment style (do they write JSDoc? Do they write any comments at all?)
Commit message format (Conventional Commits? Specific prefixes?)

This is unglamorous work, and it's where a lot of automated tools fail. You don't get a second chance at first-PR impression with a maintainer.
Stage 5: Knowing when not to ship
Some fixes shouldn't be automated. Examples from our own ruleset:

Auth and authz logic. A fix that changes who can access what needs human eyes. Always.
Cryptographic primitives. Swapping algorithms or key sizes can have downstream consequences a pipeline can't see.
Code touched in the last 7 days. Active development means context we don't have.
Repos with no tests. We can't validate the fix doesn't break anything, so we surface findings without auto-PRs.
Findings below a confidence threshold. Better to flag and let a human triage.

The single biggest credibility lever for a tool like this is how often it shuts up when it doesn't know. Every false positive PR costs trust. Every "we found this but didn't auto-fix it because [specific reason]" message builds trust.
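Rules like these translate naturally into a gate that returns both a decision and a reason. The sketch below is hypothetical: the category labels, the `Candidate` fields and especially the `MIN_CONFIDENCE` value are assumptions, since the actual threshold is not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SENSITIVE_CATEGORIES = {"auth", "authz", "crypto"}
RECENT_WINDOW = timedelta(days=7)
MIN_CONFIDENCE = 0.8  # hypothetical value, chosen only for illustration


@dataclass(frozen=True)
class Candidate:
    category: str            # e.g. "sqli", "auth", "crypto"
    last_touched: datetime   # last commit time for the affected code
    repo_has_tests: bool
    confidence: float        # detector confidence between 0 and 1


def auto_fix_decision(c: Candidate, now: datetime) -> tuple[bool, str]:
    """Decide whether to open an automated PR, with a reason a human can read."""
    if c.category in SENSITIVE_CATEGORIES:
        return False, f"{c.category} changes need human review"
    if now - c.last_touched < RECENT_WINDOW:
        return False, "code was touched in the last 7 days"
    if not c.repo_has_tests:
        return False, "repo has no tests to validate against"
    if c.confidence < MIN_CONFIDENCE:
        return False, "finding is below the confidence threshold"
    return True, "eligible for an automated PR"
```

The reason string is the useful part: it is what turns a silent skip into the "we found this but didn't auto-fix it because..." message described above.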
What we got wrong
A few things, in case it's useful:
We over-trusted model self-validation. We asked the model "is this fix correct?" and weighted its answer. It said yes too often. Switching to external validation (run the tests, re-run the scanner) was the single biggest quality jump we made.
We let the model write PR descriptions freely. They were verbose and sometimes inaccurate. Now descriptions are templated, with the model filling specific slots: vulnerability class, file/function affected, fix pattern applied, validation results. Boring, but verifiable.
We didn't track merge rate by pattern. Once we did, we found two of our pattern catalogues were producing fixes that maintainers rejected for style reasons we hadn't noticed. Boring data work, big quality gain.
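The templated PR description mentioned above can be as plain as a fixed template with required slots. This sketch uses the standard library's `string.Template`; the slot names follow the four listed earlier, while the exact wording of the template and the `render_description` helper are assumptions for illustration.

```python
from string import Template

PR_TEMPLATE = Template(
    "Vulnerability class: $vulnerability_class\n"
    "Affected: $location\n"
    "Fix pattern: $fix_pattern\n"
    "Validation: $validation_results\n"
)

REQUIRED_SLOTS = ("vulnerability_class", "location", "fix_pattern", "validation_results")


def render_description(slots: dict[str, str]) -> str:
    # Refuse to render if the model left any slot empty or missing.
    missing = [name for name in REQUIRED_SLOTS if not slots.get(name)]
    if missing:
        raise ValueError(f"missing slots: {', '.join(missing)}")
    # Extra keys are dropped, so the model cannot smuggle in free text.
    return PR_TEMPLATE.substitute({name: slots[name] for name in REQUIRED_SLOTS})
```

Because the model only fills slots, every line of the description maps to something a maintainer can check in seconds.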
If you're building something similar
Three things, if you're working on any kind of automated code-modification pipeline:

Detection without context is a trap. Spend more time on context than on detection.
Constrain your model. Free-form generation is fine for prototypes. Production needs scope locks, pattern catalogues, and structured outputs.
External validation beats self-validation. The model can't reliably grade its own work. Run the tests. Re-run the scanner. Don't ask it.

Happy to go deeper on any of these in a follow-up — drop a comment if there's a stage you'd want more detail on.
