Originally published on the AstroLexis blog. Cross-posted here for the community.
Every modern SAST tool — Snyk, SonarQube Cloud, GitHub Advanced Security, Semgrep AppSec Platform — asks the same thing: ship your source code to us, we'll tell you what's wrong with it. For a non-trivial number of teams, that's a non-starter. Here's how we built KCode, the static analysis tool that runs the LLM verifier on your own hardware, and what we learned about getting machine-grade precision out of a local model.
The day SAST became my problem
I'm Bruno, founder of AstroLexis. About a year before we started building KCode, I was the only engineer on a codebase whose source couldn't be uploaded anywhere. The reasons were the usual mix: enterprise customers with NDAs that explicitly forbade third-party SaaS code scanning, defense-adjacent contracts, jurisdictional restrictions that made any non-EU data residency a paperwork nightmare. The work was real, the policies were real, and the tooling we needed wasn't.
The market for "static analysis you can actually deploy on-prem" turned out to be remarkably bad. Snyk, SonarQube Cloud, and GitHub Advanced Security are SaaS-first. The on-prem versions exist but are priced for the Fortune 500 and ship with the kind of installation playbook that needs a dedicated DevSecOps engineer to maintain. Semgrep has an open-source core, which is great, but the rule set that catches real bugs lives in their commercial platform. Local linters (ESLint, Pylint, Bandit, gosec) catch surface-level issues but miss anything that requires reasoning across files or distinguishing between "this looks scary" and "this is actually exploitable."
And then LLMs arrived and complicated everything. Suddenly you could ask Claude or GPT-4 about a file and get genuinely insightful security analysis. The catch: that file just went to someone else's datacenter. For the work I was doing, that wasn't a tradeoff — it was a deal-breaker.
So we built the tool we needed.
What KCode actually does
The architecture is intentionally boring:
- Deterministic pre-filter. 414 hand-curated patterns across 20+ languages (C, C++, Rust, Go, Python, TypeScript, JavaScript, Java, Kotlin, Swift, Ruby, PHP, Bash, SQL, YAML, HCL, and more). 372 of them are regex, 27 are AST-based for the rules that need structural awareness (control flow, taint, scope). The patterns generate candidates: files and line ranges that look like they might be a problem.
- Local LLM verifier. The candidates get fed to a local LLM (we recommend a 24GB+ GPU running a 30B-parameter model in 4-bit quantization). The model's job is to confirm or reject: "is this candidate actually exploitable given the surrounding code, or is it a false positive?" The verifier sees only the relevant code snippets — it doesn't need the whole repo in context.
- Output. SARIF format for CI integration, Markdown reports for humans, optional PDF for stakeholders.
That's it. Two stages, deterministic plus probabilistic. The cleverness is in the patterns and in how we prompt the verifier — not in trying to make the LLM do everything from scratch.
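To make the two-stage shape concrete, here is a minimal sketch. The two rules, the `Candidate` type, and the `verify` callback are hypothetical stand-ins for illustration, not KCode's real patterns or API; in practice `verify` would call the local model.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    rule_id: str
    path: str
    line: int
    text: str


# Hypothetical rules for illustration only; not KCode's pattern set.
RULES = {
    "c-unbounded-strcpy": re.compile(r"\bstrcpy\s*\("),
    "py-sql-fstring": re.compile(r"execute\(\s*f['\"]"),
}


def prefilter(path: str, source: str) -> list[Candidate]:
    """Stage 1: deterministic scan that flags every line matching a rule."""
    found = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        for rule_id, pattern in RULES.items():
            if pattern.search(text):
                found.append(Candidate(rule_id, path, lineno, text.strip()))
    return found


def audit(files: dict[str, str], verify: Callable[[Candidate], bool]) -> list[Candidate]:
    """Stage 2: keep only the candidates the verifier confirms."""
    return [
        candidate
        for path, source in files.items()
        for candidate in prefilter(path, source)
        if verify(candidate)
    ]
```

The point of the split is that stage 1 is cheap and repeatable, so the expensive, probabilistic step only ever sees a short list of candidates.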
Benchmarks on the SAST validation suite:
- 100% precision
- 92.3% recall
- F1 score: 0.96
- 414 hand-curated patterns across 20+ languages
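The F1 figure follows directly from the precision and recall above, since F1 is their harmonic mean. A quick check, using only the numbers quoted in the list:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 100% precision and 92.3% recall, as reported above.
print(round(f1_score(1.0, 0.923), 2))  # 0.96
```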
Why the architecture matters
People who haven't shipped a SAST tool tend to underestimate how much of the difficulty is false positive management. A scanner that finds 500 issues, of which 30 are real, doesn't actually help anyone. Developers stop opening the report after the third Tuesday. The signal-to-noise ratio kills adoption faster than missed bugs do.
This is where the local LLM earns its keep. Regex and AST patterns can identify shape — "this function calls strcpy with a user-controlled buffer", "this SQL string interpolates a variable" — but they can't reason about context. Does the buffer get bounded earlier? Is the variable sanitized at the controller layer? Is the entire function only reachable from a test fixture?
The LLM verifier handles exactly that contextual judgment, and it's good at it. In our benchmarks, the verifier rejects roughly 60-75% of the candidates that the deterministic pre-filter raises. The ones that survive are the real findings.
Crucially, the LLM never has to find the bug from scratch. The deterministic pre-filter narrows the search space from "scan a million lines of code" to "evaluate 800 candidates." That makes the inference budget manageable: a full audit of a 500K-line codebase runs in about 10,000 tokens of verifier input, not 300K+. We can run that on a single consumer GPU in minutes.
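One way to keep the verifier input small is to send only a window of lines around each candidate. The sketch below assumes a fixed context radius and a line-numbered snippet format; both are illustrative choices, not a description of how KCode builds its prompts.

```python
def snippet(source: str, line: int, context: int = 5) -> str:
    """Return the flagged line plus `context` lines on each side, with line numbers."""
    lines = source.splitlines()
    start = max(0, line - 1 - context)    # 0-based index of the first line kept
    end = min(len(lines), line + context)  # exclusive end index
    return "\n".join(
        f"{number:>5} | {text}"
        for number, text in enumerate(lines[start:end], start=start + 1)
    )
```

Numbering the lines in the snippet gives the model something concrete to cite, which makes its answer easier to check against the original candidate.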
The benchmark that mattered: NASA IDF
Public benchmarks are great for marketing slides. Real validation comes from running against actual codebases written by people who weren't grading themselves.
We ran KCode against NASA's IDF — a piece of flight-software-adjacent open source. The IDF repo isn't toy code: it's instrumentation infrastructure used in real telemetry pipelines, written in C++ and Python, maintained by people whose job titles include "Senior Software Engineer, Flight Systems".
KCode opened PR #107 against the repo, identifying 28 bugs across the codebase. The breakdown:
- Buffer overflows from unchecked string operations (the C++ classics).
- Missing null checks on pointers returned from allocation paths.
- Integer truncation in size calculations that would silently corrupt under specific input ranges.
- Race conditions in concurrent state mutation that the linter had missed because the relevant globals were declared three files away.
- A handful of Python issues around exception handling that swallowed errors silently.
The NASA team merged the changes. That's the validation that matters: real bugs, in real production-adjacent code, accepted by maintainers who know the codebase.
What we got wrong (and how we fixed it)
The first version of KCode was a mess. The verifier was hallucinating. The pre-filter was over-firing. Our F1 on the validation suite was a depressing 0.71 for months. Three things turned it around:
1. Cascade verification
A single LLM verifier has a measurable false-positive rate. We could either (a) lower the temperature and pray, or (b) chain two verifiers with different model families and only accept findings both confirm. We picked (b). The current production setup runs Grok + Claude Opus in an ensemble: both have to agree the candidate is real before it lands in the report. False positives dropped by 60%. The cost is roughly 2× verifier tokens, which on local hardware costs nothing meaningful.
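The agreement rule itself is simple to express. In this sketch each verifier is a plain callable that takes a snippet and returns True when it considers the candidate real; the callables are placeholders for whatever model clients you actually run.

```python
from typing import Callable, Sequence

Verifier = Callable[[str], bool]


def confirmed_by_all(snippet: str, verifiers: Sequence[Verifier]) -> bool:
    """Accept a candidate only if every verifier independently confirms it.

    Evaluation stops at the first rejection, so later (possibly slower)
    verifiers are skipped for candidates an earlier one already rejected.
    """
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return all(verifier(snippet) for verifier in verifiers)
```

Requiring unanimity trades recall for precision: a real finding that one model misses is dropped, which is the right bias if report noise is what kills adoption.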
2. Output filter for what the prompt rules miss
The LLM verifier will occasionally produce output that looks like a valid finding but is structurally malformed for SARIF — wrong line numbers, missing severity, weird character escaping. We built a strict output filter that rejects malformed verifier output and re-prompts. This sounds boring; it's actually one of the most load-bearing pieces of the system. Without it, ~3% of findings showed up as garbage. With it, the SARIF output is parseable by every downstream tool we've tried (GitHub Code Scanning, SonarQube import, custom dashboards).
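A strict filter of this kind can be sketched as "parse, validate, retry". The JSON shape here (`line`, `level`, `message`) and the retry limit are assumptions made for the example, not KCode's actual verifier schema; `ask` stands in for the model call.

```python
import json
from typing import Callable, Optional

# Result levels defined by SARIF 2.1.0.
ALLOWED_LEVELS = {"none", "note", "warning", "error"}


def parse_finding(raw: str, file_line_count: int) -> Optional[dict]:
    """Return the finding if it is well-formed, otherwise None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    line = data.get("line")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(line, int) or isinstance(line, bool):
        return None
    if not 1 <= line <= file_line_count:
        return None
    level = data.get("level")
    if not isinstance(level, str) or level not in ALLOWED_LEVELS:
        return None
    message = data.get("message")
    if not isinstance(message, str) or not message.strip():
        return None
    return data


def verify_with_retry(
    ask: Callable[[str], str],
    prompt: str,
    file_line_count: int,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Re-prompt until the output validates or the attempts run out."""
    for _ in range(max_attempts):
        finding = parse_finding(ask(prompt), file_line_count)
        if finding is not None:
            return finding
    return None
```

Returning None after the last attempt, rather than passing through a best guess, keeps malformed output from ever reaching the report.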
3. The "audit your auditor" week
For one full week, we ran KCode against itself and another tool (Inquisitor, our agent QA daemon) against KCode. The goal was to find every silent failure in our own pipeline before customers did. Inquisitor surfaced 8+ silent-failure bugs in the first week: hallucinated tool results that propagated through the pipeline, exit-code-0 hangs that no human or test suite had caught, edge cases where verifier rejection was masked as success. Every one of those is now a test case in our CI.
If you ship developer tooling, audit your auditor. It's the highest-leverage week of QA you can do.
How to install and use it
KCode is distributed as binaries (Linux x64/ARM64, macOS Apple Silicon) and an npm package. Three install paths:
# Option A: one-line install (recommended for local use)
curl -fsSL https://kulvex.ai/kcode/install.sh | sh
# Option B: npm
npm install -g @astrolexisai/kcode
# Option C: GitHub Action (drop into .github/workflows)
- uses: AstrolexisAI/kcode-action@v1
  with:
    target: ./src
    severity: medium
For CI integration, the GitHub Action publishes SARIF to GitHub Code Scanning, which means the findings show up in the Security tab and as inline PR comments. No additional dashboard required.
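If you want a quick look at a SARIF report outside GitHub, the format is plain JSON and easy to summarize. This sketch relies only on the standard SARIF 2.1.0 layout (`runs`, then `results`, each with an optional `level`); treating a missing level as "warning" is a simplification, since the spec lets a rule's default configuration override it.

```python
import json
from collections import Counter


def summarize_sarif(sarif_text: str) -> Counter:
    """Count results per level across all runs in a SARIF log."""
    log = json.loads(sarif_text)
    counts: Counter = Counter()
    for run in log.get("runs", []):
        for result in run.get("results", []):
            # Simplification: assume "warning" when no level is given.
            counts[result.get("level", "warning")] += 1
    return counts
```

For example, `summarize_sarif(open("report.sarif").read())` returns a `Counter` keyed by level, where `report.sarif` is whatever file your scan wrote.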
For local development, kcode scan ./src --verifier-model qwen3.6-heretic runs a full pass and writes the report to stdout. If you have a Mac with 32GB+ unified memory, MLX serves the verifier directly. If you have a GPU server, point KCode at any OpenAI-compatible endpoint serving the model you want.
Free tier is permissive: full feature set, no source-code upload, you bring your own model. Pro at $19/month adds priority pattern updates, the curated weekly verifier model release, and access to the cascade ensemble pre-configured.
The honest part: where we are with revenue
I'm not going to pretend KCode is a runaway hit. Here's where we actually are:
- Revenue: $0, with no confirmed Pro subscribers as of this writing. The free tier has users (actual installs, actual scans, actual SARIF reports landing in CI), but Pro conversion hasn't started.
- Phase 1 goal: 10 paying subs or 2 paid audit engagements. That's the bar we set for "this is a real product."
- What we know works: the technical core. Precision is real, the patterns are good, the verifier doesn't hallucinate, the SARIF output is clean. The bugs we found in NASA's code weren't a one-off.
- What we're testing: whether the buyer who can't ship code to Snyk actually exists in the volume we hope. Our hypothesis is yes — defense, healthcare, EU SaaS, anyone with GDPR data residency, anyone with NDA constraints. We're going to find out over the next two quarters.
I'm sharing this because the indie software world is full of "we're crushing it" posts that don't match the financial reality, and that makes it harder for anyone building something legitimate to talk straight. KCode is a real tool that solves a real problem. We don't yet know if it'll be a business. That's where we are.
Who this is for
If your team is in any of these buckets, KCode is built for you:
- You have source code that contractually cannot leave your infrastructure. Defense, healthcare, financial services with strict residency.
- You run on-prem CI and the SaaS SAST tools don't ship a self-hosted edition you can actually afford.
- You've tried Snyk/SonarQube/GHAS and find the noise level untenable. You want a tool that fires less and lands more.
- You're philosophically opposed to your code training someone else's model. Reasonable position.
- You're a security consultant doing one-off engagements and want a tool that runs on your laptop without phoning home.
If your team is happily on a SaaS SAST and your auditors don't care, KCode is probably not for you. That's fine. We're not trying to displace the SaaS market — we're serving the chunk of it that can't use SaaS at all.
— Bruno Galtranch, founder, AstroLexis LLC. If you're evaluating KCode for your team or want to talk about a paid audit engagement: contact@astrolexis.space.