DEV Community: onfafanutifafa

self_check

onfafanutifafa — Mon, 13 Jul 2026 13:17:18 +0000


How a public AI-app SAST benchmark made our detector 15% better in one afternoon

onfafanutifafa — Mon, 13 Jul 2026 13:08:05 +0000

Two days ago Fafa flagged a problem with our outreach pitch: we tell prospective users getdebug catches AI-app security bugs, but our own benchmark was a handful of one-bug micro-fixtures. Useful for unit testing the detector. Useless for the question a real developer asks: does this catch the bugs my app actually has?

So we built CodeSecBench Tier C — a public corpus of six deliberately-vulnerable AI apps, each ~40 files, each carrying 12–18 labeled bugs across all six AI-app security categories. The truth lives in a separate public repository so scanners never see the labels at scan time. The first two repos are live; this post is the first calibration cycle.

The benchmark, in one paragraph
Six target repos under getdebug-ai/cst-* (cst- = CodeSecBench Tier C), each a different stack: Next.js + Vercel AI SDK, Vite + Express + LangChain.js, SvelteKit + Anthropic, Express + tool-calling agent, FastAPI + Python, CrewAI multi-agent. Total corpus: ~89 vulnerable rows + 33 safe near-misses + 27 borderline cases = 149 labeled lines. Each repo has a known-safe.ts hallucination control file — any scanner finding inside it is a guaranteed false positive. The truth lives at getdebug-ai/codesecbench-truth; the "don't peek" norm is documented in the README, same as every honest public benchmark.

File paths in each repo are randomized — domain-appropriate, not template-matched. A vendor allowlisting lib/user.ts won't generalize from repo #1 to repo #2 (server/services/personalization.ts). The benchmark measures detection skill, not memorization.

Running getdebug 0.5.1 against the first two repos
We ran getdebug analyze . --quiet --json against the two completed repos and scored against the truth file using a span+tolerance JOIN scorer (any finding whose line span overlaps a truth row's span, ±5 lines, credits the row).
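To make the scoring rule concrete, here is a minimal sketch of a span+tolerance JOIN. It is not the scorer that ships in the truth repository: the Span type, the field names, and the sample rows are assumptions made for illustration. Only the ±5 tolerance and the "any overlapping finding credits the row" rule come from the description above.

```python
from dataclasses import dataclass

TOLERANCE = 5  # lines of slack on either side of a truth span


@dataclass(frozen=True)
class Span:
    file: str
    start: int
    end: int  # inclusive


def overlaps(finding: Span, truth: Span, tol: int = TOLERANCE) -> bool:
    """True when the finding touches the truth span widened by tol lines."""
    if finding.file != truth.file:
        return False
    return finding.start <= truth.end + tol and finding.end >= truth.start - tol


def score(findings: list[Span], truth_rows: list[Span]) -> dict:
    """Credit each truth row at most once if any finding overlaps it."""
    hit = [row for row in truth_rows if any(overlaps(f, row) for f in findings)]
    recall = len(hit) / len(truth_rows) if truth_rows else 0.0
    return {"credited": len(hit), "total": len(truth_rows), "recall": recall}


# Hypothetical rows. A finding 6 lines from a single-line label earns no
# credit; a finding 4 lines past the end of a 3-line span does.
truth = [Span("app/chat.ts", 48, 48), Span("app/tools.ts", 10, 12)]
found = [Span("app/chat.ts", 42, 42), Span("app/tools.ts", 16, 16)]
result = score(found, truth)
print(result)
```

Crediting rows rather than counting findings means a noisy scanner cannot inflate recall by reporting the same bug several times.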

23% recall. Far short of where we'd need to be for an outreach pitch. But the data is useful — the misses cluster. Both repos missed the same canonical CWE patterns:

Shell & SQL injection via args.X: the canonical SDK tool-callable shape. Repo #1's execAsync(args.command) and repo #2's sql.unsafe(args.query) are both classic CWE-78 / CWE-89 sinks. The detector's existing regex only matched exec(tool.input.X), the SDK's typed-tool-ref form. Real code uses args.X, where args is the typed function parameter.
API key returned in JSON response body: the second-most-common Next.js / Express leak after NEXT_PUBLIC_. Pattern is Response.json({apiKey: process.env.X_API_KEY}). The detector had no rule for this shape at all.

The fix: two new regexes, sixty minutes
Both gaps are regex-detectable. Pattern A, the args.X form, needed only to add args as a valid identifier prefix alongside tool, block, toolUse, etc., plus extend the sink list to include SQL: sql.unsafe, db.unsafe, pool.unsafe, db.prepare:

var unsafeToolOutputArgsRe = regexp.MustCompile(
	`\b(?:exec|execSync|execAsync|spawn|spawnSync|eval|run|runCommand|` +
		`runSync|sql\.unsafe|db\.unsafe|db\.query|pool\.unsafe|pool\.query|` +
		`client\.unsafe|client\.query|db\.prepare)` +
		`\s*\(\s*[^)]{0,160}?\bargs\.\w+`,
)
Pattern B — the key-in-response form — is a fresh detector with a response-context anchor:

var keyInResponseRe = regexp.MustCompile(
	`(?s)(?:Response\.json|res\.json|res\.send|return\s+json|return\s+Response\.json)` +
		`\s*\(\s*\{[^}]{0,400}?` +
		`(?:apiKey|api_key|secret|token|key)\s*:\s*` +
		`process\.env\.[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)`,
)
Both new patterns ship with explicit negative tests. Parameterized SQL via the tagged template (sql`SELECT * FROM users WHERE id = ${userId}`) doesn't fire. Legitimate SDK construction (new OpenAI({apiKey: process.env.OPENAI_API_KEY})) doesn't fire. The point is to catch new shapes, not over-fire on safe ones.
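For readers who want to poke at the two patterns without a Go toolchain, here is a rough Python port with the positive and negative cases named in this post. The port is an assumption on our part: the constant names are illustrative, and the shipped detector may differ in details that the patterns alone do not show.

```python
import re

# Pattern A: a shell or SQL sink whose argument list references args.X.
UNSAFE_TOOL_OUTPUT_ARGS_RE = re.compile(
    r"\b(?:exec|execSync|execAsync|spawn|spawnSync|eval|run|runCommand|"
    r"runSync|sql\.unsafe|db\.unsafe|db\.query|pool\.unsafe|pool\.query|"
    r"client\.unsafe|client\.query|db\.prepare)"
    r"\s*\(\s*[^)]{0,160}?\bargs\.\w+"
)

# Pattern B: an env-var secret placed inside a response body object.
KEY_IN_RESPONSE_RE = re.compile(
    r"(?s)(?:Response\.json|res\.json|res\.send|return\s+json|"
    r"return\s+Response\.json)"
    r"\s*\(\s*\{[^}]{0,400}?"
    r"(?:apiKey|api_key|secret|token|key)\s*:\s*"
    r"process\.env\.[A-Z][A-Z0-9_]*(?:KEY|TOKEN|SECRET)"
)

SHOULD_FIRE_A = ["await execAsync(args.command)", "sql.unsafe(args.query)"]
SHOULD_STAY_QUIET_A = ["sql`SELECT * FROM users WHERE id = ${userId}`"]
SHOULD_FIRE_B = ["return Response.json({apiKey: process.env.X_API_KEY})"]
SHOULD_STAY_QUIET_B = ["new OpenAI({apiKey: process.env.OPENAI_API_KEY})"]

for line in SHOULD_FIRE_A + SHOULD_STAY_QUIET_A:
    print("A", bool(UNSAFE_TOOL_OUTPUT_ARGS_RE.search(line)), line)
for line in SHOULD_FIRE_B + SHOULD_STAY_QUIET_B:
    print("B", bool(KEY_IN_RESPONSE_RE.search(line)), line)
```

The response-context anchor is what separates the two Pattern B cases: the same apiKey: process.env... text is harmless in a constructor call and a leak in a response body.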

The numbers after 0.5.2

The 50pp jump on client-side-llm-key means both repos' #2 carrier (server route returning the key in a response body) is now caught. The 33pp jump on unsafe-tool-output means the canonical CWE-78 and CWE-89 sinks — execAsync(args.command) and sql.unsafe(args.query) — are caught. These are real-world patterns, not contrived; we saw both in the wild while building the corpus.

What didn't move: unsafe-role-merge, prompt-injection, and one of the unbounded-stream rows in repo #1. The unbounded-stream miss is a label issue — the detector hit at line 42 (stream: true) while the truth label is at line 48 (the for-await loop), 6 lines apart, outside the ±5 tolerance. Widening that label to a span fixes it; we'll do that in the v0.1.1 truth release. The role-merge and prompt-injection misses are real detector gaps, and they're the next calibration target.

The loop
Each repo becomes a learning artifact. Add a repo → score with the current tool → identify the gaps → ship detector fixes → re-score all earlier repos with the new version → build the next repo → repeat. The /bench page tracks the time-series. Each calibration cycle gets a blog post.

Next: cst-sveltekit-stream (#3) is in author-mode now. Its SvelteKit + Anthropic stack passes the system message as a separate top-level parameter (anthropic.messages.create({system: "...", messages: [...]})), not as a role inside the messages array. The existing role: "system" detector won't see it. That's the kind of stack-specific blind spot the corpus exists to surface.
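One way to picture that blind spot is as two separate patterns. The sketch below is hypothetical and is not getdebug's rule set: both regexes and the helper name are assumptions, written only to show why a role-based pattern alone misses the top-level form.

```python
import re

# Hypothetical patterns, for illustration only.
# Form 1: the system prompt is a message whose role is "system".
ROLE_SYSTEM_RE = re.compile(r"""role\s*:\s*["']system["']""")
# Form 2: the system prompt is a top-level `system:` key in messages.create({...}).
TOP_LEVEL_SYSTEM_RE = re.compile(
    r"messages\.create\s*\(\s*\{[^}]{0,400}?\bsystem\s*:", re.S
)


def has_system_prompt(source: str) -> bool:
    """True when either form of system prompt appears in the source text."""
    return bool(ROLE_SYSTEM_RE.search(source) or TOP_LEVEL_SYSTEM_RE.search(source))


role_style = 'client.chat.completions.create({messages: [{role: "system", content: p}]})'
top_level_style = 'anthropic.messages.create({system: p, messages: [{role: "user", content: q}]})'
no_system = 'anthropic.messages.create({messages: [{role: "user", content: q}]})'

print(bool(ROLE_SYSTEM_RE.search(top_level_style)))  # the role-only rule misses this form
print(has_system_prompt(top_level_style))
```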

Try it yourself
If you're building a SAST tool that targets AI-app categories, CodeSecBench is for you. The corpus + truth file are MIT licensed, and there's a vendor-side scorer at codesecbench-truth/score.js (zero deps, ~120 lines). Run your tool against the public targets, JOIN against the truth, open a PR with your results. The don't-peek norm is the only ask.

If you're a developer wondering whether your own AI app has any of these patterns today: run npm i -g @getdebug/cli, then getdebug analyze . in your project root. Version 0.5.2 lands on npm with the next release; the LLM-augmented pass (--local-llm) is already there.

I benchmarked Python AI-app security scanners. Here's what each catches.

onfafanutifafa — Fri, 05 Jun 2026 13:12:00 +0000

This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.

The four tools

Bandit (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.

Semgrep — multi-language SAST with community rule packs. Hand-written rules, free, fast.

vulnhuntr (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.

getdebug — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).

Test 1 — paired vulnerable/safe fixtures

10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).

Tool        TP  FP  FN   Precision  Recall
getdebug     5   0   0    100%       100%
bandit       1   1   4    50%        20%
semgrep      1   1   4    50%        20%
vulnhuntr    —   —   —    (unable to complete; see below)
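The precision and recall columns follow directly from the TP/FP/FN counts in the table. A short sketch of the arithmetic, using only the rows above (the function name is ours):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts copied from the Test 1 table.
rows = {"getdebug": (5, 0, 0), "bandit": (1, 1, 4), "semgrep": (1, 1, 4)}
for tool, (tp, fp, fn) in rows.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool:<10} precision={p:.0%} recall={r:.0%}")
```

With only five vulnerable fixtures, each missed bug moves recall by 20 points, so these percentages are coarse.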

Bandit and Semgrep both catch the unsafe-tool-output fixture via their generic subprocess.run(shell=True) rules. That's a TP on the vulnerable variant. But they also fire on the safe variant — the allowlist-then-run pattern:

# Safe pattern: Bandit + Semgrep both flag this as a FP
import subprocess

# Static allowlist: the model picks a tag, never the command string itself.
ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout

Neither tool knows cmd came from a static dict, not the model. They see shell=True and fire. getdebug's regex specifically requires the tool_call.input.X / block.input.X reference in the sink arg, so the allowlist-then-run pattern stays clean.
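A rough illustration of that requirement, as a sketch rather than the shipped rule: the pattern below is an assumption, and it fires only when the model-controlled reference sits inside the sink's own argument list.

```python
import re

# Hypothetical pattern, not getdebug's actual regex: a subprocess sink whose
# arguments reference tool_call.input.X or block.input.X directly.
TOOL_INPUT_IN_SINK_RE = re.compile(
    r"\bsubprocess\.(?:run|call|Popen|check_output)\s*\("
    r"[^)]{0,160}?\b(?:tool_call|block)\.input\.\w+"
)

VULNERABLE = "subprocess.run(tool_call.input.command, shell=True)"

# The allowlist-then-run shape: the model input selects a dict key, and only
# the static command string reaches the sink.
SAFE = """
cmd = ALLOWED.get(tool_call.input.tag)
if not cmd:
    return "rejected"
return subprocess.run(cmd, shell=True, capture_output=True).stdout
"""

print(bool(TOOL_INPUT_IN_SINK_RE.search(VULNERABLE)))
print(bool(TOOL_INPUT_IN_SINK_RE.search(SAFE)))
```

The trade-off is that a purely lexical rule like this cannot follow data flow: it would miss a model-controlled value that is first assigned to a local variable and then passed to the sink.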

Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}. That's the gap.

Test 2 — real-world signal/noise

We ran all three (working) tools against simonw/llm, Simon Willison's clean CLI for LLMs, 48 Python files.

Tool        Total findings    Signal
bandit      1,189            1,158 are 'assert_used' (pytest);
                              zero AI-app coverage
semgrep     3                3 generic-SAST hits;
                              zero AI-app coverage
getdebug    6                6 AI-app findings: 1 prompt-injection,
                              5 unbounded-stream

Bandit's 1,189 findings on 48 files is almost entirely the assert_used warning on pytest assertions — a well-known default everyone disables in real configs. Semgrep's 3 findings are real but none AI-app specific. getdebug's 6 are all AI-app categorized.

About vulnhuntr

vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:

--llm claude-code mode (no-API-key option) crashes with ModuleNotFoundError in 1.2.2.
--llm gpt with gpt-4o-mini fails pydantic-validation on the response.
--llm gpt with gpt-4o hits OpenAI's default 30K TPM rate limit on small accounts.
Default file-selection heuristic identifies "network-exposed" entry points — simonw/llm is a CLI, so vulnhuntr selected zero files to analyze.

We'll re-benchmark when its 2026 stack stabilises.

What this means for you

If you ship Python code that calls an LLM, run all three. They're complementary:

bandit -r .                              # general Python hygiene
semgrep --config auto .                  # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze .       # AI-app behavioural patterns

None of them subsume the others. The first two catch general SAST; getdebug catches the "serialised the whole user object into the prompt" class that you can't hand-write a sustainable rule for in generic SAST.

Reproduce every number at getdebug.dev/bench. Corpus and methodology are open at getdebug-ai/codesecbench.

Read it here: https://www.getdebug.dev/blog/python-ai-app-prefilters

We scanned 20 AI repos for leaked keys. Every scanner alert was a false positive.

onfafanutifafa — Fri, 05 Jun 2026 01:49:05 +0000

getdebug ships a secret scanner as part of its free tier — committed credentials are the one finding category we surface without an account, because the cost of a leaked key is high enough that even a 30-second check is worth running. So we did the obvious thing: we ran our scanner against 20 public AI-starter repos on GitHub, expecting to find some real leaks. The premise was that someone in a corpus of mid-popularity AI scaffolds must have committed a real OpenAI key.

Every single scanner alert was a false positive.

The numbers
Across the 20-repo sweep, our scanner produced 12 alerts at critical severity. Zero of them were real credentials. Two repos accounted for all of the noise:

stackitcloud/rag-template — 7 scanner alerts, all false positives. Every hit was a placeholder value in a .env.template file (e.g. STACKIT_VLLM_API_KEY=your-stackit-vllm-api-key) or an import.meta.env.X env-var name read. None of them were real credentials.
A popular Claude Code starter template — 5 scanner alerts, all false positives. Three were "Private key block" matches inside CHANGELOG.md and SNAPSHOT.md showing PEM-formatted example output. The other two were the funniest: PEM markers appearing in comments next to grep patterns and redaction regexes that exist to strip the same shape. Secret detectors tripping on secret-detector code.
What we shipped because of it
A false-positive rate that high on a corpus this small is a real problem. So we read every hit, classified the failure modes, and shipped three detector rules into both the CLI (@getdebug/cli) and the hosted analyze worker:

Broader env-template matching. Any file whose path or extension matches .env.template, .env.example, .env.sample, or a parent directory named examples/ is treated as template by default. Findings inside still surface, but at info severity, not critical.
Doc-context suppression. Hits inside fenced code blocks in markdown, or under headings like "Example output" / "Sample response", no longer trip critical severity. The detector still records them — they just don't page anyone.
Env-var-read skip in entropy. The entropy-based detector now recognizes process.env.X, import.meta.env.X, and os.environ["X"] as identifier reads, not opaque high-entropy strings. The variable name being long and random-looking doesn't make the access a leak.
Re-running the same 20-repo sweep after these three rules landed: 83% reduction in critical false positives. Two FPs remain — both in the same Claude Code starter template — and they need a fourth rule we haven't shipped yet (PEM-in-comment suppression, which requires cross-language comment parsing). The post is upfront about that: it's an open detector gap, not a closed one.
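The three rules can be sketched as a small severity function. This is an illustration under stated assumptions, not the detector's code: the function names, the token argument, the fence-counting approach, and the choice of "info" for doc-context hits are all ours. The post only says that template hits drop to info and that doc-context hits stop being critical.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

TEMPLATE_SUFFIXES = (".env.template", ".env.example", ".env.sample")
ENV_READ_RE = re.compile(
    r"""process\.env\.\w+|import\.meta\.env\.\w+|os\.environ\[["']\w+["']\]"""
)
FENCE = "`" * 3  # markdown code-fence marker


def is_template_path(path: str) -> bool:
    """Rule 1: env-template file names, or any parent directory named examples."""
    p = PurePosixPath(path)
    return p.name.endswith(TEMPLATE_SUFFIXES) or "examples" in p.parts[:-1]


def in_markdown_fence(lines: list[str], line_no: int) -> bool:
    """Rule 2: an odd number of fence markers before the 1-indexed line."""
    opened = sum(1 for t in lines[: line_no - 1] if t.lstrip().startswith(FENCE))
    return opened % 2 == 1


def severity(path: str, lines: list[str], line_no: int, token: str) -> Optional[str]:
    """Return None to skip the hit, else its severity."""
    if ENV_READ_RE.fullmatch(token):
        return None  # Rule 3: an env-var read, not a literal secret
    if is_template_path(path):
        return "info"
    if path.endswith(".md") and in_markdown_fence(lines, line_no):
        return "info"
    return "critical"


changelog = ["# Example output", FENCE, "-----BEGIN PRIVATE KEY-----", FENCE]
print(severity("CHANGELOG.md", changelog, 3, changelog[2]))
```

Downgrading instead of dropping keeps the hit in the report, which matches the post's point that the detector still records these findings.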

Why publish this
Every security vendor's landing page claims a low false positive rate. Almost nobody shows the work. We'd rather be the team that publishes its scanner being wrong, ships fixes in public, and re-runs the numbers — because that's what we'd want from a tool we were thinking of buying.

The corpus is reproducible. The methodology, the per-tool numbers, and the JSON output schema live under /bench. If you find a case the scanner gets wrong on your own code, the issue tracker on github.com/getdebug-ai/cli is the right place to put it. Detector tuning is still early, and the easiest way to improve it is to point us at the noise.

Try it
The secret-scanning detectors that came out of this sweep are in the free tier of the CLI. No account needed; nothing leaves your laptop.

macOS / Linux

brew install getdebug-ai/tap/getdebug
getdebug analyze .
Full install instructions and the rest of the commands are in the docs.

You can read it here: https://www.getdebug.dev/blog/credibility-scan

I used to guard buildings. Now I guard codebases.

onfafanutifafa — Wed, 03 Jun 2026 23:25:41 +0000

I come from physical security, mainly man-guarding and asset protection. I recently took on the challenge of venturing into information and cyber security. So far I can say the mentality for both is the same: they differ in technique, but the outcomes are the same, in that both are primarily focused on asset or data privacy and protection. Offensive cybersecurity often happens between nation states, but that does not mean corporate entities or individuals do not indulge. They do so cautiously, because breaking into unauthorised networks and domains is a crime. More often than not, countries get away with it, but corporations and individuals face the sharp end of the sword. Offensive information and cyber security experts act under strict regulations and laws to safeguard the data and sovereignty of corporations and nations. I say this to emphasise that the core principles of physical security and of information and cyber security are the same: both are focused on protecting rather than exploiting.

The difference: they differ in technique, in the sense that the tools needed to work through a problem are different, but the goals are the same. Private security, like health care, only becomes top of mind when things go wrong. Research shows that businesses and people see security as critical to their business and to their brand, but fewer people actually reach for their wallet.[1, 2] The price we pay for the lack of security outstrips the immediate cost of buying it. This is why I am a security man. When I talk to my clients I always make the same point: security is a mindset shift. You can buy security, but you can never buy safety. Selling you security does not mean I can promise you will never be breached, because security by its very nature is not absolute. The systems you call safe are the same systems someone else walks through with ease. For example, when Anthropic launched Mythos[3, 4, 5], it uncovered tens of thousands of vulnerabilities[6, 7] in systems long assumed to be safe, including a twenty-seven-year-old flaw in OpenBSD, one of the most hardened operating systems in the world.[3, 7] Safe, until it was not. So I do not sell certainty. I am honest about my methods and honest about the odds, and that honesty is what earns a client’s confidence in how protected they really are. A client who understands the true odds is in a far stronger position than one who has been sold the impossible.

Now more than ever, everything is moving to the web or a network, and the cost of moving has drastically fallen. Moving is the no-brainer every organisation goes to in order to store information and customer data. The great multiplier and enabler of the wind of change is artificial intelligence. AI is great at what it does — it generates working code faster than any team can review it. For people like me who have come to understand what it is, I use it with caution, and this is why I think protecting networks and security in general is going to see an uptick in growth in the coming years. Senior programmers have admitted they cannot keep up with AI-written codebases: the issues are rarely simple, and the sheer quantum of code they have to comb through to find them is overwhelming.[8, 9, 10] Past a few hundred lines a review stops being a review and becomes a rubber stamp[11]; so when a developer is handed ten thousand lines of clean, confident-looking AI code, the easy choice — the human choice — is to trust it and ship it as is.[8] The bug ships with it. This is why I strongly believe AI can be the key mediator here — to narrow down bugs for devs and dev teams to navigate successfully.

This is why I created getdebug.dev. getdebug.dev is an AI-powered codebase analyser and auto-fixer. It works simply: you connect a codebase or repository from a version control platform like GitHub or GitLab, and getdebug indexes it. That index is what makes it possible to analyse the code and detect bugs, business-logic gaps, and broken access controls. And that last one is the whole point — broken access controls are a security failure, not just a coding one. This is where my two worlds meet. Whether you are guarding a building or guarding a codebase, the job is the same: find the gap before someone else does. getdebug is how I bring the protection mindset to the place everything is now moving — the code itself.

Now, I am not the first person to think of this. There are good tools out there already doing code review and bug hunting, and some of them are very good. I know them. I did not build getdebug because the others are bad. I built it because they think like engineers and I think like a security man. To most of these tools a bug is a bug, one more item on a list to clean up. To me every bug is a door. Some doors lead to nothing, and others lead straight into the house. A broken access control is not a code quality problem, it is an unlocked door waiting for someone to walk through. I cannot unsee it that way. So getdebug does not just ask “is this code clean,” it asks “where can someone get in.” That is the difference, and it is a difference in how I see the work, not just in features.

Two things follow from that. The first is that getdebug is built for the new kind of software people are shipping now, the AI apps. The mistakes AI apps make are their own breed: prompt injection, leaking keys to the browser, trusting output they should never trust. Most tools catch these by accident, if at all. getdebug looks for them on purpose, because that is where the doors are being left open today. The second is privacy, and I mean a real choice, not a slogan. You can connect your repo and let getdebug work in the cloud, or you can run it entirely on your own machine where your code never leaves your hands. Some teams cannot let their code travel, and they should still be able to secure it. So I built both.

But the part I care about most is that getdebug learns. When you tell it a flagged line is fine because you meant it that way, it remembers, and it stops bothering you about it. Good review tools do this for code style now. getdebug does it for security: it learns which doors you have deliberately left open and which it should keep watching, and it gets sharper at the difference the longer it guards your codebase. That is the part I am building everything else around. You can try it at https://www.getdebug.dev/

References

[1] Cybersecurity Dive. “Are businesses underinvesting in cybersecurity?”
https://www.cybersecuritydive.com/news/security-budgets-enterprise-CISO/595036/
[2] Help Net Security. “Cybersecurity spending keeps rising, so why is business impact still hard to explain?” (Jan 15, 2026).
https://www.helpnetsecurity.com/2026/01/15/expel-cybersecurity-investment-decisions/
[3] Anthropic. “Claude Mythos Preview” (primary source: Mythos launch and the 27-year-old OpenBSD SACK flaw).
https://red.anthropic.com/2026/mythos-preview/
[4] Anthropic. “Project Glasswing: Securing critical software for the AI era.”
https://www.anthropic.com/glasswing
[5] TechCrunch. “Anthropic scales Claude Mythos to critical infrastructure in 15+ countries” (Jun 2, 2026).
https://techcrunch.com/2026/06/02/anthropic-scales-claude-mythos-to-critical-infrastructure-in-15-countries/
[6] SecurityWeek. “Anthropic: Mythos Detected 23,000 Potential Vulnerabilities Across 1,000 OSS Projects.”
https://www.securityweek.com/anthropic-mythos-detected-23000-potential-vulnerabilities-across-1000-oss-projects/
[7] Crypto Briefing. “Anthropic’s Mythos detects 23,000 vulnerabilities in open-source projects, including a 27-year-old OpenBSD flaw.”
https://cryptobriefing.com/anthropic-mythos-open-source-vulnerabilities/
[8] GitClear. “AI Copilot Code Quality: 2025 Research” (10M+ commits; code churn, copy/paste, the “illusion of correctness”).
https://www.gitclear.com/ai_assistant_code_quality_2025_research
[9] The Register. “AI-authored code contains worse bugs than software crafted by humans” (Dec 17, 2025).
https://www.theregister.com/2025/12/17/ai_code_bugs/
[10] arXiv. “Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity” (2025).
https://arxiv.org/abs/2508.21634
[11] Salesforce Engineering. “Scaling Code Reviews: Adapting to a Surge in AI-Generated Code” (on review degradation past a few hundred lines).
https://engineering.salesforce.com/scaling-code-reviews-adapting-to-a-surge-in-ai-generated-code/