What happens when you run a modern security pipeline against real-world AI infrastructure codebases — and let the signal speak for itself.
The Alert Fatigue Economy
Let's put real numbers on the table.
A Go service. 218 source files. Production codebase. Raw scanner output: 98 findings — every single one labeled HIGH. After path analysis and context validation: 24 production findings. The remaining 74? Test scaffolding. Never runs in production. Cannot be triggered by any external actor.
Your team just spent hours triaging what a pipeline should have filtered in seconds.
This is not an edge case. This is Tuesday.
The core issue isn't that scanners are bad. It's that they were never designed to understand your project. They don't know:
- Whether a flagged file is test code or production code
- Whether a vulnerable function is publicly reachable or locked behind internal trust boundaries
- Whether exec.Command in Go actually invokes a shell (it doesn't, by default)
They produce a flat list. Same weight. Same severity. No signal.
And when everything is critical — nothing is.
The result isn't a security report. It's alert fatigue with a dashboard.
Pattern Detection Is a Solved Problem. Reachability Isn't.
Your scanner found the vulnerability. It has no idea if anyone can reach it.
Semgrep, Bandit, custom rules — they will find every dangerous sink in your codebase. Fast, consistent, at scale. That's not the hard part anymore.
The hard part is what happens after.
Take exec.Command in Go. A raw scanner flags it. Every time. HIGH severity. Command injection risk. What the scanner doesn't tell you:
- exec.Command in Go does not invoke a shell by default
- If the argument comes from a hardcoded config constant, it's not exploitable
- If the function lives in devtools/ and never compiles into the production binary, it doesn't exist at runtime
Same rule. Same severity label. Three completely different risk profiles.
This is where orchestrated pipelines separate from raw scanner output.
Layer 1 — Weighted context scoring: A finding in test/, mock/, bench/ gets its risk weight reduced dramatically. A finding in a CI/CD pipeline config gets elevated. Same rule, different weight based on where it lives.
Layer 2 — Reachability analysis: Static taint tracking traces the path from user-controlled input — HTTP params, CLI args, environment variables — through your codebase to the dangerous sink. No external path to the sink? Fundamentally different risk than a direct, unguarded path from a public API handler.
Layer 3 — AI validation with full context: Not just the flagged line. The enclosing function, import chains, call graph, data flow. The model determines one thing: is this actually exploitable in this specific context, or does the pattern match while the exploit path doesn't exist?
The result isn't a longer report. It's a shorter one — with every finding on it worth acting on.
The Metric Problem
"We closed 200 vulnerabilities this quarter."
So what.
That number means nothing without context. 200 findings closed — in test code that never ran in production. 200 findings closed — none of which had a reachable exploit path. 200 findings closed — while 3 verified SQL injection points in your public API sat in the backlog.
Count is not posture. Velocity is not security.
CISOs present findings-closed to boards. Boards approve budgets based on findings-closed. Teams get measured on findings-closed. And the actual attack surface stays exactly the same.
Here's what a useful security metric looks like: Security Posture Index (SPI).
Not a count. A score. Calculated from weighted, context-adjusted, reachability-validated findings — not raw scanner output. A project with 200 findings all in test scaffolding scores differently than a project with 8 findings, three of which are AI-verified injection vulnerabilities in a public-facing handler. Same tool. Opposite risk profiles. Completely different scores.
And one more layer that most pipelines skip entirely: Credibility.
Any scoring system can be gamed. Exclude high-risk directories from scan scope. Run the scanner against a subset of the codebase. Tune rules for false-negative inflation. A credibility engine detects these anomalies — unusual finding-to-file ratios, sudden score jumps, profiles where 100% of findings land in test code. When anomalies appear, the score is flagged as unreliable.
Because a metric you can manipulate isn't a security signal. It's a compliance theater prop.
The Real Test: 7 Codebases, One Pipeline
Theory is easy. So we ran the pipeline against 7 real-world open-source AI infrastructure projects — actively maintained, production-grade, varying in size and language stack. No cherry-picking. No controlled benchmarks. Just the tool and the code.
The results across all 7 repositories:
| Metric | Value |
|---|---|
| Total raw findings | ~7,600 |
| Findings in test/noise context | 7,599 (99.99%) |
| AI-verified, reachable, actionable | 1 |
| Responsible disclosure letters sent | 1 |
One finding. Out of seventy-six hundred.
That's not a failure of the tool. That's the tool doing exactly what it should — refusing to call something a vulnerability until it can prove someone can reach it and exploit it.
What the one finding looked like
In one repository, the pipeline flagged hardcoded credentials in an example file — a token and a database connection string written directly into source code intended as a "getting started" template.
The values looked like placeholders. That's the problem.
"Getting started" templates are the code that gets copy-pasted into production unchanged. A developer following a quickstart won't necessarily know to replace inline strings with environment variables. The result: quietly exposed infrastructure, by default, at scale across every user who followed the example.
The fix was one line. The risk without it was real. The AI verified it with 90% confidence. A responsible disclosure letter was sent.
What the other 7,599 findings looked like
One repository alone contributed over 4,400 findings of a single type: BANDIT_B101 — use of assert detected. This is a Python code quality note. Not a vulnerability. Not a risk. A style suggestion that Bandit emits on every assert statement in every file.
88% of one project's entire report. Zero actionable findings.
Another repository flagged 266 instances of potential command injection — all of them in vendor-bundled, minified JavaScript files (CSS frameworks, PDF renderers). The scanner matched a pattern. The pattern existed in code that nobody wrote, nobody maintains, and nobody can meaningfully patch.
If either of these reports landed in an engineer's queue unfiltered, you'd lose days.
What the Signal-to-Noise Ratio Actually Means
Across 7 codebases, the verified signal rate was approximately 0.013%.
That's not a critique of the projects. These are well-maintained, professionally developed repositories. The raw findings reflect scanner sensitivity, not poor engineering.
It's a critique of treating raw scanner output as security intelligence.
The one finding that mattered — the hardcoded credential in an example file — would have been buried in a flat report of thousands. Or worse, marked LOW severity and deprioritized, because automated systems don't understand supply chain risk through documentation.
The pipeline found it not because it scanned harder. Because it understood context.
The Question Worth Asking
Most security pipelines today answer: "How many patterns did our scanner match?"
The question that matters: "Is our executable attack surface smaller than it was 90 days ago?"
If your current tooling can't answer that — you're not measuring security. You're measuring activity.
Raw scanners are data sources. Treat them like final authorities and you're triaging noise for a living.
Auditor Core is a security auditing engine that combines weighted context scoring, static taint tracking, and AI-verified reachability analysis to separate signal from noise in real codebases.
Try It Yourself
The tool described in this article is available as a free demo — 3 runs, no signup, no telemetry.
```bash
git clone https://github.com/auditor-core-systems/auditor-core-demo.git
cd auditor-core-demo
bash start.sh
./audit /path/to/your/project
```
Run it against your own codebase. See your SPI score. Check what actually reaches production.
For PRO license (unlimited runs + AI advisory): eldorzufarov66@gmail.com