Patience Mpofu

False Positives in SAST — How I Built Suppression Into My Scanner and Why It Matters

There's a failure mode that kills security tooling programmes quietly, without drama, and it's not a technical failure.

It's a trust failure.

It goes like this: a team enables a SAST scanner. The scanner fires on 200 things. Engineers triage 40 of them and discover that 25 are false positives. They fix the 15 real findings, suppress the 25 false positives, and then face another 160 findings they haven't looked at yet. Two sprints later, nobody is triaging anymore. The scanner still runs. The reports still generate. Nobody reads them. The security programme is theatre.

False positives are the mechanism by which this happens. Not because developers are lazy — because time is finite and trust is fragile. If a scanner cries wolf enough times, engineers stop listening. That's rational behaviour, not negligence.

This article is about how I thought about false positives when building my SAST tool, what I built to manage them, and why the suppression system design matters as much as the detection rules themselves.


What a False Positive Actually Costs

Before getting into solutions, it's worth being precise about the cost.

A false positive in a SAST scanner costs:

  • Triage time — an engineer has to read the finding, understand the rule, examine the code in context, and reach a conclusion. Even for an experienced engineer, that's 5–15 minutes per finding for anything non-trivial.
  • Trust capital — every false positive is a small withdrawal from the trust account between the security team and the engineering team. Trust capital is finite and slow to rebuild.
  • Attention budget — the more false positives exist, the less attention real findings receive. This is the most dangerous cost. Security is fundamentally an attention allocation problem. A scanner with a 40% false positive rate isn't 40% less useful. It's potentially useless, because the signal-to-noise ratio has collapsed to the point where engineers can't efficiently find real findings among the noise.

The Three Sources of False Positives

Not all false positives are the same. Understanding where they come from determines how to address them.

1. Context-Blind Pattern Matching

This is the most common source in regex-based scanners. The pattern matches the text but doesn't understand what the code is doing.

The MD5 example I've used throughout this series is the canonical case:

# False positive — MD5 for file integrity, not passwords
file_hash = hashlib.md5(file_content).hexdigest()

# True positive — MD5 for password storage
stored_password = hashlib.md5(user_password).hexdigest()

Both lines match the pattern \bmd5\s*\(. Only the second is a vulnerability. A regex scanner cannot tell them apart without understanding the semantic context — what type of data is being hashed.
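
You can see the limitation by running the rule's pattern against both lines. A minimal sketch (the pattern is the one quoted above; everything else is illustrative):

import re

# The exact pattern quoted above: it sees text, not intent
MD5_PATTERN = re.compile(r'\bmd5\s*\(')

lines = [
    'file_hash = hashlib.md5(file_content).hexdigest()',           # integrity check
    'stored_password = hashlib.md5(user_password).hexdigest()',    # credential storage
]

for line in lines:
    print(MD5_PATTERN.search(line) is not None, line)
# Prints True for both: the regex alone cannot separate the benign case from the vulnerable one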

2. Safe Framework Usage That Looks Dangerous

Some frameworks make inherently dangerous operations safe through abstraction. The dangerous-looking code is actually fine because the framework handles the dangerous part.

// Looks like SQL injection — it's not
// Spring Data JPA with @Query annotation handles parameterisation
@Query("SELECT u FROM User u WHERE u.email = :email")
User findByEmail(@Param("email") String email);

A naive injection rule that flags anything resembling a SQL query with a variable near it would fire here. The JPA annotation system makes this perfectly safe — but the scanner doesn't know that.

3. Test and Configuration Code

Test files are full of patterns that would be alarming in production code:

# test_auth.py
def test_jwt_none_algorithm_rejected():
    # Testing that we correctly REJECT the none algorithm
    malicious_token = jwt.encode({"user": "admin"}, "", algorithm="none")
    response = client.post("/auth", json={"token": malicious_token})
    assert response.status_code == 401  # Should be rejected

This test is doing exactly the right thing — verifying that the application rejects the none algorithm attack. But a scanner looking for algorithm="none" will flag it as AUTHN-001 without understanding that this is a negative test case.


What I Built: The Suppression System

My scanner supports two suppression mechanisms, each designed for different scenarios.

Inline Suppression Annotations

The simplest mechanism: a comment on the same line as the finding tells the scanner to skip it.

file_hash = hashlib.md5(file_content).hexdigest()  # sast-ignore

I support two annotation formats — # sast-ignore and # nosec — because nosec is the Bandit convention and teams coming from Bandit shouldn't have to change their existing annotations.

The scanner checks for these annotations before reporting a finding. If either is present on the matched line, the finding is suppressed silently.
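
For context, the check itself is simple. This is not my scanner's exact code, just a minimal sketch of the idea; the two annotation names are the real ones, the rest is illustrative:

SUPPRESSION_MARKERS = ("sast-ignore", "nosec")   # both supported formats

def is_suppressed(source_line: str) -> bool:
    """True if the matched line carries an inline suppression annotation."""
    comment_start = source_line.find("#")
    if comment_start == -1:
        return False
    comment = source_line[comment_start:].lower()
    return any(marker in comment for marker in SUPPRESSION_MARKERS)

# is_suppressed('file_hash = hashlib.md5(data).hexdigest()  # sast-ignore') -> True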

The problem with silent suppression: It's invisible. If every suppression silently disappears from the report, there's no way to audit whether suppressions are legitimate or whether engineers are using them to hide real findings.

Suppression With Justification

The better pattern — and what I recommend teams enforce in code review — is annotating why the suppression is valid:

# MD5 used for file integrity checking only, not credential storage
# Tracked in SEC-REVIEW-2024-041 — confirmed non-sensitive context
file_hash = hashlib.md5(file_content).hexdigest()  # sast-ignore

The annotation still suppresses the finding, but the comment creates a paper trail. When a security audit happens — and it will — every suppression has a documented rationale that a reviewer can evaluate. "We reviewed this and it's fine because X" is defensible. A bare # sast-ignore with no context is not.

The Suppression Inventory in JSON Output

Here's a design decision I'm particularly pleased with: suppressed findings don't disappear from the JSON report. They appear in a separate suppressed_findings array:

{
  "findings": [
    {
      "id": "CRYPTO-002",
      "title": "SHA-1 Usage Detected",
      "severity": "HIGH",
      "file": "src/utils/crypto.py",
      "line": 47
    }
  ],
  "suppressed_findings": [
    {
      "id": "CRYPTO-001",
      "title": "Weak Hashing — MD5",
      "severity": "HIGH",
      "file": "src/utils/file_integrity.py",
      "line": 23,
      "suppression_reason": "MD5 used for file integrity only — sast-ignore"
    }
  ],
  "summary": {
    "total_findings": 1,
    "suppressed": 1,
    "by_severity": { "HIGH": 1 }
  }
}

This means:

  • The pipeline counts only active findings when deciding whether to fail
  • The full report shows both active and suppressed findings
  • Security reviewers can audit suppressions without looking at individual source files
  • Trend analysis can track suppression rates over time alongside finding rates (a sketch of this follows below)

That last point matters for measuring programme health. If your suppression count is growing faster than your finding count, something is wrong — either your rules are too noisy, or engineers are gaming the system.
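
For example, a small trend job could read each scan's JSON report and compute the suppression rate. A rough sketch, assuming the report structure shown above (the file name and threshold are illustrative):

import json

def suppression_rate(report_path: str) -> float:
    """Suppressed findings as a fraction of everything the scanner matched."""
    with open(report_path) as f:
        report = json.load(f)
    active = len(report.get("findings", []))
    suppressed = len(report.get("suppressed_findings", []))
    total = active + suppressed
    return suppressed / total if total else 0.0

rate = suppression_rate("sast-report.json")
if rate > 0.20:   # the "below 20%" health threshold discussed later in this article
    print(f"Warning: suppression rate is {rate:.0%}, rules may be too noisy")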

Confidence Levels as Pre-Emptive Noise Reduction

The suppression system deals with false positives after they appear. Confidence levels deal with them before.

Every pattern in my rule engine declares a confidence level:

patterns:
  - regex: 'pickle\.loads?\s*\('
    confidence: HIGH     # Almost always a real finding
  - regex: 'unserialize\s*\('
    confidence: MEDIUM   # Real finding in PHP web context, benign in CLI context
  - regex: 'request\.headers\.get\(["\']Origin["\']\)'
    confidence: LOW      # Could be proper allowlist implementation

Confidence levels serve two purposes.

For engineers reading findings: Confidence communicates how much manual review a finding deserves. A HIGH confidence finding deserves immediate attention. A LOW confidence finding is a prompt to look at the code and make a judgment call. Without this signal, every finding looks equally important — which means either everything gets treated as urgent (unsustainable) or everything gets triaged with the same low attention (misses real issues).

For pipeline configuration: Teams can configure their build gate to fail only on findings above a confidence threshold:

# Fail on HIGH severity + HIGH confidence only
python main.py ./src --fail-on HIGH --min-confidence HIGH

# See everything including LOW confidence findings in audit mode
python main.py ./src --fail-on none --min-confidence LOW

This is a more nuanced gate than severity alone. A MEDIUM severity finding with HIGH confidence (this is almost certainly real, and it's moderately serious) might warrant blocking. A HIGH severity finding with LOW confidence (this is probably bad, but it might be fine) might not. The two dimensions together give you much more precise control over your signal-to-noise ratio.
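
To make the two-dimensional gate concrete, here is a rough sketch of the decision logic. It is not my scanner's actual implementation, and it assumes each finding carries the confidence of the pattern that produced it:

SEVERITY_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
CONFIDENCE_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

def should_fail_build(findings, fail_on="HIGH", min_confidence="HIGH") -> bool:
    """Fail only if an active finding clears both the severity and confidence thresholds."""
    if fail_on.lower() == "none":
        return False   # audit mode: report everything, block nothing
    return any(
        SEVERITY_ORDER[f["severity"]] >= SEVERITY_ORDER[fail_on]
        and CONFIDENCE_ORDER[f["confidence"]] >= CONFIDENCE_ORDER[min_confidence]
        for f in findings
    )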


The Suppression Review Process

The suppression mechanism is only as good as the governance around it. A suppression system without a review process is just a way to silence the scanner faster.

Here's the process I'd implement in a team setting:

Step 1 — Developer identifies a finding they believe is a false positive.
They don't suppress it immediately. They raise it in the PR for discussion.

Step 2 — The team reviews the claim.
Is the developer's reasoning sound? Is the code actually safe in context? Does anyone have concerns? This is a two-minute conversation in most cases, not a security committee meeting.

Step 3 — If accepted, the suppression is added with justification.
The # sast-ignore goes in with a comment explaining why. The suppression is visible in the PR diff — it can't be hidden.

Step 4 — The suppression is tracked.
In the JSON report, in a suppression registry spreadsheet, or in a dedicated Notion page — wherever works for your team. What matters is that someone periodically reviews the suppression inventory and asks: are these still valid?

Step 5 — Periodic suppression review.
Suppressions rot. Code changes. The context that made a suppression valid six months ago may no longer apply. A quarterly review of active suppressions — not of the whole codebase, just the suppression inventory — keeps the list honest.
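
A quarterly review is easier when the inventory is one command away. Here is a small sketch that walks a repository and prints every inline suppression so a reviewer can check that the justification above it still holds (the marker names are the ones used earlier; the rest is illustrative):

from pathlib import Path

MARKERS = ("sast-ignore", "nosec")

def list_suppressions(root: str = ".") -> None:
    """Print every line carrying an inline suppression annotation."""
    for path in Path(root).rglob("*.py"):
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for number, line in enumerate(lines, start=1):
            if any(marker in line for marker in MARKERS):
                print(f"{path}:{number}: {line.strip()}")

if __name__ == "__main__":
    list_suppressions()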


Tuning Rules to Reduce Systemic False Positives

When a specific rule consistently generates false positives across the codebase, the right answer isn't to suppress every instance — it's to tune the rule.

The MD5 rule is a good example. Rather than flagging every md5( call at HIGH confidence, I could tighten the pattern to focus on contexts that suggest credential handling:

Before (noisy):

patterns:
  - regex: '\bmd5\s*\('
    confidence: HIGH

After (tighter):

patterns:
  - regex: 'md5\s*\(\s*(password|passwd|pwd|secret|credential|token)'
    confidence: HIGH
  - regex: '(password|passwd|pwd)\s*=\s*.*md5\s*\('
    confidence: HIGH
  - regex: '\bmd5\s*\('
    confidence: LOW   # Generic usage — review context

Now the rule distinguishes between MD5 in credential contexts (HIGH confidence, almost certainly a problem) and generic MD5 usage (LOW confidence, warrants a look but probably fine). The total finding count might be the same, but the actionable finding count — the ones that genuinely require a fix — goes up as a proportion of the total.
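
You can sanity-check the tightened rule the same way, by running the patterns against sample lines and seeing which confidence tier each one lands in (the sample lines are illustrative):

import re

HIGH_CONFIDENCE = [
    re.compile(r'md5\s*\(\s*(password|passwd|pwd|secret|credential|token)'),
    re.compile(r'(password|passwd|pwd)\s*=\s*.*md5\s*\('),
]
LOW_CONFIDENCE = re.compile(r'\bmd5\s*\(')

samples = [
    'file_hash = hashlib.md5(file_content).hexdigest()',
    'password = hashlib.md5(raw_input).hexdigest()',
]

for line in samples:
    if any(p.search(line) for p in HIGH_CONFIDENCE):
        print("HIGH", line)
    elif LOW_CONFIDENCE.search(line):
        print("LOW ", line)
# The file-integrity line lands at LOW, the credential line at HIGH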

This is the most sustainable way to reduce false positives: better rules, not more suppressions.


The False Negative Trade-off

Every time you tune a rule to reduce false positives, you risk introducing false negatives — real vulnerabilities the scanner no longer catches.

This is the fundamental tension in SAST tool design. It has no clean resolution. It only has a deliberate choice.

If you tighten the MD5 rule so that only credential contexts reach HIGH confidence, and your pipeline gate only acts on HIGH confidence findings, you'll effectively miss the case where a developer uses a custom variable name:

# Drops to LOW confidence under the tightened rule, so a HIGH-only gate never sees it
user_auth_hash = hashlib.md5(user_password).hexdigest()

The question is: which failure mode is more expensive for your specific context?

If your team is diligent about triage and the cost of a false negative (missed vulnerability) is high — financial services, healthcare, anything with regulatory consequences — keep rules broader and invest in the triage process.

If your team is drowning in noise and findings aren't getting triaged at all — the scanner has already effectively failed — tighten the rules to rebuild trust, accept the trade-off, and plan to layer in additional controls elsewhere.

There's no universally correct answer. There's only an honest assessment of your specific situation.


What a Healthy Suppression Profile Looks Like

After a few months of running the scanner with a consistent process, here's what healthy metrics look like:

Suppression rate below 20%. If more than 1 in 5 findings is being suppressed, your rules are too noisy for your codebase. Tune the rules rather than suppressing everything.

No suppressions without justification comments. Bare # sast-ignore annotations with no explanation are a red flag. Make justification comments a code review requirement.

Suppression inventory reviewed quarterly. Old suppressions that are no longer valid are silent technical debt. A quarterly review catches them.

False positive rate declining over time. As you tune rules based on real-world results, your false positive rate should go down. If it's stable or increasing, you're not learning from your suppression data.

New findings triaged within one sprint. If findings from a scan are still unreviewed after two weeks, your triage process isn't keeping up. Either reduce the finding volume (tune rules) or increase triage capacity.


The Bigger Point

False positive management is not a technical problem. It's a trust and process problem that has technical levers.

The mechanisms in my scanner — inline annotations, justification comments, suppressed findings in the JSON output, confidence levels on patterns — are all technical levers. But they only work in the context of a team that has agreed on how to use them.

The best SAST implementation I can imagine is one where:

  • Engineers trust the scanner because it has a low false positive rate
  • The scanner trusts engineers because suppressions are reviewed and justified
  • Security teams trust both because the suppression inventory is auditable and periodically reviewed

That's not a configuration. That's a culture. The configuration just makes the culture possible.

Full source and suppression documentation at github.com/pgmpofu/sast-tool.

Next up — the final article in this series: what building all of this taught me about application security that 13 years of software engineering didn't.
