DEV Community: Patience Mpofu

Modernising a 6-Year-Old Spring Boot Project Without Breaking Everything

Patience Mpofu — Mon, 18 May 2026 14:27:18 +0000

Before I could meaningfully remediate the 188 vulnerabilities Snyk found in MFlix, I had to confront something uncomfortable.

The project structure itself was the problem.

Not the code — the code was fine for what it was. But the way it was organised, configured, and built reflected 2018 Spring Boot conventions that created friction for every subsequent change. Trying to apply modern security fixes to an unrenovated codebase is like trying to rewire a house without updating the fuse box. You can do it, but every step is harder than it needs to be.

This article is about the modernisation work I did before touching a single CVE — what the 2019 structure looked like, what I changed, why I changed it, and what I deliberately kept.

What a 2019 Spring Boot Project Looks Like

When MFlix was built, Spring Boot 2.0.x was the current major version. Java 8 was the standard enterprise runtime. The project structure followed conventions of that era:

mflix/
├── pom.xml
├── src/
│   ├── main/
│   │   ├── java/
│   │   │   └── mflix/
│   │   │       ├── api/
│   │   │       │   ├── MoviesController.java
│   │   │       │   └── UsersController.java
│   │   │       ├── config/
│   │   │       │   └── MongoDBConfiguration.java
│   │   │       └── daos/
│   │   │           ├── MovieDao.java
│   │   │           └── UserDao.java
│   │   └── resources/
│   │       └── application.properties
│   └── test/
│       └── java/
│           └── mflix/

Functional. Reasonable for its time. But several things stood out immediately when I looked at it with fresh eyes in 2025:

Java 8 compiler target. The pom.xml declared <source>1.8</source> and <target>1.8</target>. Java 8 reached end-of-life for free Oracle support in January 2019 — the same month this project was likely being committed. Six years of security patches, language improvements, and performance gains left on the table.

Mixed Spring Boot versions. The pom.xml declared spring-boot-starter-web@2.0.3 and spring-boot-starter-security@2.0.4 separately, with explicit version pinning on individual Spring Framework components (spring-context, spring-core, spring-web all at 5.0.7). Modern Spring Boot projects use a parent BOM (Bill of Materials) that manages version alignment across the entire Spring ecosystem. Manually pinning individual Spring component versions is how you end up with the kind of version drift that generates 188 CVEs.

No dependency management section. Without a <dependencyManagement> block or a parent BOM, transitive dependency versions are determined entirely by whatever the top-level dependencies pull in — with no explicit control or visibility.

application.properties with a hardcoded MongoDB URI. The connection string for the MongoDB Atlas cluster was in the properties file rather than being externalised to environment variables. That's not a Snyk finding, but it's a security hygiene issue that should be addressed before anything else.

The Modernisation Goals

I set three goals before writing a line of changed code:

Goal 1: Move to a Spring Boot parent BOM. This single change would bring version alignment across the entire Spring ecosystem under centralised control. Every Spring component version becomes managed by the BOM rather than individually pinned.

Goal 2: Upgrade the Java target to 17. Java 17 is a long-term-support (LTS) release and the minimum version Spring Boot 3.x supports. Moving from Java 8 to Java 17 picks up more than seven years of language and runtime evolution and gives access to Spring Boot 3.x's security improvements.

Goal 3: Externalise secrets. The MongoDB connection URI, JWT signing key, and any other credentials needed to move out of application.properties and into environment variables before any other change.

Goal 3 was intentionally first. Before running any security tooling or making any dependency changes, the sensitive configuration needed to be out of the code.

Step 1: Externalising Secrets

The application.properties contained:

spring.data.mongodb.uri=mongodb+srv://admin:password@cluster.mongodb.net/mflix
jwt.secret=mflix-jwt-secret-key

Both values needed to go. The replacement:

# application.properties — safe to commit
spring.data.mongodb.uri=${MONGODB_URI}
jwt.secret=${JWT_SECRET}

# .env — never committed, in .gitignore
MONGODB_URI=mongodb+srv://admin:password@cluster.mongodb.net/mflix
JWT_SECRET=mflix-jwt-secret-key

And .gitignore updated:

.env
*.env
application-local.properties

Simple. Ten minutes. Should have been done in 2019. The secrets detector I wrote would have caught both of these had it been running as a pre-commit hook — which is a satisfying bit of cross-project validation.
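Once the values live outside the properties file, the application depends on MONGODB_URI and JWT_SECRET being present in the process environment. Spring reports an unresolved ${...} placeholder at startup, but an explicit check gives a clearer message. The following is a minimal sketch of such a check, assuming nothing beyond the JDK; the RequiredEnv class and its missing method are hypothetical names, not part of MFlix or Spring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check; class and method names are illustrative only. */
final class RequiredEnv {

    /** Returns the names (never the values) of variables that are absent or blank. */
    static List<String> missing(Map<String, String> env, String... names) {
        List<String> absent = new ArrayList<>();
        for (String name : names) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                absent.add(name);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        List<String> absent = missing(System.getenv(), "MONGODB_URI", "JWT_SECRET");
        if (!absent.isEmpty()) {
            // Report names only, so a secret value never ends up in a log line.
            throw new IllegalStateException("Missing environment variables: " + absent);
        }
        System.out.println("All required environment variables are set.");
    }
}
```

Passing the environment in as a Map keeps the check testable without touching the real process environment.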

Step 2: Introducing the Spring Boot Parent BOM

The single most impactful structural change in the modernisation was adding the Spring Boot parent BOM.

Before:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>mongodb.university</groupId>
    <artifactId>mflix</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>2.0.3.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>5.0.7.RELEASE</version>
        </dependency>
        <!-- etc — every version pinned manually -->
    </dependencies>
</project>

After:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>mongodb.university</groupId>
    <artifactId>mflix</artifactId>
    <version>1.0-SNAPSHOT</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.2.5</version>
        <relativePath/>
    </parent>

    <properties>
        <java.version>17</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <!-- No version — managed by parent BOM -->
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-security</artifactId>
            <!-- No version — managed by parent BOM -->
        </dependency>
        <!-- Individual spring-context, spring-core, spring-web removed -->
        <!-- BOM pulls in correct aligned versions automatically -->
    </dependencies>
</project>

What this change does:

The parent POM (spring-boot-starter-parent) inherits its dependency management from the spring-boot-dependencies BOM, which declares tested, compatible versions for the entire Spring ecosystem
Individual Spring Framework components (spring-context, spring-core, spring-web) no longer need to be declared separately; they're pulled in as transitive dependencies of the starters at the correct version
The java.version property drives compiler configuration through the parent's plugin management
All Spring component versions move in lockstep, eliminating the version drift that contributed to the CVE accumulation

The version jump from 2.0.3 to 3.2.5 is a major version upgrade. Spring Boot 3.x dropped support for Java 8 (Java 17 is the minimum), moved to the Jakarta EE namespace (jakarta.* instead of javax.*), and brought a range of breaking API changes. Those breaking changes are what make this step the most work-intensive part of the modernisation.

Step 3: The Jakarta Namespace Migration

Spring Boot 3.x moved from the javax.* namespace (Java EE) to jakarta.* (Jakarta EE). Every import in the codebase that referenced javax.servlet, javax.persistence, or similar needed updating.

In MFlix, the affected imports were primarily in the security configuration and controller layer:

Before:

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.validation.Valid;

After:

import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import jakarta.validation.Valid;

This is mechanical work rather than architectural work — find and replace across the codebase. Modern IDEs handle it automatically with a refactoring tool. The risk is missing an occurrence, which produces a compile error rather than a runtime bug, so it's catchable.
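One trap in a blanket find and replace is that not every javax.* package moved: javax.crypto and javax.sql, for example, are part of the JDK and stay as they are. The sketch below shows a rewrite restricted to an allow-list of prefixes. It is an illustration under assumptions, not the tool used for MFlix: the JakartaImportRewriter name is hypothetical, and the prefix list covers only the packages mentioned in this article.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites only javax imports that are known to have moved. */
final class JakartaImportRewriter {

    // Assumed allow-list; extend it to match the packages your own codebase uses.
    private static final List<String> MOVED = List.of("servlet", "validation", "persistence");

    // Matches "import javax.<moved>" and "import static javax.<moved>" at the start of a line.
    private static final Pattern IMPORT = Pattern.compile(
            "^(\\s*import\\s+(?:static\\s+)?)javax\\.(" + String.join("|", MOVED) + ")\\b",
            Pattern.MULTILINE);

    static String rewrite(String source) {
        return IMPORT.matcher(source).replaceAll("$1jakarta.$2");
    }

    public static void main(String[] args) {
        String before = String.join("\n",
                "import javax.servlet.http.HttpServletRequest;",
                "import javax.validation.Valid;",
                "import javax.crypto.SecretKey; // JDK package, must not change");
        System.out.println(rewrite(before));
    }
}
```

Because the pattern is anchored to import statements, string literals and comments that happen to mention javax.servlet are left alone and still need a manual check.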

Step 4: Spring Security Configuration Modernisation

The biggest code change in the modernisation was the Spring Security configuration. Spring Security 5.7 deprecated the WebSecurityConfigurerAdapter pattern that was standard in the Spring Boot 2.x era, and Spring Security 6, which ships with Spring Boot 3.x, removed it.

The 2019 pattern (deprecated, removed in Spring Boot 3.x):

@Configuration
@EnableWebSecurity
public class SecurityConfig extends WebSecurityConfigurerAdapter {

    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http
            .csrf().disable()
            .authorizeRequests()
                .antMatchers("/api/v1/movies/**").permitAll()
                .antMatchers("/api/v1/users/login").permitAll()
                .anyRequest().authenticated()
            .and()
            .sessionManagement()
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS);
    }

    @Override
    protected void configure(AuthenticationManagerBuilder auth) throws Exception {
        auth.userDetailsService(userDetailsService)
            .passwordEncoder(passwordEncoder());
    }
}

The modern pattern (Spring Boot 3.x):

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http
            .csrf(csrf -> csrf.disable())
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/v1/movies/**").permitAll()
                .requestMatchers("/api/v1/users/login").permitAll()
                .anyRequest().authenticated()
            )
            .sessionManagement(session -> session
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS)
            );

        return http.build();
    }

    @Bean
    public AuthenticationManager authenticationManager(
            AuthenticationConfiguration config) throws Exception {
        return config.getAuthenticationManager();
    }
}

The functional change is minimal — the same security rules apply. The structural change is significant: instead of extending an abstract class and overriding methods, the configuration is composed through beans with a fluent lambda-based API. The new pattern is cleaner, more testable, and aligns with Spring's component model more naturally.

Two specific API changes worth noting:

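antMatchers() → requestMatchers(): Spring Security 6 removed antMatchers(), so a call left behind fails at compile time on Spring Boot 3.x. The subtler point is behavioural: when Spring MVC is on the classpath, requestMatchers(String...) matches paths using Spring MVC's rules rather than plain Ant patterns, so each rule is worth re-testing after the rename.

authorizeRequests() → authorizeHttpRequests(): This change has security implications beyond naming. authorizeHttpRequests() uses the newer AuthorizationManager API, swaps the older FilterSecurityInterceptor for AuthorizationFilter, and in Spring Security 6 applies its rules to every dispatcher type by default, which makes behaviour more consistent across forwards, includes, and error dispatches.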
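Both the old and the new configuration rely on the same evaluation model: rules are checked in the order they are declared and the first match decides. The toy model below illustrates that ordering with the three MFlix rules. It is a sketch only: RuleOrderSketch is a hypothetical class, and its matcher is a deliberately simplified stand-in, not Spring Security's matching logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of first-match-wins rule ordering; not Spring Security's implementation. */
final class RuleOrderSketch {

    enum Access { PERMIT_ALL, AUTHENTICATED }

    // LinkedHashMap keeps declaration order, which is the order rules are checked in.
    private final Map<String, Access> rules = new LinkedHashMap<>();

    RuleOrderSketch rule(String pattern, Access access) {
        rules.put(pattern, access);
        return this;
    }

    Access decide(String path) {
        for (Map.Entry<String, Access> rule : rules.entrySet()) {
            if (matches(rule.getKey(), path)) {
                return rule.getValue(); // first match wins
            }
        }
        return Access.AUTHENTICATED; // conservative default when nothing matches
    }

    // Simplified matcher: a trailing "/**" means "this prefix and everything below it".
    private static boolean matches(String pattern, String path) {
        if (pattern.equals("/**")) {
            return true;
        }
        if (pattern.endsWith("/**")) {
            String prefix = pattern.substring(0, pattern.length() - 3);
            return path.equals(prefix) || path.startsWith(prefix + "/");
        }
        return pattern.equals(path);
    }

    /** The three rules from the MFlix configuration, with "/**" standing in for anyRequest(). */
    static RuleOrderSketch mflixRules() {
        return new RuleOrderSketch()
                .rule("/api/v1/movies/**", Access.PERMIT_ALL)
                .rule("/api/v1/users/login", Access.PERMIT_ALL)
                .rule("/**", Access.AUTHENTICATED);
    }
}
```

Declaring the catch-all rule first would make every request require authentication, which is why the catch-all belongs at the end of the chain.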

Step 5: JWT Library Migration

The jjwt library had its own breaking change to address. io.jsonwebtoken:jjwt@0.9.1 — which Snyk gave a priority score of 889 and found 58 fixable issues in — underwent a major API restructuring between 0.9.x and 0.12.x.

Before (jjwt 0.9.1 API):

String token = Jwts.builder()
    .setSubject(userId)
    .setIssuedAt(new Date())
    .setExpiration(expiration)
    .signWith(SignatureAlgorithm.HS256, secret)
    .compact();

Claims claims = Jwts.parser()
    .setSigningKey(secret)
    .parseClaimsJws(token)
    .getBody();

After (jjwt 0.12.0 API):

String token = Jwts.builder()
    .subject(userId)
    .issuedAt(new Date())
    .expiration(expiration)
    .signWith(secretKey, Jwts.SIG.HS256)
    .compact();

Claims claims = Jwts.parser()
    .verifyWith(secretKey)
    .build()
    .parseSignedClaims(token)
    .getPayload();

The key change — beyond the fluent API restructuring — is how the signing key is handled. jjwt 0.9.1 accepted a raw String as a signing key, which is a known security weakness. A short or low-entropy string could be brute-forced if an attacker obtained a token. jjwt 0.12.0 requires a proper SecretKey object, which enforces minimum key length requirements at the API level.

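// 0.9.1: accepts any string, no length validation
String legacyToken = Jwts.builder()
    .setSubject(userId)
    .signWith(SignatureAlgorithm.HS256, "weak")
    .compact();

// 0.12.0: requires a proper SecretKey; hmacShaKeyFor rejects keys shorter than 256 bits
SecretKey key = Keys.hmacShaKeyFor(
    Decoders.BASE64.decode(base64EncodedSecret)
);
String token = Jwts.builder()
    .subject(userId)
    .signWith(key, Jwts.SIG.HS256)
    .compact();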

This is an example of a library upgrade that isn't just a security patch; it's a security design improvement. The new API makes insecure code harder to write rather than merely patching a specific vulnerability.
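The key-length rule can be illustrated without jjwt at all. The sketch below uses only JDK classes to generate a 256-bit secret, reject anything shorter, and compute an HMAC-SHA256 signature. It mirrors the idea behind the library's check rather than its implementation; HmacKeySketch and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical JDK-only illustration of a minimum key length check for HS256. */
final class HmacKeySketch {

    private static final int MIN_HS256_KEY_BYTES = 32; // 256 bits

    /** Generates a random 256-bit secret, Base64-encoded so it can live in an environment variable. */
    static String newBase64Secret() {
        byte[] bytes = new byte[MIN_HS256_KEY_BYTES];
        new SecureRandom().nextBytes(bytes);
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes the secret and rejects anything shorter than 256 bits. */
    static SecretKey toKey(String base64Secret) {
        byte[] bytes = Base64.getDecoder().decode(base64Secret);
        if (bytes.length < MIN_HS256_KEY_BYTES) {
            throw new IllegalArgumentException(
                    "HS256 key must be at least 256 bits, got " + (bytes.length * 8));
        }
        return new SecretKeySpec(bytes, "HmacSHA256");
    }

    /** Signs a payload with HMAC-SHA256; the signature is always 32 bytes long. */
    static byte[] sign(SecretKey key, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            return mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key invalid", e);
        }
    }
}
```

Note that the imports here stay on javax.crypto: it is a JDK package and is not part of the Jakarta namespace move.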

What I Deliberately Kept

Not everything needed to change.

The DAO layer. The MongoDB operations in MovieDao and UserDao use the Java driver directly with proper parameterised queries. The code is correct, readable, and doesn't need to be rewritten just because the framework version changed. Unnecessary refactoring introduces risk without benefit.

The API structure. The REST endpoint design in MoviesController and UsersController is sound. RESTful conventions, appropriate HTTP status codes, clear URL structure. These don't need to change to fix security issues.

The test suite. The JUnit 5 tests — updated to junit-jupiter-api@5.10.x from 5.1.0 — largely passed after the namespace migration. Keeping the tests as close to their original form as possible gave me confidence that the modernisation hadn't changed the application's behaviour.

The guiding principle: change what needs to change for security and maintainability. Don't rewrite what doesn't need to be rewritten.

The Modernisation Diff — What Actually Changed

Summarising the changes as a before/after:

Area	Before	After
Java version	1.8 (Java 8)	17
Spring Boot	2.0.3–2.0.4 (manually pinned)	3.2.5 (BOM managed)
Spring Framework	5.0.7 (manually pinned)	6.1.x (BOM managed)
jjwt	0.9.1	0.12.0
Security config	`WebSecurityConfigurerAdapter`	`SecurityFilterChain` bean
Namespace	`javax.*`	`jakarta.*`
Secrets	Hardcoded in properties file	Environment variables
Dependency versions	8 manually pinned	BOM managed

What the Modernisation Did to the Snyk Results

Running Snyk after the modernisation — before doing any targeted vulnerability remediation — already moved the numbers significantly.

The BOM upgrade to Spring Boot 3.2.5 automatically resolved a large portion of the Spring-related CVEs because the BOM pulled in patched versions of Spring Framework, Tomcat, and Jackson. The jjwt upgrade to 0.12.0 cleared all 58 of its fixable issues in a single version change.

I'll show the full before/after numbers in article 6. For now, the preview: the modernisation alone — without any targeted CVE remediation — reduced the total finding count substantially. This illustrates an important point about legacy Java dependency management: often the most effective security intervention isn't patching individual CVEs, it's getting the project onto a supported version of its primary framework and letting the framework's dependency management do the heavy lifting.

The modernised repository is at github.com/pgmpofu/mflix on the modernised branch.

Next up: the full unfiltered Snyk results — a deeper look at the most significant findings, what each vulnerability actually enables an attacker to do, and why the RCE in spring-beans deserved the attention it got.

I Dusted Off a 6-Year-Old Java Project and Ran Snyk Against It — Here's What I Found

Patience Mpofu — Mon, 18 May 2026 14:24:28 +0000

The README said "implementing security best practices."

That line has been sitting in the pgmpofu/mflix repository since 2019. A MongoDB-backed movie browsing application with user registration, authentication, JWT-based sessions, and full CRUD operations. Built as part of a MongoDB University course. Described, in my own words, as implementing security best practices.

I ran Snyk against it last week.

188 vulnerabilities. 10 Critical. 99 High. 59 Medium. 20 Low.

Every single one fixable. None of them there when I wrote that README.

This is the first article in a series about what happens when you apply modern software composition analysis to a Java project that hasn't been touched in six years — what Snyk found, what I fixed, what I chose not to fix, and what the before-and-after security posture actually looks like in measurable terms.

What the Project Is

MFlix is a Spring Boot Java application backed by MongoDB. The core functionality:

Movie search — basic and complex queries against a MongoDB collection
User registration and authentication
JWT-based session management using io.jsonwebtoken:jjwt
Comment posting on movie entries
Analytical reporting against the movie dataset
Spring Security for access control

It's not a toy. It has real authentication flows, real database operations, and real dependency complexity. The pom.xml has eight direct dependencies spanning Spring Boot, Spring Security, the MongoDB Java driver, and the JWT library.

That dependency tree is where the story starts.

The Dependency Snapshot — What Was In the pom.xml

Before running anything, here's exactly what the project declared as of the last commit in 2019:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <version>2.0.3.RELEASE</version>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
    <version>2.0.4.RELEASE</version>
</dependency>

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot</artifactId>
    <version>2.0.4.RELEASE</version>
</dependency>

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-context</artifactId>
    <version>5.0.7.RELEASE</version>
</dependency>

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>5.0.7.RELEASE</version>
</dependency>

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-web</artifactId>
    <version>5.0.7.RELEASE</version>
</dependency>

<dependency>
    <groupId>io.jsonwebtoken</groupId>
    <artifactId>jjwt</artifactId>
    <version>0.9.1</version>
</dependency>

<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>3.9.1</version>
</dependency>

Eight direct dependencies. All declared at versions that were current in mid-2018 to mid-2019. All untouched since.
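The version drift in that list is easy to miss by eye. As a rough sketch, the check below groups the eight declared coordinates by groupId and reports any group that appears at more than one version. Grouping by groupId is a simplifying assumption (not every group releases in lockstep), and VersionDrift is a hypothetical class rather than anything Snyk or Maven provides.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical drift check over "groupId:artifactId:version" coordinates. */
final class VersionDrift {

    // The eight direct dependencies exactly as declared in the 2019 pom.xml.
    static final List<String> MFLIX_2019 = List.of(
            "org.springframework.boot:spring-boot-starter-web:2.0.3.RELEASE",
            "org.springframework.boot:spring-boot-starter-security:2.0.4.RELEASE",
            "org.springframework.boot:spring-boot:2.0.4.RELEASE",
            "org.springframework:spring-context:5.0.7.RELEASE",
            "org.springframework:spring-core:5.0.7.RELEASE",
            "org.springframework:spring-web:5.0.7.RELEASE",
            "io.jsonwebtoken:jjwt:0.9.1",
            "org.mongodb:mongodb-driver-sync:3.9.1");

    /** Returns only the groups that are declared at two or more distinct versions. */
    static Map<String, Set<String>> driftingGroups(List<String> coordinates) {
        Map<String, Set<String>> versionsByGroup = new TreeMap<>();
        for (String coordinate : coordinates) {
            String[] parts = coordinate.split(":");
            versionsByGroup.computeIfAbsent(parts[0], g -> new TreeSet<>()).add(parts[2]);
        }
        versionsByGroup.values().removeIf(versions -> versions.size() < 2);
        return versionsByGroup;
    }

    public static void main(String[] args) {
        System.out.println(driftingGroups(MFLIX_2019));
    }
}
```

For the list above this prints {org.springframework.boot=[2.0.3.RELEASE, 2.0.4.RELEASE]}: the Spring Framework artifacts agree with each other, but the Spring Boot artifacts do not.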

The Java compiler target is 1.8 — Java 8, which reached end of life for free Oracle support in January 2019. The project has been running on a deprecated runtime configuration since approximately the month it was committed.

Running the Scan

Setup is straightforward. With Snyk installed and authenticated:

cd mflix
snyk test --all-projects

Snyk reads the pom.xml, resolves the full dependency tree including transitive dependencies, and cross-references every package version against its vulnerability database.

The result appeared in seconds. The dashboard view told the story immediately:

10 Critical · 99 High · 59 Medium · 20 Low

Total: 188 fixable vulnerabilities. Zero with no supported fix.

That last number — zero unfixable — is actually significant. Every single vulnerability Snyk found has a known fix available. I'll come back to what that means for the remediation strategy.

The Finding That Stopped Me Cold

Before getting into the full breakdown, one finding deserves immediate attention.

org.springframework:spring-context@5.0.7.RELEASE — Remote Code Execution. CVSS 9.8. Priority Score 919.

The vulnerability is in spring-beans@5.0.7.RELEASE via CWE-94 — improper control of code generation. An attacker who can reach the application can, under certain conditions, execute arbitrary code on the server.

CVSS 9.8. That's not a theoretical risk category. That's 0.2 below the maximum possible score of 10.0. That's the kind of finding that, in a production system, triggers an emergency change control process at 11pm on a Friday.

MFlix was never deployed to production. The attack surface was always zero. But the finding is real — the vulnerability exists in the version of spring-beans that ships with spring-context@5.0.7, and it would be exploitable if the application were running and accessible.

This is the finding that "security best practices" in the README was supposed to prevent. It didn't — not because of negligence, but because security best practices in 2019 didn't include the CVE that was disclosed after the code was written.

The Full Breakdown by Dependency

Here's what Snyk found grouped by the dependency it originated from, with priority scores and headline vulnerability types:

Critical Severity (Priority Score 919)

org.springframework:spring-web@5.0.7.RELEASE
11 direct issues, 8 transitive issues. The highest-priority dependency in the scan.

Remote Code Execution: CWE-94, CVSS 9.8 (via spring-beans)
Privilege Escalation: CWE-264, CVSS 4.4
Improper Input Validation: CWE-20, CVSS 8.6
Reflected File Download: CWE-494, CVSS 8.0
Denial of Service: CWE-400, CVSS 3.7

org.springframework:spring-context@5.0.7.RELEASE
2 direct issues, 11 transitive issues.

Remote Code Execution: CWE-94, CVSS 9.8 (via spring-beans)
Relative Path Traversal: CWE-23, CVSS 8.2
Incorrect Authorization: CWE-863, CVSS 8.7

org.springframework.boot:spring-boot-starter-security@2.0.4.RELEASE
37 transitive issues.

Authorization Bypass: CWE-285, CVSS 8.2
Reflected File Download: CWE-494, CVSS 8.0
Improper Input Validation: CWE-20, CVSS 8.6

org.springframework.boot:spring-boot-starter-web@2.0.3.RELEASE
206 transitive issues, making it the single dependency pulling in the most downstream vulnerabilities.

Insecure Defaults via tomcat-embed-core@8.5.31: CWE-453, CVSS 9.8
Deserialization of Untrusted Data via jackson-databind@2.9.6: CWE-502, CVSS 9.2
Session Fixation via tomcat-embed-core: CWE-384, CVSS 3.1
Cross-Site Scripting via tomcat-embed-core: CWE-79, CVSS 3.5

io.jsonwebtoken:jjwt@0.9.1 (Priority Score 889)
63 transitive issues, 58 fixable. All fixed in a single upgrade to version 0.12.0.

Deserialization of Untrusted Data via jackson-databind@2.9.6: CWE-502, CVSS 9.2
Allocation of Resources Without Limits via jackson-core: CWE-770, CVSS 8.x

Critical: Certificate Validation

org.springframework.boot:spring-boot-autoconfigure@2.0.3.RELEASE

Improper Validation of Certificate with Host Mismatch: CWE-297, CVSS 9.3

This one matters specifically because MFlix handles user authentication. A TLS certificate validation bypass in the autoconfigure layer means a connection claiming to be a trusted service could potentially be impersonated without detection.

High Severity (Priority Score 649)

org.springframework.boot:spring-boot@2.0.4.RELEASE
4 direct issues, 7 transitive.

Insecure Temporary File: CWE-377, CVSS 7.8
Incorrect Authorization: CWE-863, CVSS 8.7

org.springframework:spring-core@5.0.7.RELEASE
5 direct issues.

Incorrect Authorization: CWE-863, CVSS 8.7
Improper Case Sensitivity Handling: CWE-178, CVSS 2.3

Medium Severity

org.mongodb:mongodb-driver-sync@3.9.1
1 direct issue.

Man-in-the-Middle: CWE-300, CVSS 6.4

The MongoDB driver finding is particularly relevant for this application. MFlix connects to a MongoDB Atlas cluster. A MitM vulnerability in the driver layer means the connection between the application and the database could potentially be intercepted.

The Transitive Dependency Problem

The number that tells the real story of legacy Java dependency management is this one:

spring-boot-starter-web@2.0.3 — 206 transitive issues.

One line in the pom.xml. One version declaration. 206 downstream vulnerabilities pulled in through the dependency chain, in packages the application never explicitly imported and whose version numbers never appeared in the build file.

This is the fundamental challenge of Software Composition Analysis in Java projects. The pom.xml has eight dependencies. The actual dependency graph has dozens of packages. Each of those packages has its own version, its own CVE history, its own patch cadence.

A developer in 2018 who wrote spring-boot-starter-web@2.0.3 made a single decision. That decision pulled in a specific version of Tomcat, a specific version of Jackson, a specific version of Spring's core libraries — all of which have accumulated vulnerabilities over six years. None of those vulnerabilities were visible at the point of the original decision.

This is why dependency scanning exists. The alternative — manually tracking CVE disclosures across every transitive dependency in your build graph — is not a realistic approach for any team at any scale.
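To make the "one line, many packages" point concrete, here is a small sketch that walks a dependency graph breadth-first and collects everything reachable from a single declared dependency. The edge list is a simplified, partial picture of what spring-boot-starter-web pulls in, written out by hand for illustration; the real graph is larger, and TransitiveClosure is a hypothetical class, not a Maven or Snyk API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration: everything reachable from one declared dependency. */
final class TransitiveClosure {

    // Partial, hand-written edges for illustration only; versions are omitted on purpose.
    static final Map<String, List<String>> EDGES = Map.of(
            "spring-boot-starter-web", List.of(
                    "spring-boot-starter-tomcat", "spring-boot-starter-json",
                    "spring-web", "spring-webmvc"),
            "spring-boot-starter-tomcat", List.of("tomcat-embed-core"),
            "spring-boot-starter-json", List.of("jackson-databind"),
            "jackson-databind", List.of("jackson-core", "jackson-annotations"),
            "spring-webmvc", List.of("spring-web", "spring-context", "spring-beans"),
            "spring-web", List.of("spring-core", "spring-beans"),
            "spring-context", List.of("spring-beans", "spring-core"));

    /** Breadth-first walk; each package is visited once even if several paths lead to it. */
    static Set<String> transitiveOf(String root) {
        Set<String> seen = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(root));
        while (!queue.isEmpty()) {
            for (String child : EDGES.getOrDefault(queue.poll(), List.of())) {
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Set<String> pulledIn = transitiveOf("spring-boot-starter-web");
        System.out.println(pulledIn.size() + " packages from one declaration: " + pulledIn);
    }
}
```

Even in this cut-down graph, one declaration reaches eleven packages, including tomcat-embed-core, jackson-databind, and spring-beans, none of which appear in the build file.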

The Exploit Maturity Breakdown

Not all 188 findings carry the same actual risk. Snyk's exploit maturity classification tells a more nuanced story:

Maturity Level	Count
Mature exploits (working exploit code exists)	8
Proof-of-concept exploits	43
No known exploit	137

Eight findings have working, publicly available exploit code. These are the ones that require the most urgent remediation — not because the vulnerability is necessarily more severe than others, but because the barrier to exploitation is effectively zero. Anyone with the exploit code and network access to the application could use it.

The 137 findings with no known exploit are real vulnerabilities — they're in the CVE database, they have CVSS scores, Snyk flags them correctly — but the practical attack risk is lower because exploitation requires custom effort rather than running a public tool.

This breakdown becomes critical for article 4, where I'll explain which findings I remediated, which I suppressed, and why the exploit maturity classification was the primary factor in that decision.
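Ordering findings by exploit maturity first and CVSS second can be expressed as a two-level comparator. The sketch below is one way to do it, assuming Java 17; FindingTriage and its types are hypothetical names, and the maturity levels assigned in the sample data are made up for illustration because the scan summary here does not say which finding falls into which level.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical triage ordering: exploit maturity first, CVSS score second. */
final class FindingTriage {

    // Declared most urgent first, so the enum's natural order is the triage order.
    enum Maturity { MATURE, PROOF_OF_CONCEPT, NO_KNOWN_EXPLOIT }

    record Finding(String title, double cvss, Maturity maturity) {}

    static final Comparator<Finding> TRIAGE_ORDER =
            Comparator.comparing(Finding::maturity)
                    .thenComparing(Comparator.comparingDouble(Finding::cvss).reversed());

    static List<Finding> prioritise(List<Finding> findings) {
        return findings.stream().sorted(TRIAGE_ORDER).toList();
    }

    public static void main(String[] args) {
        // Titles and scores come from the breakdown above; maturity values are illustrative only.
        List<Finding> sample = List.of(
                new Finding("Remote Code Execution", 9.8, Maturity.PROOF_OF_CONCEPT),
                new Finding("Denial of Service", 3.7, Maturity.MATURE),
                new Finding("Man-in-the-Middle", 6.4, Maturity.NO_KNOWN_EXPLOIT));
        prioritise(sample).forEach(System.out::println);
    }
}
```

With this ordering a low-scoring finding with a mature exploit sorts ahead of a high-scoring one with no known exploit, which is the trade-off the maturity breakdown argues for.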

What "100% Fixable" Actually Means

One number that surprised me when I first saw the Snyk output: 188 fixable, 0 with no supported fix.

That's unusually clean. In my experience scanning production Java codebases at work, there are almost always some findings in the "no supported fix" category — typically because a vulnerable package has no patched version available, or because the fix requires changes that would break the API the application depends on.

MFlix having 100% fixable findings is partly a function of how old the dependencies are. Six years is a long time. Every major vulnerability that exists in Spring Boot 2.0.x, Spring 5.0.x, and jjwt 0.9.1 has had years to be patched in subsequent versions. The fix exists — I just have to apply it.

The question is how straightforward "applying the fix" actually is. Some of these are minor version bumps. Others are major version upgrades — Snyk recommends going from spring-boot-starter-web@2.0.3 to 2.0.6 for some fixes and all the way to 3.x for others. Major version upgrades in Spring Boot involve breaking API changes that require code modifications, not just version bumps.

That complexity is what articles 4 and 5 are about. The 100% fixability rate is the theoretical ceiling. The practical remediation story is considerably more interesting.

The Irony of the README

I want to return to that line. "Implementing security best practices."

It was accurate in 2019. I used parameterised queries for MongoDB operations. I implemented JWT authentication correctly for the era. I used Spring Security for access control. I followed the MongoDB University course's security guidance.

None of that is what Snyk found. What Snyk found is a different category of security problem entirely — not vulnerabilities in the code I wrote, but vulnerabilities in the dependencies I imported. The distinction matters because it reveals a gap in how most developers think about application security.

When developers think about writing secure code, they think about SQL injection, authentication flows, authorisation checks, input validation. These are code-level concerns. A skilled developer can learn them, apply them consistently, and write code that is largely free of them.

Software composition vulnerabilities are different. They accumulate silently. They appear in packages you didn't write and may never read. They arrive via CVE disclosures months or years after you made your dependency choices. They require an ongoing process — not a one-time skill — to manage.

The "security best practices" that prevent code-level vulnerabilities are largely different from the "security best practices" that prevent composition vulnerabilities. Both matter. Until I ran this scan, I was only thinking about one of them.

What Comes Next

Over the next six articles in this series, I'll document:

Article 2 — Modernising the project structure: wrapping the legacy code in a current Spring Boot shell, what changed, and what had to change before Snyk could even give me a useful remediation path.

Article 3 — The full unfiltered Snyk results: a deeper dive into the most significant findings with full CVE context, what each vulnerability actually enables an attacker to do, and which ones matter most for an application with user authentication and database access.

Article 4 — Why I suppressed some findings and fixed others: the risk assessment framework, the role of exploit maturity in prioritisation, and the findings I made a documented decision to accept.

Article 5 — The remediation work itself: the easy version bumps, the breaking changes that required code modification, and the one dependency upgrade that took four attempts to get right.

Article 6 — Before and after metrics: what changed, how to measure security posture improvement, and what the numbers look like when you present them to an engineering team.

Article 7 — Using AI to assist with remediation: what worked, what didn't, and the difference between AI suggestions and Snyk recommendations on the same vulnerabilities.

The repository is at github.com/pgmpofu/mflix. The pom.xml in the current state is exactly as it was in 2019. The Snyk findings are real. The remediation work is ongoing.

Next article: modernising the project — what a 2019 Spring Boot structure looks like compared to what a current one should look like, and what had to change before the security work could begin in earnest.

I Ran My ML Secrets Detector Against My Own Repositories — Here's What It Found

Patience Mpofu — Sat, 16 May 2026 03:00:54 +0000

There's a moment every security tool builder eventually faces.

You've built the scanner. You've written the rules. You've validated it against synthetic test cases and contrived examples. And then you point it at your own code — the repositories you've actually written, committed, and pushed over years of real development work.

That moment is humbling.

I ran my ML secrets detector against every personal repository I own — 11 repositories across Python, Java, Node.js, and Kotlin projects accumulated over several years of portfolio building and side projects. I'm documenting the results honestly: what it found, what was real, what was a false positive, and what the numbers actually looked like.

The Setup

Before running, I configured the scan for comprehensive coverage:

# Full repository scan including git history
python main.py scan ./repos/ \
  --include-history \
  --threshold 0.65 \
  --format all \
  --output ./scan-results/

I set the threshold to 0.65 rather than the default 0.70 because I wanted to see more findings, including ones that would normally sit just below the reporting threshold. For an audit of your own code, more signal is better than less.

The --include-history flag scans not just the current working tree but every commit in git history. This is the mode that makes people nervous. Whatever got committed and "fixed" later is still in the history. It's still accessible. It still needs to be addressed.

Repositories scanned: 11

Total commits scanned: 847

Total files scanned: 2,341

Scan duration: 4 minutes 23 seconds

The Raw Numbers

Severity	Findings	Confirmed Real	False Positives	False Positive Rate
CRITICAL	7	6	1	14%
HIGH	19	11	8	42%
MEDIUM	31	9	22	71%
Total	57	26	31	54%

A few things to unpack here.

The CRITICAL findings had a 14% false positive rate — one in seven was benign. That's roughly what I expected based on the test set results. The one false positive was a 32-character hex string in a variable named encryption_mode — the word "encryption" pushed the key name score high, but the value was actually a configuration mode identifier, not a key.

The HIGH findings had a 42% false positive rate. Higher than I'd like, but consistent with the nature of HIGH confidence findings — they're cases where the evidence is strong but not overwhelming. Most of the false positives in this tier were package integrity hashes in older package-lock.json files that hadn't been added to the skip list yet.

The MEDIUM findings had a 71% false positive rate. This is expected and by design. MEDIUM findings are prompts for human review, not automatic defects. Most were generic high-entropy strings in configuration files where the variable names were moderately suspicious but the values were benign.

The overall 54% false positive rate sounds alarming until you account for the lower threshold (0.65 vs. default 0.70) and the MEDIUM tier. At the default threshold, the false positive rate drops to approximately 28% — closer to the test set results.
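The percentages in the table follow directly from the counts. A minimal sketch of the arithmetic, using only the numbers reported above (the dictionary layout and the function name are mine, not part of the tool):

```python
# Counts copied from the table above: (findings, confirmed real)
TIERS = {
    "CRITICAL": (7, 6),
    "HIGH": (19, 11),
    "MEDIUM": (31, 9),
}


def false_positive_rate(findings: int, confirmed: int) -> float:
    """Share of findings that turned out to be benign."""
    return (findings - confirmed) / findings


for tier, (findings, confirmed) in TIERS.items():
    print(f"{tier}: {false_positive_rate(findings, confirmed):.0%}")

total_findings = sum(f for f, _ in TIERS.values())
total_confirmed = sum(c for _, c in TIERS.values())
print(f"Total: {false_positive_rate(total_findings, total_confirmed):.0%}")
```

The same function can be pointed at the counts from a default-threshold run to compare the two settings side by side.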

The Real Findings: What Was Actually There

Of the 26 confirmed real findings, here's what they were. I've anonymised the specific values but documented the pattern honestly.

Finding 1–3: Test Credentials That Never Left Test Files (But Were Still Committed)

Three findings were test database credentials in integration test configuration files:

# tests/integration/test_database.py (2021 commit)
TEST_DB_PASSWORD = "integration_test_password_2021"
TEST_DB_URL = "postgresql://testuser:local_test_pass@localhost/testdb"

These were intentionally "fake" credentials — values I created specifically for local testing. But they were committed to a public repository. The classifier flagged them at 87% and 91% confidence respectively.

Are these real vulnerabilities? Technically no — a local test database password with no external access isn't a secret in the traditional sense. But they taught me something: even intentional test credentials get flagged, which means either the suppression annotation should have been there from the start, or the test configuration should have used environment variables even for local test values.

The lesson isn't that the scanner was wrong. It's that "this is only for testing" is not a reason to skip secure credential handling.

Finding 4: An Actual JWT Secret (History)

This one made my stomach drop.

CRITICAL (97%) · src/auth/config.py:23
jwt_secret = "my-jwt-signing-secret-change-this"
↳ History: commit a3f8b2c · 2020-03-14

I found a hardcoded JWT signing secret in a 2020 commit to a project that I had since "fixed" by moving to environment variables. The fix was in the current code. The secret was still in git history.

The value itself, "my-jwt-signing-secret-change-this", is one of those values that developers write with the intention of replacing it before going anywhere near production. The reminder is literally in the value. But it got committed, and committed things live in git history forever unless you rewrite it.

The project was never deployed to production with this value. But it was a public repository. Anyone who cloned it at any point in 2020 has this value. The theoretical attack surface was real even if the practical exploitation probability was low.

What I did: Rewrote the commit history using git filter-branch to remove the file containing the secret, then force-pushed. I also added a .gitignore entry for config.py files and a pre-commit hook (obviously) to catch this pattern in future.

Finding 5–8: API Keys in Old Test Scripts

Four findings were API keys in utility scripts I'd written to test integrations:

# scripts/test_sendgrid.py (2019 commit)
SENDGRID_API_KEY = "SG.abc123...xyz789"  # key has been rotated

These were real API keys at the time of commit. I confirmed with the respective providers that all four had been rotated or the accounts had been closed — so the operational risk was zero. But they were real keys that were real secrets when committed.

This is the most common pattern in real credential exposure incidents: keys that were live at the time of commit, rotated after discovery, but remain in history as evidence of the exposure. The key rotation closes the operational risk but doesn't erase the fact of the exposure.

What I did: Rotated anything still active (none were), documented the historical exposure, and rewrote history for the two repositories where the keys were in active-looking scripts. For older repositories where the scripts were clearly abandoned, I left the history intact and noted the exposure in the repository README.

Finding 9–11: Internal Service URLs With Embedded Credentials

Three findings were database and service connection strings:

# config/database.py (2022 commit)
DATABASE_URL = "postgresql://admin:password123@internal-host:5432/appdb"

None of these were production credentials — they were development environment connection strings pointing to local or development hosts. But the pattern is exactly what you see in production credential exposures, and the scanner correctly identified them as high confidence.

Two were for hosts that no longer exist. One was for a development Postgres instance that still exists but has no external network access. The operational risk was low; the pattern risk was real.

Finding 12: A Private Key Fragment in a README

The most surprising finding:

CRITICAL (99%) · README.md:47
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA...

A README containing an example private key that I'd generated specifically to demonstrate what a private key looks like in documentation. It was a real RSA private key — not a truncated fake — but generated purely for documentation purposes and never associated with any system.

The scanner correctly flagged it. The private key has never been used for anything. But it's a valid RSA private key that anyone could theoretically use to claim they found something in my repository.

What I did: Replaced the real private key in the README with a clearly truncated fake:

-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA[EXAMPLE - NOT A REAL KEY]...
-----END RSA PRIVATE KEY-----

If you're writing documentation that shows what a private key looks like, never use a real generated key. Generate a fake-looking placeholder instead.

Findings 13–26: Various Confirmed Vulnerabilities

The remaining 14 confirmed findings were a mix of:

Hardcoded passwords in older Java projects using Spring with properties files committed directly
OAuth client secrets in mobile app prototype code from 2018–2019
Slack webhook URLs (which are effectively secrets — anyone with the URL can post to your channel)
Internal service tokens from a project that has since been decommissioned

All were historical, and all have been rotated or decommissioned. All are now either suppressed with justification or removed from history.

The False Positives: What Triggered Them

The 31 false positives clustered into four categories:

Category 1: Package Lock File Hashes (12 findings)

The most numerous false positive source. package-lock.json files contain SHA-512 integrity hashes for every dependency:

"integrity": "sha512-abc123def456..."

These are high-entropy strings in a file that often has keys named integrity. The key name risk for "integrity" is 0.0 in my vocabulary, which should push these below threshold — and at the default 0.70 threshold, most don't appear. At 0.65, several edge cases squeaked through.

Fix: Added package-lock.json, yarn.lock, and *.lock to the global skip list.
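A skip list like this can be expressed as a handful of glob patterns matched against the file name. The sketch below is an assumption about how such a check might look, not the tool's actual code; the function name, the constant, and the choice of `fnmatchcase` are all mine:

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns taken from the fix described above; teams can extend the list
SKIP_PATTERNS = ["package-lock.json", "yarn.lock", "*.lock"]


def should_skip_file(filepath: str, patterns: list[str] = SKIP_PATTERNS) -> bool:
    """Return True when the file's base name matches any skip pattern."""
    name = PurePosixPath(filepath).name
    # fnmatchcase keeps matching case-sensitive on every platform
    return any(fnmatchcase(name, pattern) for pattern in patterns)


print(should_skip_file("web/package-lock.json"))
print(should_skip_file("src/app.py"))
```

Matching on the base name rather than the full path means a lock file is skipped wherever it sits in the tree.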

Category 2: UUID Values With Moderately Sensitive Variable Names (8 findings)

session_token = "550e8400-e29b-41d4-a716-446655440000"
auth_correlation_id = "7c9b2de1-3f4a-8b5c-2d1e-9f8a7b6c5d4e"

"Session token" and "auth correlation ID" score moderately high on key name risk. UUIDs have moderate entropy. The combination pushed these above 0.65.

Fix: Added correlation_id, session_id, request_id, and similar terms to the explicitly benign vocabulary with a score of 0.0.
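The "moderate entropy" of a UUID can be made concrete. A canonical UUID string draws from at most 17 distinct characters (16 hex digits plus the hyphen), so its per-character Shannon entropy cannot exceed log2(17), roughly 4.09 bits. The sketch below computes character-frequency entropy; it is an assumption about how the entropy feature might be calculated, not the tool's exact implementation:

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Bits of entropy per character, estimated from character frequencies."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Sample values taken from the examples earlier in this article
for sample in (
    "integration_test_password_2021",
    "550e8400-e29b-41d4-a716-446655440000",
):
    print(f"{shannon_entropy(sample):.2f}  {sample}")
```

Because the ceiling for a UUID is fixed by its alphabet, entropy alone cannot separate it from a genuinely random secret of similar length, which is why the variable-name vocabulary ends up carrying the decision.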

Category 3: Example Values in Documentation (7 findings)

Markdown files and READMEs containing example code snippets:

Set your API key:

API_KEY = "your-api-key-here"

"your-api-key-here" is low entropy and obviously a placeholder. The scanner correctly passes it. But other examples used more realistic-looking values:

API_KEY = "aK9mP2xL8vR3qT7nY5wZ1bJ4cH6dF0eI"

The variable name is high risk, the entropy is high, and no pattern matches — 78% confidence. False positive, but an understandable one.

Fix: Moved .md and .rst files to a stricter mode (threshold raised to 0.90 for documentation files) rather than skipping them entirely, because real secrets do appear in committed documentation.
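A per-file-type threshold can be resolved with a small lookup on the file extension. This is a sketch under assumptions: the constant names and the function are hypothetical, and only the 0.70 default and the 0.90 documentation threshold come from the article:

```python
from pathlib import PurePosixPath

DEFAULT_THRESHOLD = 0.70        # default reporting threshold
DOC_THRESHOLD = 0.90            # stricter threshold for documentation
DOC_SUFFIXES = {".md", ".rst"}


def threshold_for(filepath: str, default: float = DEFAULT_THRESHOLD) -> float:
    """Pick the confidence threshold to apply to a given file."""
    suffix = PurePosixPath(filepath).suffix.lower()
    if suffix in DOC_SUFFIXES:
        # Never loosen a threshold the caller has already set higher
        return max(default, DOC_THRESHOLD)
    return default


print(threshold_for("README.md"))
print(threshold_for("src/app.py"))
```

Using `max` keeps documentation files at least as strict as everything else, even when a caller passes an unusually high default.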

Category 4: High-Entropy Configuration Values (4 findings)

Configuration values that are long and random-looking but aren't secrets:

CACHE_KEY_PREFIX = "app_v2_prod_cache_2024_r3f8b2"
CORRELATION_HEADER = "X-Request-ID-v2-production-shard-3"

These are deterministic, human-readable configuration values that happen to be long and contain alphanumeric characters. Low false positive risk in most codebases but they appeared in mine.

Fix: These are the hardest category to address systematically. The suppression annotation is the right tool — add # secrets-ignore with a note that the value is a configuration constant.

What the History Scan Revealed That the Current Scan Didn't

Scanning history found 9 findings that don't appear in the current codebase — secrets that have been "fixed" but remain in git history. This is the most important capability of the history scanner and the most overlooked.

Findings in current code: 17
Findings only in history: 9
Total unique findings: 26

The 9 historical-only findings represent credentials that a developer committed, noticed (or was told about), and removed from the current code — but never removed from history. From a security perspective, these are live exposures. The credential exists in a public repository's history. Anyone who cloned the repository at any point has it.
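Separating "still in the code" from "only in history" is a set difference once each finding has a stable identifier. How the tool deduplicates findings is not described here, so the fingerprint scheme below is purely an assumption, and the sample values are placeholders:

```python
import hashlib


def fingerprint(key_name: str, value: str) -> str:
    """Stable ID for a finding that avoids storing the raw value."""
    digest = hashlib.sha256(f"{key_name}={value}".encode("utf-8"))
    return digest.hexdigest()[:16]


def history_only(current: set[str], all_commits: set[str]) -> set[str]:
    """Fingerprints seen somewhere in history but absent from the working tree."""
    return all_commits - current


# Placeholder data: one finding still present, one removed in a later commit
current = {fingerprint("DATABASE_URL", "example-a")}
everything = current | {fingerprint("jwt_secret", "example-b")}
print(len(history_only(current, everything)))
```

Leaving the file path out of the fingerprint means a secret that was moved between files is still counted once.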

The remediation for historical findings is harder than current findings:

Option 1: Rotate the credential. If the credential is still active, rotate it immediately. The historical exposure is already done — rotation closes the operational risk.

Option 2: Rewrite git history. Using git filter-branch or the newer git filter-repo, you can rewrite history to remove the file or commit containing the secret. This requires force-pushing, which is disruptive if other people have cloned the repository.

Option 3: Make the repository private. If the repository is public and the historical exposure is significant, making it private while history is cleaned up is a reasonable interim step.

Option 4: Document and accept. For decommissioned systems and rotated credentials with no active risk, documenting the historical exposure in the repository README and marking the findings as suppressed is acceptable. Not ideal, but pragmatic for old secrets with no active attack surface.

The Honest Assessment

Running the scanner against my own repositories was a genuinely useful exercise that I'd recommend to anyone building security tooling.

What worked well:

CRITICAL findings were high precision — 6 out of 7 were real
The history scanner found things I'd genuinely forgotten about
The scan was fast enough that 11 repositories in 4 minutes felt reasonable
The output was actionable: I knew exactly what to fix and where

What needs improvement:

The HIGH finding false positive rate of 42% is too high for a production tool targeting real organisations. It would erode trust in a team context
The package-lock.json skip list should have been in place from the start. That's a known false positive source that I didn't anticipate fully
The threshold calibration needs work: 0.70 feels too conservative for CRITICAL findings and not conservative enough for HIGH findings

The finding that most surprised me: The JWT secret in history. Not because finding it surprised me, since that's exactly what the history scanner is for, but because I had genuinely forgotten it was there. I "fixed" the issue in 2020 by moving to environment variables and closed the mental file. The history scanner reopened it.

That's the value proposition of history scanning in one sentence: it finds the things you fixed but didn't actually fix.

What to Do If You Want to Run This Against Your Own Repos

Start with current code only, at the default threshold:

python main.py scan ./your-repo --threshold 0.70 --format terminal

Triage every CRITICAL finding before looking at anything else. Then work through HIGH. Treat MEDIUM as informational unless something catches your eye.

Once you've cleaned up the current state, run the history scan:

python main.py scan ./your-repo --include-history --threshold 0.70

Be prepared for findings you've forgotten about. Have a decision framework ready for each one: rotate, rewrite history, or document and accept.

The scan itself is the easy part. The remediation decisions are where the real work is.

The full tool, including the history scanner and all configuration options, is at github.com/pgmpofu/secrets-detector.

If you run it against your own repositories and find something interesting — or find a false positive pattern I haven't handled — open an issue. The tool gets better from real-world feedback, and real-world feedback only comes from people running it on real code.

Blocking Secrets Before They Hit the Repository: Building a Pre-Commit Hook With ML

Patience Mpofu — Sat, 16 May 2026 02:54:18 +0000

There are two places you can catch an exposed secret.

After it's in the repository — in a CI/CD pipeline scan, a periodic audit, or a breach notification from a security researcher who found it in your public history. Or before it ever gets there — at the moment of git commit, when the developer is still at their keyboard and the fix takes thirty seconds.

The second option is better in every dimension. Earlier detection means lower remediation cost. A blocked commit means no credential rotation required, no incident response, no git history rewriting. The developer who gets stopped at commit understands immediately what they did and why — the context is fresh, the fix is obvious.

The challenge is UX.

A pre-commit hook that's too slow gets disabled. A hook that generates too many false positives gets disabled. A hook that doesn't explain itself gets disabled and complained about on Slack. A hook that developers trust — that's fast, precise, and tells them exactly what it found and why — stays enabled and actually prevents exposures.

This article is about building a pre-commit hook that developers will actually leave on.

What the Hook Needs to Do

Before writing a line of code, I defined what a good pre-commit secrets hook looks like from the developer's perspective.

Speed. The hook runs on every commit. If it adds more than two or three seconds, developers will notice and resent it. On a typical feature branch with a handful of changed files, the scan needs to complete in under two seconds.

Scope. The hook should scan staged content — only the files about to be committed — not the entire repository. Scanning everything on every commit is unnecessary and slow.

Signal clarity. When the hook blocks a commit, the developer needs to know immediately: which file, which line, what variable, why it was flagged. "Secret detected" with no context is useless. "HIGH confidence (94%): api_key = "sk-proj-abc123..." in config/settings.py line 47 — matches OpenAI key format" is actionable.

Suppression path. Developers need a documented, low-friction way to handle false positives. The hook can't be a hard wall with no escape — that's how hooks get disabled entirely.

Non-destructive. The hook never modifies files. It either passes silently or blocks and explains. That's it.

Architecture: Scanning Staged Content

The first architectural decision is what to scan. There are two options:

Option A: Scan the working tree — the files as they currently exist on disk, including unstaged changes.

Option B: Scan the staged content — exactly what git diff --cached shows, which is what will actually be committed.

Option B is correct. Scanning the working tree means flagging things the developer hasn't committed and may never intend to commit. That's noise. Scanning staged content means flagging exactly what's about to enter the repository — which is the precise intervention point.

import subprocess


def get_staged_content() -> dict[str, str]:
    """Get the staged content for all modified/added files."""
    staged_files = {}

    # Get list of staged files
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True
    )

    filenames = result.stdout.strip().split('\n')

    for filename in filenames:
        if not filename:
            continue

        # Get staged content (not working tree content);
        # ":<path>" asks git for the version stored in the index
        content_result = subprocess.run(
            ["git", "show", f":{filename}"],
            capture_output=True, text=True
        )

        if content_result.returncode == 0:
            staged_files[filename] = content_result.stdout

    return staged_files

The --diff-filter=ACM flag limits to Added, Copied, and Modified files — not deletions. Scanning deleted file content would generate findings for secrets that are being removed, which is the wrong direction.
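One caveat with text=True: if a staged file is binary or contains bytes that are invalid in the default encoding, decoding can raise an error partway through the hook. A more defensive variant reads bytes (capture_output=True without text=True) and decodes them separately. The helper below is a sketch of that step; its name and the NUL-byte heuristic are assumptions, not part of the published hook:

```python
from typing import Optional


def decode_staged_blob(raw: bytes) -> Optional[str]:
    """Return text for a staged blob, or None if it looks binary."""
    # A NUL byte in the first few KB is a common binary-file heuristic
    if b"\x00" in raw[:8192]:
        return None
    # Replace undecodable bytes instead of failing the whole commit
    return raw.decode("utf-8", errors="replace")


print(decode_staged_blob(b'api_key = "example"'))
print(decode_staged_blob(b"\x89PNG\x00\x00"))
```

Returning None gives the caller an explicit signal to skip the file rather than scanning mangled content.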

The Scan Loop: From Staged Content to Findings

The hook extracts string literal assignments from each staged file and passes them through the ML classifier:

def scan_staged_files(staged_content: dict[str, str], threshold: float = 0.7):
    findings = []

    for filepath, content in staged_content.items():
        # Skip binary files, lock files, and known safe extensions
        if should_skip_file(filepath):
            continue

        lines = content.split('\n')

        for line_num, line in enumerate(lines, 1):
            # Skip lines with suppression annotation
            if '# secrets-ignore' in line or '# nosec' in line:
                continue

            # Extract (key_name, value) pairs from string assignments
            assignments = extract_string_assignments(line)

            for key_name, value in assignments:
                if len(value) < 8:  # Skip very short strings
                    continue

                features = extract_features(value, key_name)
                confidence = model.predict_proba([features])[0][1]

                if confidence >= threshold:
                    findings.append({
                        "file": filepath,
                        "line": line_num,
                        "key_name": key_name,
                        "value_preview": value[:20] + "..." if len(value) > 20 else value,
                        "confidence": confidence,
                        "severity": confidence_to_severity(confidence)
                    })

    return findings

A few implementation details worth highlighting:

should_skip_file() excludes file types that generate systematic false positives: package-lock.json, yarn.lock, *.sum (Go module checksums), *.min.js (minified JavaScript), binary file extensions, and image files. These are maintained in a skip list rather than being hardcoded into the scan logic, so teams can extend it for their specific false positive patterns.

Value preview truncation. The finding reports only the first 20 characters of the flagged value, with ... truncation. Showing the full value in terminal output creates a secondary exposure — if someone is screen sharing when the hook fires, the secret shouldn't appear in full in the terminal.

Minimum length of 8. Strings shorter than 8 characters are almost never secrets. This eliminates a class of false positives from short configuration values and reduces scan time on files with many string literals.
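The extract_string_assignments() helper is not shown above. A minimal sketch of what it might look like, assuming a single regular expression that covers NAME = "value" and "name": "value" forms on one line; the pattern and the preview helper are hypothetical and will miss multi-line strings and language-specific syntax:

```python
import re

# Hypothetical pattern: optional quotes around the key, then = or :, then a quoted value
_ASSIGNMENT = re.compile(
    r"""["']?(?P<key>[A-Za-z_][A-Za-z0-9_\-]*)["']?\s*[:=]\s*(?P<quote>["'])(?P<value>.*?)(?P=quote)"""
)


def extract_string_assignments(line: str) -> list[tuple[str, str]]:
    """Return (key_name, value) pairs for string literal assignments on a line."""
    return [(m.group("key"), m.group("value")) for m in _ASSIGNMENT.finditer(line)]


def preview(value: str, limit: int = 20) -> str:
    """Truncate a flagged value so the full secret never reaches the terminal."""
    return value[:limit] + "..." if len(value) > limit else value


for key, value in extract_string_assignments('api_key = "sk-proj-abc123"'):
    print(key, preview(value))
```

A regex keeps the hook fast and language-agnostic; a parser per language would be more accurate but harder to fit into a two-second budget.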

The Output: Making Findings Actionable

The most important UX decision in the hook is what to show when a commit is blocked. I went through four iterations of the output format before settling on one that developers responded well to.

Iteration 1 (too terse):

BLOCKED: Secret detected in config/settings.py

Developers immediately asked: "What secret? Where exactly? What should I do?"

Iteration 2 (better but still vague):

BLOCKED: Possible secret at config/settings.py:47

Still not enough context. Developers had to open the file and count to line 47 to understand what was flagged.

Iteration 3 (too verbose):

[SECRETS DETECTOR] 
==========================================
COMMIT BLOCKED — POTENTIAL SECRET DETECTED
==========================================
File: config/settings.py
Line: 47
Variable: api_key
Value (truncated): sk-proj-abc123...
Confidence: 94%
Severity: CRITICAL
Matched Pattern: OpenAI API key format (sk-proj-*)
Feature contributions:
  - key_name_risk: 0.90 (HIGH)
  - shannon_entropy: 5.82 (HIGH)
  - pattern_openai_key: 1.00 (MATCH)
  - repetition_ratio: 0.94 (HIGH)

To suppress this finding, add '# secrets-ignore' to line 47
To bypass this check entirely (NOT RECOMMENDED): git commit --no-verify
==========================================

This is technically complete but overwhelming. Developers in flow state don't want to read a report. They want to know: what, where, what to do.

Final version (what shipped):

🔴 Secrets Detector — Commit Blocked

  CRITICAL (94%) · config/settings.py:47
  api_key = "sk-proj-abc123..."
  ↳ Matches OpenAI key format · High entropy · Sensitive variable name

  To suppress false positive: add  # secrets-ignore  to line 47
  To use env vars instead:    export API_KEY="your-key"
                              then  api_key = os.environ["API_KEY"]

1 finding blocked this commit. Fix the issue or suppress with justification.

The final format answers the three questions developers actually have in two seconds of reading: what is it (OpenAI key), where is it (file and line), what do I do (env var example or suppression). The feature contributions are available in verbose mode (--verbose) but don't appear by default.

The emoji is intentional. 🔴 provides an immediate visual signal in terminals that support it, and degrades gracefully to plain text in terminals that don't.

Handling Multiple Findings

When multiple findings exist, the output stacks them:

🔴 Secrets Detector — Commit Blocked

  CRITICAL (96%) · src/database.py:12
  DB_PASSWORD = "Tr0ub4dor&3"
  ↳ High-risk variable name · Matches human-chosen password pattern

  HIGH (78%) · src/config.py:34
  internal_token = "prod-service-backend-2019"
  ↳ Moderate-risk variable name · Low entropy but sensitive context

2 findings blocked this commit. Fix all issues before committing.

Findings are sorted by confidence descending — the most certain findings appear first, which is where the developer's attention should go.

The commit is blocked if any finding exceeds the threshold, not just the highest-confidence one. A batch of MEDIUM confidence findings is still a blocked commit. If all findings are genuine false positives, they should all be suppressed with justification — not just the top one.
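The ordering and blocking rules can be sketched in a few lines. This assumes findings are dictionaries shaped like the ones built in the scan loop; the function name is hypothetical. The one fixed point is Git's own behaviour: a pre-commit hook that exits non-zero aborts the commit.

```python
def summarise(findings: list[dict], threshold: float = 0.7) -> tuple[list[dict], int]:
    """Return blocking findings, most confident first, plus the hook's exit code."""
    blocking = [f for f in findings if f["confidence"] >= threshold]
    blocking.sort(key=lambda f: f["confidence"], reverse=True)
    # Any blocking finding fails the hook; a non-zero exit makes git abort the commit
    exit_code = 1 if blocking else 0
    return blocking, exit_code


# Confidence values taken from the example output above
findings = [
    {"file": "src/config.py", "line": 34, "confidence": 0.78},
    {"file": "src/database.py", "line": 12, "confidence": 0.96},
]
ordered, code = summarise(findings)
print([f["file"] for f in ordered], code)
```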

The Suppression UX

The suppression path needs to be low-friction but not invisible. If suppressing a false positive is too hard, developers will use git commit --no-verify to bypass the hook entirely — which defeats the purpose.

The designed flow:

# Developer encounters a false positive:
# file_integrity_hash = "d8e8fca2dc0f896fd7cb4cb0031ba249"  ← flagged

# They add the annotation with a justification:
# MD5 hash for file integrity check only — not a credential
file_integrity_hash = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # secrets-ignore

# Commit proceeds normally on next attempt

The # secrets-ignore annotation is visible in code review. A reviewer can see that a suppression was added and evaluate whether the justification is reasonable. This is the governance layer — suppressions can't happen silently.

The hook also respects the SECRETS_DETECTOR_THRESHOLD environment variable, which allows individual developers to adjust their personal threshold without modifying shared configuration:

# Developer who wants to see more findings (lower threshold)
SECRETS_DETECTOR_THRESHOLD=0.55 git commit -m "wip"

# Developer who wants fewer false positives (higher threshold)
SECRETS_DETECTOR_THRESHOLD=0.85 git commit -m "feature: payment flow"

This flexibility matters for adoption. Some developers will want to see everything; others will want a tighter filter. Forcing everyone to the same threshold is a source of friction.
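Reading the override safely matters as much as offering it: a typo in the variable should fall back to the shared default rather than crash the hook or silently disable it. A sketch of that resolution step, assuming a default of 0.70; the function name and the fallback policy are mine:

```python
import os
from typing import Mapping

DEFAULT_THRESHOLD = 0.70


def resolve_threshold(
    env: Mapping[str, str] = os.environ,
    default: float = DEFAULT_THRESHOLD,
) -> float:
    """Use SECRETS_DETECTOR_THRESHOLD when it is a valid probability, else the default."""
    raw = env.get("SECRETS_DETECTOR_THRESHOLD")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject out-of-range values (and NaN, which fails this comparison)
    return value if 0.0 <= value <= 1.0 else default


print(resolve_threshold({"SECRETS_DETECTOR_THRESHOLD": "0.55"}))
print(resolve_threshold({"SECRETS_DETECTOR_THRESHOLD": "not-a-number"}))
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.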

Installation: Making Setup Frictionless

A hook that's hard to install never gets installed. The setup needs to be one command:

# Using pre-commit framework (recommended)
pip install pre-commit
echo "repos:
- repo: https://github.com/pgmpofu/secrets-detector
  rev: v1.0.0
  hooks:
  - id: secrets-detector
    args: [--threshold, '0.7']" > .pre-commit-config.yaml
pre-commit install

Or manual installation for teams not using the pre-commit framework:

# Copy hook to git hooks directory
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

The pre-commit framework approach is preferable for teams because it version-pins the hook and makes it part of the repository configuration (.pre-commit-config.yaml is committed). Git hooks are not installed by git clone, so each new team member still needs to run pre-commit install once after cloning. The manual approach works for individual use.

What Happens at `git commit --no-verify`

This is the escape hatch that can't be removed. Git's --no-verify flag bypasses all hooks, and there's nothing a hook can do to prevent it.

The right response to this is not technical — it's cultural and procedural.

In a team setting, git commit --no-verify should require a comment in the commit message explaining why the hook was bypassed. Git doesn't record whether a commit skipped its hooks, so CI can't detect a bypass directly. What a pipeline step can do is look for the agreed justification marker in each commit message in a PR, so that documented bypasses show up in the build log:

# In GitHub Actions
- name: Check for hook bypasses
  run: |
    # Walk every commit on the branch and report documented bypass markers
    git log --format=%H origin/main..HEAD | while read -r hash; do
      if git log --format=%B -n 1 "$hash" | grep -q "no-verify bypass"; then
        echo "Documented bypass found in $hash"
      fi
    done

The goal is to make --no-verify traceable, not to make it impossible. A developer in a genuine emergency who needs to commit right now and deal with the secret later should be able to do that — but there should be a record of the decision.

Measuring Hook Effectiveness

After the hook has been running for a few weeks, three metrics tell you whether it's working:

Bypass rate. What percentage of commits use --no-verify? A bypass rate above 10% suggests the hook is generating too many false positives or too much friction. Investigate which developers are bypassing most frequently and why.

Suppression rate. What percentage of findings are suppressed rather than fixed? High suppression rates indicate either noisy rules or developers treating suppression as the default response. Review suppressions in code review and push back on suppression-without-justification.

Secrets found in CI despite the hook. If your CI pipeline also runs a secrets scan and finds things the pre-commit hook didn't catch, those are false negatives worth understanding. Each one is an opportunity to improve the hook's coverage.
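These three metrics reduce to simple ratios once the counts are collected. The sketch below assumes the counts are gathered elsewhere (for example from commit messages and CI logs); the function names are hypothetical and the numbers in the usage line are made up for illustration. Since Git does not record hook bypasses, only documented bypasses can be counted.

```python
def rate(part: int, whole: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def hook_metrics(
    commits: int,
    documented_bypasses: int,
    findings: int,
    suppressed: int,
    ci_only_findings: int,
) -> dict[str, float]:
    """Summarise hook health over a reporting period."""
    return {
        "bypass_rate": rate(documented_bypasses, commits),
        "suppression_rate": rate(suppressed, findings),
        # Secrets CI caught that the hook missed: the hook's false negatives
        "ci_only_findings": float(ci_only_findings),
    }


# Illustrative numbers only, not measurements
metrics = hook_metrics(commits=200, documented_bypasses=30, findings=40, suppressed=10, ci_only_findings=2)
print(metrics)
```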

The hook is not a complete solution — it's the first line of defence. CI scanning is the second. Periodic full history scanning is the third. Each layer catches what the previous one misses.

The Broader Point: Shift Left Has a UX Requirement

"Shift left" — catching security issues earlier in the development lifecycle — is the right strategy. Every study on the economics of security defects confirms that earlier detection means lower remediation cost.

But shift left only works if the shifted controls are actually used. A pre-commit hook that developers disable after the first false positive has shifted nothing. A CI gate that gets bypassed in every release has shifted nothing.

The investment in UX — the careful output format, the clear suppression path, the fast scan, the explainable findings — is not cosmetic. It's what determines whether the security control actually operates or sits dormant in the repository while credentials quietly accumulate in git history.

Security controls that developers trust are security controls that get used. That's the only metric that matters.

The pre-commit hook implementation is in hooks/pre-commit at github.com/pgmpofu/secrets-detector.

Last article in the series: I ran the secrets detector against my own repositories — here's what it actually found, the false positives I encountered, and what the real-world numbers looked like.

Training on Synthetic Data: How to Build an ML Security Tool Without Touching Real Leaked Secrets

Patience Mpofu — Sat, 16 May 2026 02:51:27 +0000

Before I wrote a single line of model training code, I made a decision that constrained everything that followed.

I would not train on real leaked credentials.

The alternative was straightforward. GitHub's public commit history contains millions of accidentally committed secrets — API keys, passwords, connection strings, private keys — that have been scraped, indexed, and catalogued by security researchers. Datasets of this material exist. Using them as positive training examples would produce a model trained on exactly the kind of data it needs to recognise.

I chose not to do that. And the reasoning is more nuanced than "it felt wrong."

This article is about why I made that choice, how I built synthetic training data that avoids the problem, what the tradeoffs are, and what the broader principle is for anyone building ML security tooling.

Why Real Leaked Credentials Are a Problematic Training Source

The obvious objection to using real leaked secrets is ethical: those credentials belong to real people and organisations. Even if the data is technically public — visible in a GitHub commit, indexed by search engines — using it for commercial or portfolio purposes raises questions about consent and purpose.

But the ethical argument alone isn't the strongest one. The stronger arguments are practical.

Legal Ambiguity

The legal status of scraping and using publicly accessible but unintentionally published credentials is genuinely unclear across jurisdictions. In some interpretations of computer fraud and data protection law, accessing and storing leaked credentials — even for research purposes — could constitute unauthorised access to data or improper processing of personal information.

The GDPR position on this is particularly murky. Credentials are often linked to personal accounts. Processing personal data, even publicly accessible personal data, requires a lawful basis. "I needed it to train my model" is not a lawful basis.

I'm not a lawyer and this isn't legal advice. But I am someone building a tool I intend to put on GitHub with my name on it. "The legal status of my training data is unclear" is not a position I wanted to be in.

Data Quality Problems

Leaked credential datasets have severe quality problems that make them worse training data than they might appear.

Temporal distribution shift. Key formats change over time. GitHub PATs changed format in 2021 from a 40-character hex string to a structured format with a ghp_ prefix. AWS has introduced new key formats. An older leaked credentials dataset would train the model on formats that no longer exist while underrepresenting current formats.

Survivorship bias. The credentials that get scraped and catalogued are the ones that were detected and revoked. Harder-to-detect secrets — generically named variables, low-entropy human-chosen passwords — are systematically underrepresented in public leaked credential datasets precisely because they're harder to find.

Label noise. Not every string in a "leaked credentials" dataset is actually a sensitive credential. Test keys, example values, documentation snippets, and deliberately fake keys appear throughout. Cleaning a scraped dataset to get reliable labels is a substantial manual effort.

Negative example scarcity. A dataset of leaked credentials is purely positive examples. You still need high-quality negative examples — high-entropy strings that aren't secrets — to train a classifier that distinguishes secrets from benign values. These need to be generated separately anyway.

Synthetic data generation, done carefully, avoids all of these problems. You control the format distribution, the label quality, and the class balance precisely.

Reusability and Sharing

A tool trained on synthetic data can be shared freely. The training code can be published. The data generation methodology can be documented. Other researchers can reproduce, audit, and improve the approach.

A tool trained on scraped real credentials has a provenance problem the moment someone asks "where did your training data come from?" Publishing that training data would mean republishing the leaked credentials. Not publishing it means the model can't be fully reproduced or audited.

Reproducibility matters in security tooling specifically because trust matters. A secrets detector that you can't audit end-to-end is a secrets detector you're taking on faith.

How I Generated Synthetic Training Data

The synthetic data generator in trainer.py produces two classes of examples: secrets (label=1) and benign high-entropy strings (label=0).

Generating Positive Examples (Secrets)

For known-format secrets, I generate values that match the structural properties of real secrets without being real secrets:

import base64
import json
import random
import string

def generate_aws_access_key():
    """Generate synthetic AWS access key format"""
    chars = string.ascii_uppercase + string.digits
    suffix = ''.join(random.choices(chars, k=16))
    return f"AKIA{suffix}"

def generate_github_pat():
    """Generate synthetic GitHub PAT (new format)"""
    chars = string.ascii_letters + string.digits
    suffix = ''.join(random.choices(chars, k=36))
    prefix = random.choice(['ghp', 'gho', 'ghu', 'ghs', 'ghr'])
    return f"{prefix}_{suffix}"

def generate_jwt():
    """Generate syntactically valid JWT structure"""
    header = base64.urlsafe_b64encode(
        json.dumps({"alg": "HS256", "typ": "JWT"}).encode()
    ).rstrip(b'=').decode()
    payload = base64.urlsafe_b64encode(
        json.dumps({"sub": "1234567890", "iat": 1516239022}).encode()
    ).rstrip(b'=').decode()
    signature = ''.join(random.choices(
        string.ascii_letters + string.digits + '-_', k=43
    ))
    return f"{header}.{payload}.{signature}"

For generic hardcoded credentials — the human-chosen passwords and internal tokens that no regex would catch — I generate values following common human password patterns:

def generate_human_chosen_password():
    """Generate realistic human-chosen passwords"""
    patterns = [
        # Word + year + special
        lambda: f"{random.choice(COMMON_WORDS)}{random.randint(2015, 2024)}{random.choice('!@#$%')}",
        # Capitalised word + number
        lambda: f"{random.choice(COMMON_WORDS).capitalize()}{random.randint(1, 999)}",
        # Two words concatenated
        lambda: f"{random.choice(COMMON_WORDS)}{random.choice(COMMON_WORDS)}",
        # Word + special pattern
        lambda: f"{random.choice(COMMON_WORDS).upper()}_{random.randint(100, 999)}",
    ]
    return random.choice(patterns)()

COMMON_WORDS = [
    "winter", "summer", "spring", "autumn", "admin", "secure",
    "company", "service", "backend", "system", "master", "main",
    "deploy", "production", "staging", "develop", "internal"
]

Each generated secret is paired with a realistic variable name drawn from the high-risk vocabulary:

SECRET_VARIABLE_NAMES = [
    "API_KEY", "api_key", "apiKey",
    "SECRET_KEY", "secret_key", "secretKey",
    "PASSWORD", "password", "passwd", "pwd",
    "ACCESS_TOKEN", "access_token", "accessToken",
    "DATABASE_URL", "database_url", "db_url", "DB_URL",
    "PRIVATE_KEY", "private_key", "privateKey",
    # ... 40+ more
]

The (variable_name, value) pairs that go into training represent the full context the feature extractor sees.
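As a rough illustration of that pairing step, the sketch below builds a small balanced set of (variable_name, value, label) rows. The helper names (`fake_secret`, `fake_benign`, `build_dataset`) and the shortened name lists are hypothetical stand-ins, not the actual contents of trainer.py.

```python
import random
import string

# Hypothetical, shortened stand-ins for the article's name vocabularies.
SECRET_NAMES = ["API_KEY", "password", "access_token"]
BENIGN_NAMES = ["checksum", "uuid", "version"]

def fake_secret(rng):
    """Synthetic AWS-style key: AKIA + 16 uppercase/digit characters."""
    chars = string.ascii_uppercase + string.digits
    return "AKIA" + "".join(rng.choices(chars, k=16))

def fake_benign(rng):
    """Synthetic 32-character hex digest, similar to an MD5 checksum."""
    return "".join(rng.choices("0123456789abcdef", k=32))

def build_dataset(n_samples, seed=0):
    """Return shuffled (variable_name, value, label) rows, half secrets, half benign."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples // 2):
        rows.append((rng.choice(SECRET_NAMES), fake_secret(rng), 1))
        rows.append((rng.choice(BENIGN_NAMES), fake_benign(rng), 0))
    rng.shuffle(rows)
    return rows

rows = build_dataset(100)
```

Seeding a local `random.Random` instance keeps the generated set reproducible, which matters if the dataset is meant to be audited.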

Generating Negative Examples (Benign High-Entropy Strings)

The negative class is where most secrets detectors fail — they don't have enough high-quality negative examples, so the model learns "high entropy = secret" rather than "high entropy in a credential context = secret."

I generate several categories of benign high-entropy strings:

import base64
import hashlib
import random
import string
import uuid

def generate_uuid():
    return str(uuid.uuid4())

def generate_sha256_hash():
    content = ''.join(random.choices(string.printable, k=random.randint(10, 100)))
    return hashlib.sha256(content.encode()).hexdigest()

def generate_md5_hash():
    content = ''.join(random.choices(string.printable, k=random.randint(10, 100)))
    return hashlib.md5(content.encode()).hexdigest()

def generate_base64_data():
    """Simulate base64-encoded image or binary data fragments"""
    data = bytes(random.randint(0, 255) for _ in range(random.randint(20, 60)))
    return base64.b64encode(data).decode()

def generate_package_integrity_hash():
    """npm/yarn integrity hash format"""
    data = bytes(random.randint(0, 255) for _ in range(48))
    hash_val = base64.b64encode(data).decode()
    return f"sha512-{hash_val}"

def generate_hex_color():
    return ''.join(random.choices('0123456789abcdef', k=6))

def generate_version_string():
    major = random.randint(0, 10)
    minor = random.randint(0, 99)
    patch = random.randint(0, 999)
    return f"{major}.{minor}.{patch}"

Each negative example is paired with a low-risk variable name:

BENIGN_VARIABLE_NAMES = [
    "checksum", "hash", "digest", "fingerprint",
    "uuid", "guid", "id", "identifier", "correlation_id",
    "version", "release", "build_number",
    "color", "colour", "hex_color",
    "integrity", "content_hash",
    # ... 30+ more
]

Class Balance and Distribution

The training set uses a 50/50 class balance — equal numbers of secrets and benign strings. This is a deliberate choice.

Real codebases have far fewer secrets than benign strings — maybe 1% of high-entropy strings are actual secrets in a typical codebase. Training on a 1% positive class would produce a classifier that learns to say "not a secret" almost all the time and achieves 99% accuracy by doing so — completely useless.
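The arithmetic behind that claim can be checked directly. The sketch below scores a hypothetical classifier that always answers "not a secret"; the 10,000-sample size is an arbitrary assumption chosen for round numbers.

```python
def majority_baseline(n_total, positive_rate):
    """Accuracy and recall of a classifier that always predicts 'not a secret'."""
    n_pos = round(n_total * positive_rate)
    n_neg = n_total - n_pos
    accuracy = n_neg / n_total  # every benign string counts as correct
    recall = 0.0                # no secret is ever flagged
    return accuracy, recall

accuracy, recall = majority_baseline(10_000, 0.01)
print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```

High accuracy with zero recall is the failure mode a balanced training set is meant to avoid.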

A 50/50 balance forces the model to actually learn to distinguish the classes. The resulting classifier has higher false positive rates on real codebases than the training accuracy suggests, which is why the confidence threshold (default 0.7) and the key name feature do so much work in production.

The threshold can be adjusted to trade precision for recall:

# Higher threshold — fewer false positives, more false negatives
python main.py scan ./src --threshold 0.85

# Lower threshold — more findings, more false positives
python main.py scan ./src --threshold 0.55
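To make the tradeoff concrete, here is a minimal sketch that computes precision and recall at several thresholds. The scores and labels are made-up values for illustration, not output from the tool.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as a secret."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    # Convention: precision is 1.0 when nothing is flagged at all.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative confidence scores and ground-truth labels (1 = secret).
scores = [0.95, 0.90, 0.80, 0.75, 0.60, 0.58, 0.40, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for t in (0.55, 0.70, 0.85):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold increases precision and lowers recall, which is the behaviour the two commands above rely on.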

Validating Synthetic Data Quality

The risk of synthetic data is that it doesn't reflect the distribution of real data. A model trained on synthetic examples might perform well on the test set (also synthetic) and poorly on real codebases.

I validated against three real-world test cases:

Test 1: Known public secret patterns. I collected public documentation examples of secret formats — the example values shown in AWS, GitHub, and OpenAI documentation. These are not real secrets; they're deliberately fake values used in documentation. The model should classify them as secrets (since they match real formats) and does so at >95% confidence.

Test 2: Known benign high-entropy strings. I collected package-lock.json integrity hashes, UUID values from public test suites, and SHA-256 checksums from public software distributions. The model should classify these as benign and does so at <10% confidence in the vast majority of cases.

Test 3: Edge cases from my own code. I scanned my own development projects — including the secrets detector itself — and manually reviewed every finding above 0.5 confidence. This is where real-world calibration happens. The findings from this scan informed several adjustments to the key name vocabulary and confidence thresholds.

The synthetic approach doesn't eliminate the need for this kind of real-world validation. It just means the real-world validation is about calibration rather than about whether the model has learned anything at all.

The Ongoing Data Problem: Concept Drift

Secret formats change. New services launch with new key formats. Existing services rotate their key structures for security reasons. The synthetic data that was representative in 2023 may underrepresent the formats that matter in 2025.

This is the secrets detection equivalent of the vulnerability scanner coverage gap problem — there will always be a lag between a new format appearing in the wild and the tool being updated to detect it.

The response to this is the same as it is for signature-based detection: a clear update process. When a new cloud service launches with a distinctive key format, the update is:

Add a generator function for the new format to trainer.py
Add a pattern match flag to the feature extractor
Retrain: python main.py train --samples 6000
The new format is now detected

The synthetic data approach makes this update cycle fast and low-risk. Adding new training examples doesn't require finding or curating real examples of the new format, only implementing its generation logic. Retraining takes seconds. The update can ship as a minor version bump.
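The first two steps of that update might look like the sketch below. The "acme_" service and its key format are invented for illustration, and the function names are assumptions rather than the real contents of trainer.py or the feature extractor.

```python
import random
import re
import string

# Hypothetical format for a made-up service:
# "acme_" followed by 32 lowercase alphanumeric characters.
ACME_KEY_RE = re.compile(r"^acme_[a-z0-9]{32}$")

def generate_acme_key(rng=random):
    """Step 1: generator for the new format, used to create training examples."""
    chars = string.ascii_lowercase + string.digits
    return "acme_" + "".join(rng.choices(chars, k=32))

def pattern_acme_key(value):
    """Step 2: pattern flag exposed to the classifier as a 0.0/1.0 feature."""
    return 1.0 if ACME_KEY_RE.match(value) else 0.0

sample = generate_acme_key()
print(sample, pattern_acme_key(sample))
```

Keeping the generator and the pattern flag next to each other makes it easy to check that every generated example actually triggers its own flag.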

Synthetic Data as a Security Research Methodology

Stepping back from this specific tool: the synthetic data approach is applicable to a much broader class of security ML problems.

Phishing email detection can be trained on algorithmically generated phishing templates rather than real phishing emails, which carry real malicious links and attachments.

Malware classification researchers face the same problem I faced — real malware samples are dangerous to handle and distribute. Synthetic malware features derived from known behavioral signatures can substitute for actual samples in feature-level classifiers.

Log anomaly detection for security can use synthetic attack log patterns derived from published attack techniques rather than actual attack logs from production systems.

The common thread: real security data is often sensitive, legally ambiguous, dangerous to handle, or has quality problems that make it worse than it appears. Carefully generated synthetic data, validated against real-world examples without incorporating them into training, is frequently the more practical path.

The tradeoff is always the same: you give up the naturalness of real data distribution in exchange for control, safety, reproducibility, and shareability. For security tooling specifically — where trust and auditability matter — that tradeoff is often worth making.

What I'd Do Differently

If I were building this tool for a commercial security product rather than a portfolio project, I'd approach training data differently in two ways.

Structured negative mining from real codebases. Rather than generating synthetic negative examples, I'd mine real open source repositories for high-entropy strings that are demonstrably not secrets — package hashes, checksums in test suites, example values in documentation. These are safe to use (no real credentials), have the right distribution (they appear in real code as developers write it), and don't require synthetic generation. The labeling work is the constraint, not the data availability.

A small labeled set of real format examples. Not real credentials — but real format examples. The example values in service provider documentation (AWS's AKIAIOSFODNN7EXAMPLE, GitHub's documented PAT format examples) are designed to look like real keys without being real keys. A small set of these, clearly labeled, would improve the model's calibration on the exact formats that matter most.

The synthetic approach I built is the right choice given the constraint of a solo portfolio project with no data labeling resources. A team building a production tool would have access to more options.

The data generation code is in trainer.py at github.com/pgmpofu/secrets-detector. All generators are clearly documented and the entire training pipeline is reproducible from scratch with a single command.

Next up: building the pre-commit hook — blocking secrets before they ever reach the repository, and the UX considerations that determine whether developers actually leave it enabled.

Why I Chose Random Forest Over Deep Learning for Secrets Detection

Patience Mpofu — Sat, 16 May 2026 02:47:49 +0000

Every time I mention that my secrets detector uses a Random Forest classifier, someone asks the same question.

"Why not a neural network?"

It's a reasonable question. Deep learning dominates ML benchmarks. Transformers have redefined what's possible in natural language understanding. If you're building a tool that reads code — which is text — shouldn't you be using the most powerful text understanding architecture available?

The answer is no. And the reasoning reveals something important about how to match ML approaches to real-world engineering constraints.

This article is the full argument: why Random Forest was the right choice for this specific problem, what I give up by not going deep, and when the calculus would flip.

The Five Constraints That Shaped the Decision

Before choosing a model architecture, I defined the constraints the tool had to satisfy. These weren't aspirational — they were hard requirements that would determine whether the tool was actually useful in practice.

Constraint 1: Must run locally with zero infrastructure.
The tool needs to work in a pre-commit hook, on a developer's laptop, without internet access. No API calls, no GPU, no Docker compose stack with a model server. A developer running git commit should not experience meaningful latency.

Constraint 2: Must ship as a self-contained package.
The model file needs to be small enough to live in the repository alongside the code. Teams shouldn't need to download a separate model artifact or manage model versioning separately from tool versioning.

Constraint 3: Must be retrainable by a non-ML engineer.
When a team encounters false positives specific to their codebase, they should be able to add examples and retrain without ML expertise, a GPU, or more than a few minutes of compute time.

Constraint 4: Must explain its decisions.
When the tool flags a finding, an engineer should be able to understand why. "The model said so" is not an acceptable answer in a security context where false positives erode trust.

Constraint 5: Must generalise from a small training set.
I'm training on synthetically generated data, not millions of real examples. The architecture needs to perform well with thousands of samples, not billions.

Every significant architectural decision in this tool flows from these five constraints. Let me show you how.

Why Deep Learning Fails Each Constraint

Constraint 1: Local execution without infrastructure

A production-quality transformer model for code understanding, such as CodeBERT or GraphCodeBERT, ranges from hundreds of megabytes to several gigabytes. Running inference on a CPU is possible but slow: several seconds per scan on a typical laptop. In a pre-commit hook, where the developer is waiting at the terminal, that's unacceptable friction.

The Random Forest model runs inference in milliseconds on CPU. Scanning a 10,000-line codebase takes under two seconds on a five-year-old laptop. There's no perceptible delay between git commit and the hook completing.

Constraint 2: Self-contained package

The trained Random Forest model serialises to approximately 1MB as a pickle file. It lives in the model/ directory, ships with the tool, and requires no separate download or version management.

A fine-tuned transformer model would be 400MB–2GB depending on architecture. That's not viable as a repository artifact. It requires separate model hosting, download scripts, and version coordination — none of which a team setting up a pre-commit hook wants to manage.

Constraint 3: Retrainable by non-ML engineers

Retraining the Random Forest on 6,000 samples takes approximately eight seconds on a standard laptop CPU. Scaling to 50,000 samples takes about ninety seconds. The entire workflow is:

# Edit trainer.py to add your examples
# Then:
python main.py train --samples 10000
# Done. New model.pkl in model/

Retraining a transformer requires GPU infrastructure, hours of compute time, careful learning rate scheduling to avoid catastrophic forgetting, and validation that fine-tuning didn't degrade performance on the base cases. A team without an ML engineer cannot do this.

Constraint 4: Explainable decisions

This is where the gap between Random Forest and deep learning is most significant for a security tool.

Random Forest gives you feature importances globally and, with one additional step, per-prediction explanations. When the tool flags a finding, it can tell you exactly which features drove the decision:

Finding: api_key = "sk-proj-abc123XYZ789..."
Confidence: 96%

Contributing features:
  key_name_risk:        0.90  (HIGH — 'api_key' matches sensitive vocabulary)
  shannon_entropy:      5.82  (HIGH — consistent with cryptographic secret)
  pattern_openai_key:   1.00  (MATCH — matches OpenAI key format sk-proj-*)
  repetition_ratio:     0.94  (HIGH — low character repetition, high randomness)

An engineer reading this knows immediately why the finding was generated. They can evaluate whether the reasoning is sound. They can make an informed decision about whether to fix or suppress.
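One simple way to choose which features to display is to rank a finding's active features by the model's global importances. The sketch below does only that; it is not a true per-prediction attribution (such as tracing tree decision paths), and the importance assigned to `pattern_openai_key` is an assumed value.

```python
def explain(features, importances, top_n=4):
    """List a finding's non-zero features, ordered by global importance."""
    active = [(name, value) for name, value in features.items() if value != 0]
    active.sort(key=lambda item: importances.get(item[0], 0.0), reverse=True)
    return [f"{name:<24}{value:.2f}" for name, value in active[:top_n]]

# Feature values taken from the example finding above.
finding = {
    "key_name_risk": 0.90,
    "shannon_entropy": 5.82,
    "pattern_openai_key": 1.00,
    "pattern_aws_access_key": 0.00,
    "repetition_ratio": 0.94,
}
# Global importances; the pattern_openai_key value is an assumption.
global_importances = {
    "key_name_risk": 0.28,
    "shannon_entropy": 0.14,
    "pattern_aws_access_key": 0.09,
    "repetition_ratio": 0.08,
    "pattern_openai_key": 0.05,
}

report = explain(finding, global_importances)
print("\n".join(report))
```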

A neural network produces a probability: 0.96. No more. You can apply techniques like SHAP or LIME to approximate explanations, but these add complexity, latency, and approximation error. For a pre-commit hook that needs to explain itself to a developer in real time, "here are the features that drove this" is vastly better than "the attention mechanism focused on these tokens (approximately)."

Constraint 5: Generalisation from small data

Transformer models are data-hungry. They're pre-trained on billions of tokens and fine-tuned on millions of examples. Their power comes from the scale of pre-training, which means fine-tuning on thousands of synthetic examples carries real risk of the model not generalising well to patterns it hasn't seen.

Random Forest with well-engineered features generalises effectively from thousands of examples. The feature engineering does the heavy lifting — entropy, character ratios, key name scoring, pattern flags. The model only needs to learn the relationships between these pre-computed features, which is a much simpler learning problem than learning representations from raw text.
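A minimal sketch of what such pre-computed features could look like is shown below. The feature names follow the article, but the exact definitions (in particular `repetition_ratio` as unique characters divided by length) are assumptions, not the tool's actual extractor.

```python
import math
from collections import Counter

def shannon_entropy(value):
    """Shannon entropy of the string in bits per character."""
    if not value:
        return 0.0
    n = len(value)
    counts = Counter(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def extract_features(value):
    """Map a string to a small dictionary of numeric features."""
    n = len(value) or 1  # avoid division by zero on empty strings
    return {
        "shannon_entropy": shannon_entropy(value),
        "hex_ratio": sum(ch in "0123456789abcdefABCDEF" for ch in value) / n,
        "uppercase_ratio": sum(ch.isupper() for ch in value) / n,
        "repetition_ratio": len(set(value)) / n,  # assumed definition
        "log_length": math.log(n),
    }

features = extract_features("d8e8fca2dc0f896fd7cb4cb0031ba249")
print(features)
```

The classifier then sees only this fixed-length numeric vector, never the raw string.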

What I Give Up

Intellectual honesty requires being clear about the tradeoffs.

Peak accuracy ceiling. A well-fine-tuned code understanding model operating on token sequences would almost certainly achieve higher peak accuracy than my feature-engineered Random Forest. It would learn representations I haven't thought to engineer explicitly. It would capture multi-token context — the fact that password appears three lines before = "..." rather than directly adjacent, for instance.

Novel format generalisation. When a new cloud provider launches with a distinctive key format, my tool catches it only if I add a pattern match flag. A neural network trained on diverse secret formats might generalise to novel formats by recognising that they "look like secrets" in ways the feature vector doesn't capture. My tool requires an explicit pattern update.

Code context understanding. The feature vector sees one value at a time. A transformer scanning the whole file could understand that a value is being loaded from an environment variable rather than being hardcoded, that it's inside a test mock, or that it's in a comment rather than executable code. My tool handles some of these through pre-processing (only scanning string literals in executable code), but the context window is fundamentally narrower.

Cross-line data flow. If a secret is assembled across multiple lines — partial string concatenation, format strings, bytes operations — the feature vector sees fragments rather than the complete secret. A model with broader context could potentially catch these.
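The pre-processing mentioned under code context understanding (scanning only string literals in executable code) can be approximated for Python sources with the standard `ast` module. This is a sketch under that assumption; the function name is hypothetical and the tool's real pre-processing may work differently.

```python
import ast

def hardcoded_string_assignments(source):
    """Yield (variable_name, value) for simple `name = "literal"` assignments."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue  # skips calls, subscripts such as os.environ[...], and numbers
        for target in node.targets:
            if isinstance(target, ast.Name):
                yield target.id, value.value

example = '''
password = "Winter2019!"
api_key = os.environ["API_KEY"]
# old_key = "commented-out"
secret = get_secret_from_vault()
'''

found = list(hardcoded_string_assignments(example))
print(found)
```

Comments never reach the syntax tree, and values loaded from the environment or a function call are not string constants, so only the hardcoded assignment is reported. Secrets assembled across several lines would still be missed, as described above.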

The Accuracy Numbers

On my test set of 1,200 labeled samples (a held-out 20% of the 6,000 generated samples), the Random Forest achieves:

Metric	Score
Accuracy	94.2%
Precision	93.8%
Recall	94.7%
F1 Score	94.2%
False Positive Rate	5.8%
False Negative Rate	5.3%

For context: TruffleHog v3 (regex + entropy) reports false positive rates in the 10–15% range on typical codebases according to published evaluations. The ML approach achieves meaningfully better precision without sacrificing recall.

I don't have a head-to-head comparison against a fine-tuned transformer on this specific task — that would require the transformer, the training infrastructure, and a larger labeled dataset than I have. What I can say is that the Random Forest achieves accuracy that's competitive with existing tools, meets all five operational constraints, and does so at a fraction of the complexity.
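For readers who want to reproduce this kind of table on their own data, every metric reported above can be derived from four confusion-matrix counts. The counts in the sketch below are placeholders, not the tool's actual results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Placeholder counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=8, fn=2)
for name, value in metrics.items():
    print(f"{name:<22}{value:.1%}")
```

Note that the false negative rate is always one minus recall, while the false positive rate depends on the number of benign samples rather than on precision.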

The Decision Framework: When Would I Choose Deep Learning?

Given all of the above, there are scenarios where I would choose a different architecture.

If I were building a cloud-hosted scanning service, the infrastructure constraint disappears. GPU inference is available. Model size doesn't matter. Latency can be managed with caching and batching. In that scenario, a transformer-based approach becomes viable and the accuracy ceiling argument gets stronger.

If I had a large labeled dataset of real secrets, the data constraint relaxes. Fine-tuning on tens of thousands of real examples would likely push accuracy significantly higher than what synthetic data training achieves. The question then becomes whether the accuracy gain justifies the operational complexity.

If the primary use case were batch scanning rather than pre-commit hooks, the latency constraint loosens. Scanning a repository's entire history overnight can tolerate seconds or minutes per file. The pre-commit use case is what drives the millisecond inference requirement.

If cross-file context mattered, a graph neural network operating on the code's data flow graph might be more appropriate than either approach. Understanding that secret = get_secret_from_vault() is safe and secret = "hardcoded" is dangerous requires understanding function call semantics — something Random Forest on string features cannot do.

The right architecture is always determined by the constraints of the deployment context, not by what achieves the best benchmark score.

The Broader Principle: Fit for Purpose Over State of the Art

The machine learning community has a bias toward the most powerful available architecture. More parameters, more data, more compute — these are treated as virtues in research contexts where they often are virtues.

Production engineering has different values. A tool that actually gets used — because it's fast, explainable, maintainable, and deployable without infrastructure — delivers more security value than a theoretically superior tool that sits unused because it's too slow, too opaque, or too complex to operate.

This is an instance of a general principle I keep encountering in AppSec: the best security control is the one that gets implemented and maintained, not the one that provides the strongest theoretical protection.

A Random Forest secrets detector running in every developer's pre-commit hook, catching 94% of secrets before they reach the repository, is more valuable than a transformer-based detector achieving 98% accuracy that nobody bothered to deploy because the setup was too complicated.

The four-point accuracy difference is real. The deployment difference is everything.

What the Feature Importances Reveal About the Problem

One thing Random Forest gives you that deep learning doesn't: a clear picture of what the problem actually is.

Here are the top 10 feature importances from the trained model:

Rank	Feature	Importance
1	`key_name_risk`	0.28
2	`shannon_entropy`	0.14
3	`pattern_aws_access_key`	0.09
4	`repetition_ratio`	0.08
5	`hex_ratio`	0.07
6	`pattern_github_pat`	0.06
7	`base64_ratio`	0.05
8	`log_length`	0.04
9	`pattern_private_key_header`	0.04
10	`uppercase_ratio`	0.03

This table is a map of the secrets detection problem. It tells you that variable naming context is more predictive than any statistical property of the string itself. It tells you that entropy matters but not as much as everyone assumes. It tells you that AWS and GitHub keys are important enough that their specific pattern flags appear in the top ten even though there are 16 pattern flags spread across the remaining importance budget.

A neural network would learn similar underlying structure — it would attend more to variable names than to arbitrary string characters — but it wouldn't show you that structure explicitly. The interpretability of Random Forest turns model training into a research exercise as well as an engineering one.

That visibility into what the problem actually is informed every design decision in this tool. It's one of the most valuable things the architecture choice gave me.

The trainer code, feature importances, and model evaluation scripts are all in the repository at github.com/pgmpofu/secrets-detector.

Next up: the ethical and practical challenge of training a security ML model without using real leaked credentials — why synthetic data, how I generated it, and what the tradeoffs are.

Why the Variable Name Is the Most Important Feature in Secrets Detection

Patience Mpofu — Thu, 14 May 2026 02:26:43 +0000

Here's a question that sounds trivial until you think about it carefully.

Are these two lines of code equally dangerous?

checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"
password = "d8e8fca2dc0f896fd7cb4cb0031ba249"

The string value is identical. The entropy is identical. Every character-level feature is identical. A regex scanner treats them the same. A pure entropy scanner treats them the same. A human security engineer does not treat them the same — not even slightly.

The first is almost certainly a file integrity hash. The second is almost certainly an exposed credential. The only difference is the variable name before the equals sign.

When I trained my secrets detector and examined the feature importances, the variable name risk score came out at 0.28 — higher than Shannon entropy, higher than all character distribution features, higher than string length. The single most predictive signal for whether a string is a secret is not the string itself. It's what the developer named the variable holding it.

This article is about what that finding reveals — about how secrets detection actually works, about how developers accidentally expose credentials, and about what it means for how we should think about this entire problem class.

Why Feature Importance of 0.28 Is Remarkable

In a Random Forest model, feature importance is measured by how much each feature reduces impurity across all decision trees, normalised so that the importances sum to 1.0. An importance of 0.28, across 26 features, means the variable name alone accounts for more than a quarter of the total impurity reduction, which is a reasonable proxy for its share of the model's predictive power.
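To make "reduces impurity" concrete, the sketch below computes the Gini impurity decrease of a single split on binary labels. Impurity-based importance, as implemented in libraries such as scikit-learn, accumulates decreases like this for every split that uses a feature and then normalises across features; the code here shows only the per-split quantity.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = [1, 1, 0, 0]
perfect_split = impurity_decrease(parent, [1, 1], [0, 0])
useless_split = impurity_decrease(parent, [1, 0], [1, 0])
print(perfect_split, useless_split)
```

A feature that frequently produces splits close to the perfect case, as the variable name score apparently does, ends up with a large share of the total.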

To put that in context: if you removed every other feature and kept only the variable name, you'd still have a classifier that makes correct decisions on the majority of cases. If you kept every other feature and removed the variable name, you'd lose more predictive power than any other single change.

That's not what I expected when I designed the feature vector. I expected entropy to dominate — it's the signal that most secrets detection literature focuses on. The finding that variable names outperform entropy forced me to rethink some assumptions about the problem.

What Variable Names Actually Encode

Variable names in production code are not arbitrary. They're communication.

When a developer writes api_key = "...", they're not just labelling a memory location. They're documenting their intent. They're telling the next engineer — and, it turns out, a machine learning classifier — that this value is an API key, that it's sensitive, that it should be treated as a secret.

Developers are remarkably consistent about this. Across codebases, languages, and organisations, the same small vocabulary appears around credential storage:

password, passwd, pwd
secret, secret_key, client_secret
api_key, apikey, api_token
token, access_token, auth_token, bearer_token
private_key, privkey, pem
credential, credentials, creds
database_url, db_url, connection_string

And a complementary vocabulary appears around non-sensitive high-entropy strings:

checksum, hash, digest, fingerprint
uuid, guid, id, identifier
version, release, build
color, colour, hex
integrity, signature (in package manifest contexts)

The signal isn't perfect — id sometimes refers to a sensitive identifier, token is sometimes used for pagination tokens or CSRF tokens that aren't secrets in the traditional sense. But the correlation between variable name semantics and actual sensitivity is strong enough to be the most predictive single feature in a 26-dimensional model.

The Three Ways Developers Name Credential Variables

Understanding how variable names signal secrets requires understanding the patterns developers actually use. In practice, there are three distinct naming patterns, each with different detection implications.

Pattern 1: Direct and Obvious

The developer uses a name that directly and unambiguously identifies the value as sensitive:

STRIPE_SECRET_KEY = "sk_live_abc123..."
DATABASE_PASSWORD = "Tr0ub4dor&3"
GITHUB_ACCESS_TOKEN = "ghp_abc123..."
JWT_SECRET = "my-super-secret-signing-key"

These are the easy cases. The variable name scores maximum risk, the classifier is highly confident, and the finding is genuine in nearly every instance. There's no ambiguity to resolve.

Pattern 2: Abbreviated and Conventional

The developer uses a shortened or conventional form that's recognisable within the development community but might be less obvious to an outsider:

DB_PASS = "Winter2019!"          # "PASS" → password
AWS_SK = "wJalrXUtn..."          # "SK" → secret key
OAUTH_CS = "abc123def456..."     # "CS" → client secret
SVC_PWD = "service_password_1"  # "PWD" → password

The risk scoring function handles these through substring matching and a vocabulary of abbreviations. PASS, PWD, SK (in certain contexts), CS, TKN all score high. This is where coverage gaps can appear — an unusual abbreviation in a domain-specific codebase might not be in the vocabulary.
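A minimal sketch of that kind of scoring function follows. The vocabulary is deliberately tiny, and the scores are assumptions except where the article quotes them ("token" 0.9, "key" 0.7, "value" 0.1); the real function may weigh things differently.

```python
import re

# Illustrative vocabulary; most scores are assumptions.
HIGH_RISK = {"password": 1.0, "passwd": 1.0, "secret": 1.0, "pass": 0.9,
             "pwd": 0.9, "token": 0.9, "tkn": 0.9, "key": 0.7}
LOW_RISK = {"checksum", "hash", "digest", "uuid", "version", "color"}
DEFAULT_SCORE = 0.1

def tokenize(name):
    """Split snake_case, SCREAMING_CASE and camelCase names into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]

def key_name_risk(name):
    """Score a variable name using substring matches against the vocabulary."""
    tokens = tokenize(name)
    scores = [score for t in tokens for word, score in HIGH_RISK.items() if word in t]
    if scores:
        return max(scores)
    if any(t in LOW_RISK for t in tokens):
        return 0.0
    return DEFAULT_SCORE

for name in ("DB_PASS", "SVC_PWD", "apiKey", "checksum", "value"):
    print(name, key_name_risk(name))
```

Substring matching is what lets abbreviations like PASS score high, and it is also the source of the coverage gaps noted above: an abbreviation missing from the vocabulary simply falls through to the default score.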

Pattern 3: Contextually Sensitive

The developer uses a name that doesn't obviously indicate sensitivity on its own but becomes sensitive in context:

# In a payment processing module
value = "sk_live_abc123..."       # "value" alone scores 0.1

# In a configuration dictionary
config = {
    "key": "AKIAIOSFODNN7EXAMPLE"  # "key" alone scores 0.7
}

# In a function parameter
def authenticate(token):           # "token" scores 0.9
    headers = {"Authorization": f"Bearer {token}"}

These cases are where the feature vector struggles most. The variable name score for value is 0.1 — weak evidence of sensitivity. In isolation, the classifier would likely pass this. But if the string value itself has high entropy and matches a known pattern (like the Stripe key format), the pattern flags compensate.

This interaction — where a weak key name score is overridden by strong pattern match flags — is exactly how the feature vector is supposed to work. No single feature dominates all cases. The combination handles cases that individual features miss.

The Accidental Exposure Psychology

The variable name finding reveals something important about how secrets end up in code in the first place.

Developers rarely commit secrets because they fail to realise the values are sensitive. They commit them because avoiding the hardcoded value costs more effort at the moment of writing.

The archetypal scenario:

Developer is building a feature that needs an API key
Developer adds the key directly to the code to get the feature working
Developer intends to "move it to environment variables later"
Developer commits the working code
"Later" never comes, or the commit is already in history

The variable name in this scenario is almost always informative: the developer names it API_KEY or STRIPE_KEY or DB_PASSWORD because they know exactly what it is. They're not hiding it from themselves. They're just deferring the cleanup.

This is why the variable name is such a strong signal. The developer's intent is encoded in the name. The developer knew this was a secret when they wrote it. The name reflects that knowledge.

The cases where variable names are not informative are the cases where secrets end up in code through a different mechanism — configuration files that get accidentally committed, environment files that get accidentally tracked, secrets embedded in test data that wasn't meant to be realistic. These are harder to catch and require the entropy and pattern features to carry more weight.

Where Variable Name Scoring Breaks Down

Being precise about the limits of this approach matters.

Obfuscated Names

An attacker who knows about this detection vector could theoretically name credential variables to avoid detection:

# Designed to evade variable name scoring
x = "AKIAIOSFODNN7EXAMPLE"
data_1 = "sk_live_abc123def456ghi789jkl012"
temp = "my-database-password-2024"

The first two would still be caught by pattern match flags — the AWS key format and Stripe key format are distinctive enough that entropy and pattern features alone classify them correctly. The third — a human-chosen password stored in a variable named temp — would be at risk of being missed if the entropy is low.

This is a real gap. In practice, deliberately obfuscated variable names are uncommon in the codebases secrets appear in — developers who are trying to hide secrets from their own team are a different threat model than developers who accidentally expose secrets. But the gap exists.

Generic Framework Patterns

Frameworks and ORMs often use generic variable names that pattern-match to high risk scores without holding sensitive values:

# Django ORM — "password" is a field name, not a credential
class User(models.Model):
    password = models.CharField(max_length=128)  # hashed, not plaintext

# Spring Security — "token" is a parameter name
public ResponseEntity<?> authenticate(@RequestBody TokenRequest token) {

In these cases, the value being assigned is a class reference, a method call, or a parameter object — not a string literal containing a secret. The feature extraction pipeline only runs on string literals, so these cases are largely handled correctly at the scanning stage before feature extraction begins.

Internationalised Codebases

The variable name risk vocabulary is English-only. A codebase with variable names in German (Passwort), French (motDePasse), or Portuguese (senha) will have those variables score the default 0.3 rather than 1.0. This is a genuine coverage gap for multinational organisations. Extending the vocabulary to cover common credential-related terms in other languages would be a meaningful improvement.
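A vocabulary extension along those lines could look like the sketch below. It is an assumption-laden illustration rather than the tool's code: `TRANSLATED_RISK`, `normalise`, and `multilingual_risk` are hypothetical names, the translated terms are the three mentioned above, and each is simply given the score of its English equivalent.

```python
import unicodedata

# Small subset of the English vocabulary, for illustration only.
BASE_RISK = {"password": 1.0, "passwd": 1.0, "secret": 1.0}

# Hypothetical extension: translated terms scored like their English equivalent.
TRANSLATED_RISK = {"passwort": 1.0, "motdepasse": 1.0, "senha": 1.0}


def normalise(name: str) -> str:
    """Strip accents, lowercase, and drop separators so that
    mot_de_passe and motDePasse compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in unaccented.lower() if c.isalnum())


def multilingual_risk(name: str, default: float = 0.3) -> float:
    """Return the highest score of any vocabulary term contained in the name."""
    vocab = {**BASE_RISK, **TRANSLATED_RISK}
    flat = normalise(name)
    hits = [score for keyword, score in vocab.items() if keyword in flat]
    return max(hits, default=default)


print(multilingual_risk("motDePasse"))   # 1.0
print(multilingual_risk("DB_PASSWORT"))  # 1.0
print(multilingual_risk("farbe"))        # 0.3
```

Dropping separators before matching is a deliberate choice here: multi-word terms such as mot de passe appear with underscores, hyphens, or camelCase depending on the codebase.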

The Implication for Secrets Management Culture

The variable name finding has an implication beyond detection accuracy. It tells us something about where to focus prevention efforts.

If developers consistently name their credential variables correctly — if DB_PASSWORD always contains a database password and checksum never does — then the signal for detection is strong. The corollary is that the naming is correct precisely because the developer knows the value is sensitive.

This means secrets in code are not primarily a knowledge problem. Developers who commit secrets usually know they're committing secrets. The problem is friction — the path of least resistance at the moment of writing is to hardcode the value and deal with it later.

The most effective prevention isn't education about what a secret is. It's reducing the friction of not hardcoding secrets in the first place. Pre-commit hooks that block commits before they reach the repository — rather than scanners that find secrets after the fact — address the friction problem directly.

Which is exactly what the pre-commit hook in this tool is designed to do. Catch it at the moment of commit, when the developer already knows the variable is named API_KEY and the value looks like a real key. That's the lowest-friction intervention point — a message at the moment of action rather than a report discovered in a CI pipeline hours later.
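To make the intervention point concrete, here is a rough sketch of the check a pre-commit hook could run over staged changes. It is not the tool's actual hook: the function names, the regex, and the keyword subset are assumptions, and it only looks at variable names on added lines rather than running the full classifier.

```python
import re
import sys

# Matches added diff lines of the form: +NAME = "value"
ASSIGNMENT = re.compile(
    r"""^\+\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*["']([^"']+)["']"""
)

# Illustrative subset of high-risk keywords.
RISKY_WORDS = ("password", "passwd", "secret", "api_key", "token")


def risky_added_lines(diff_text: str) -> list[str]:
    """Return variable names from added lines that look like credentials."""
    findings = []
    for line in diff_text.splitlines():
        if line.startswith("+++"):  # file header, not an added line
            continue
        match = ASSIGNMENT.match(line)
        if match and any(word in match.group(1).lower() for word in RISKY_WORDS):
            findings.append(match.group(1))
    return findings


def hook_exit_code(diff_text: str) -> int:
    """Print a message per finding and return 1 to block the commit, else 0."""
    findings = risky_added_lines(diff_text)
    for name in findings:
        print(f"possible hardcoded secret in variable {name}", file=sys.stderr)
    return 1 if findings else 0
```

A hook script would feed this the output of `git diff --cached` and exit with the returned code; a non-zero exit from a pre-commit hook aborts the commit.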

Improving the Key Name Feature

The current implementation is a static keyword vocabulary with manually assigned scores. It works well but has obvious limitations — it doesn't learn, it doesn't generalise to novel terms, and it requires manual updates when new credential-adjacent terminology emerges.

A more sophisticated approach would train a small embedding model on variable names from open source code, clustering semantically similar names and learning the association between name semantics and credential presence. Something like word2vec trained on a corpus of variable names from public repositories would generalise to svc_acct_passwd or oauth2_bearer_tkn without requiring explicit vocabulary entries.

That's a meaningful improvement but a substantial increase in complexity. The static vocabulary approach handles the vast majority of real-world cases well enough that the engineering investment in embeddings hasn't been justified yet.

The right trigger for that investment would be systematic measurement of false negatives — cases where the classifier misses real secrets. If those misses cluster around variable naming patterns that aren't in the current vocabulary, the embeddings approach becomes worth building.

What This Tells Us About Security Signal in General

The variable name finding is an instance of a broader principle that I keep encountering in application security work: the most useful signals are often not in the content itself, but in the metadata around the content.

The string "d8e8fca2dc0f896fd7cb4cb0031ba249" carries no information about whether it's sensitive. The variable name password carries almost complete information. The content is identical; the context is everything.

This same principle appears elsewhere in AppSec:

HTTP request bodies are harder to classify as malicious than request paths, because paths carry semantic intent
Log anomaly detection is more effective on log metadata (frequency, source, timing) than log content
Phishing detection is more accurate on sender domain patterns than email body content

The implication for building security tools is consistent: don't just look at the data. Look at what surrounds the data. Look at what the developer named it, where it came from, when it appeared, what called it.

Context is signal. Often it's more signal than the content itself.

The full key name risk scoring implementation is in secrets_detector/features.py at github.com/pgmpofu/secrets-detector.

Next up: why I chose Random Forest over deep learning for this problem — the full engineering tradeoffs argument, including interpretability, model size, training speed, and what you give up by not going deep.

The 26-Dimensional Feature Vector: How a Machine Learns to Recognise a Secret

Patience Mpofu — Thu, 14 May 2026 02:23:02 +0000

When my secrets detector evaluates a candidate string, it doesn't see code.

It sees a vector of 26 numbers.

That vector is the bridge between human intuition — "this looks like a secret" — and machine classification. Every insight a security engineer uses when reading code to spot exposed credentials has been translated into a numerical feature that the Random Forest classifier can reason about.

This article is a complete walkthrough of those 26 features: what each one measures, why it matters, what it catches, and what it misses. By the end, you'll understand exactly what the model sees when it evaluates any candidate value — and why the combination of features catches things that no single signal could.

How Feature Extraction Works

Before the classifier sees anything, every candidate string goes through a feature extraction pipeline in features.py. The pipeline takes two inputs: the string value itself, and the name of the variable holding it.

import math

import numpy as np

def extract_features(value: str, key_name: str) -> np.ndarray:
    features = []

    # Entropy features
    features.append(shannon_entropy(value))
    features.append(math.log(len(value) + 1))
    features.append(repetition_ratio(value))
    features.append(longest_run_normalized(value))

    # Character distribution features (8 features)
    features.extend(character_ratios(value))

    # Key name context
    features.append(key_name_risk_score(key_name))

    # Pattern match flags (16 features)
    features.extend(pattern_match_flags(value))

    return np.array(features)

The output is a fixed-length array of 26 floating point numbers. The classifier never sees the original string — only this vector. That's both a strength (the model generalises across different string formats) and a limitation (some context that a human would use is deliberately excluded).

Let me walk through each group.

Group 1: Entropy Features (4 features)

These four features capture the statistical "randomness" of the string — the property that real secrets share with random data.

Feature 1: Shannon Entropy

Shannon entropy measures the unpredictability of the character sequence. For a string of length n with character frequencies p_i:

H = -Σ p_i × log₂(p_i)

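Entropy here is measured in bits per character, and it is capped by both the alphabet and the string length: a long random alphanumeric string approaches log₂(62) ≈ 5.95 bits, but a 32-character key cannot exceed log₂(32) = 5 bits. Strings of English words typically sit around 3.5–4.5 bits. Cryptographically generated secrets cluster at the high end of what their length allows.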

# High entropy — likely a secret or hash
"sk-proj-abc123XYZ789..." → entropy: 5.82

# Low entropy: likely a password or human-chosen value
"Winter2019!"            → entropy: 3.46

# Very low entropy — definitely not a secret
"aaaaaaaaaaaaa"          → entropy: 0.00

Entropy alone is a weak classifier — UUIDs, SHA-256 hashes, and base64 image data all have high entropy but are not secrets. That's why it's one of 26 features, not the only feature.
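For reference, the formula above translates into a few lines of Python. This is a minimal sketch of the standard character-frequency calculation; the function in features.py may differ in details such as how it handles empty strings.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    n = len(value)
    counts = Counter(value)
    # Sum of p * log2(1 / p), which equals -sum(p * log2(p)).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(shannon_entropy("aaaaaaaaaaaaa"))          # 0.0
print(round(shannon_entropy("Winter2019!"), 2))  # 3.46
```

The second result shows the length cap in action: "Winter2019!" has 11 distinct characters, so its entropy is exactly log₂(11) ≈ 3.46, the maximum any 11-character string can reach.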

Feature 2: Log-Scaled Length

Raw string length would give too much weight to very long strings. Log-scaling (math.log(len(value) + 1)) compresses the range so that the difference between a 32-character key and a 64-character key has roughly the same weight as the difference between a 4-character and 8-character string.

Secrets tend to fall in predictable length ranges: AWS access keys are 20 characters, GitHub PATs are 40, JWT tokens are variable but typically 200+. Length contributes signal, but it's a soft signal — there's no length that definitively indicates "secret."
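A quick numerical check of that compression, using the same `math.log(len(value) + 1)` transform. The helper name `log_length` is just for illustration.

```python
import math


def log_length(value: str) -> float:
    """Natural log of (length + 1), the transform used for the length feature."""
    return math.log(len(value) + 1)


# Raw length gaps: 64 - 32 = 32 characters versus 8 - 4 = 4 characters.
long_gap = log_length("x" * 64) - log_length("x" * 32)
short_gap = log_length("x" * 8) - log_length("x" * 4)

print(round(long_gap, 2), round(short_gap, 2))  # 0.68 0.59
```

An eightfold difference in raw length gap shrinks to two nearly equal steps after scaling, which is the behaviour described above.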

Feature 3: Repetition Ratio

repetition_ratio = len(set(value)) / len(value)

This is the proportion of unique characters to total characters, so a higher value means less repetition. A random 32-character alphanumeric string typically contains around 25 distinct characters (ratio ≈ 0.8), and shorter random strings sit closer to 1.0. A string like "aababcababc" has only 3 distinct characters in 11 (ratio ≈ 0.27).

A low ratio (heavy repetition) is a strong signal that a string is not a secret, because real secrets don't repeat characters predictably. A high ratio is a necessary but not sufficient condition for being a secret.

Feature 4: Longest Run (Normalised)

The length of the longest consecutive run of the same character, divided by string length:

import itertools

longest_run = max(len(list(g)) for _, g in itertools.groupby(value))
longest_run_normalized = longest_run / len(value)

"aaabbbccc" has a longest run of 3 out of 9 characters — normalised run of 0.33.
"sk-abc123XYZ789def456" has a longest run of 1 out of 21 characters — normalised run of 0.05.

Long runs of repeated characters are a strong signal of non-random data. A cryptographically generated secret is extremely unlikely to contain one. Human-readable strings often will.

Group 2: Character Distribution Features (8 features)

These eight features describe the composition of the string across character classes. Together they capture the "shape" of the character set that a human eye uses to distinguish secrets from benign strings.

Feature	What It Measures
`uppercase_ratio`	Proportion of A–Z characters
`lowercase_ratio`	Proportion of a–z characters
`digit_ratio`	Proportion of 0–9 characters
`special_ratio`	Proportion of non-alphanumeric characters
`hex_ratio`	Proportion of valid hexadecimal characters (0–9, a–f, A–F)
`base64_ratio`	Proportion of base64-safe characters (alphanumeric + /+=)
`printable_ratio`	Proportion of printable ASCII characters
`whitespace_ratio`	Proportion of whitespace characters

Why these specific ratios matter:

hex_ratio is particularly useful for distinguishing hash values from secrets. A SHA-256 hash has a hex_ratio of 1.0, since every character is a valid hex digit. An AWS access key sits much lower, around 0.4 to 0.45, because most uppercase letters fall outside A–F. A JWT is lower still, at roughly 0.3: it is base64url-encoded, so about two thirds of its alphabet lies outside the hex set.

special_ratio catches secrets that include special characters — a strong signal for human-chosen passwords ("P@ssw0rd!") versus machine-generated tokens (which typically avoid special characters for compatibility reasons).

base64_ratio is the mirror of hex_ratio for base64-encoded content. Base64-encoded image data has a base64_ratio near 1.0. An API key that uses only alphanumeric characters has a high base64_ratio too — which is where the key name and other features need to disambiguate.

The classifier learns the interaction between these ratios. A string with high entropy, high hex_ratio, and a key name that scores 0.0 is almost certainly a hash. A string with high entropy, mixed character ratios, and a key name that scores 1.0 is almost certainly a secret.
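The eight ratios in the table can be computed in one pass per character class. The sketch below follows the table's definitions as written; treating "printable ASCII" as code points 32 to 126 and counting whitespace as non-alphanumeric are my assumptions, and the real `character_ratios` may draw those lines differently.

```python
import string

HEX_CHARS = set(string.hexdigits)  # 0-9, a-f, A-F
BASE64_CHARS = set(string.ascii_letters + string.digits + "/+=")


def character_ratios(value: str) -> list[float]:
    """Return the eight character-class ratios in the order of the table."""
    n = len(value)
    if n == 0:
        return [0.0] * 8

    def ratio(predicate) -> float:
        return sum(1 for c in value if predicate(c)) / n

    return [
        ratio(lambda c: "A" <= c <= "Z"),                       # uppercase_ratio
        ratio(lambda c: "a" <= c <= "z"),                       # lowercase_ratio
        ratio(lambda c: "0" <= c <= "9"),                       # digit_ratio
        ratio(lambda c: not (c.isascii() and c.isalnum())),     # special_ratio
        ratio(lambda c: c in HEX_CHARS),                        # hex_ratio
        ratio(lambda c: c in BASE64_CHARS),                     # base64_ratio
        ratio(lambda c: " " <= c <= "~"),                       # printable_ratio
        ratio(lambda c: c.isspace()),                           # whitespace_ratio
    ]


uuid_ratios = character_ratios("550e8400-e29b-41d4-a716-446655440000")
print(round(uuid_ratios[4], 2))  # 0.89, hex_ratio: 32 hex characters out of 36
```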

Group 3: Key Name Risk Score (1 feature)

This is the single most important feature in the model — feature importance 0.28, more than the entropy and character features combined.

KEY_NAME_RISK = {
    # Score 1.0 — unambiguously sensitive
    "password": 1.0, "passwd": 1.0, "secret": 1.0,
    "private_key": 1.0, "privkey": 1.0,

    # Score 0.9 — very likely sensitive  
    "api_key": 0.9, "apikey": 0.9, "token": 0.9,
    "credential": 0.9, "auth_token": 0.9,

    # Score 0.85 — likely sensitive
    "access_key": 0.85, "client_secret": 0.85,
    "bearer": 0.85, "authorization": 0.85,

    # Score 0.7 — possibly sensitive
    "key": 0.7, "auth": 0.7, "login": 0.7,

    # Score 0.2 — unlikely sensitive
    "config": 0.2, "setting": 0.2, "value": 0.1,

    # Score 0.0 — not sensitive
    "checksum": 0.0, "hash": 0.0, "version": 0.0,
    "id": 0.0, "uuid": 0.0, "color": 0.0
}

def key_name_risk_score(key_name: str) -> float:
    normalised = key_name.lower().strip("_")
    for keyword, score in KEY_NAME_RISK.items():
        if keyword in normalised:
            return score
    return 0.3  # Unknown key names get a moderate default

The scoring function does substring matching, so DB_PASSWORD, database_password, and user_passwd all score 1.0. API_KEY_V2 and service_api_key both score 0.9.

Unknown variable names — ones that don't contain any recognised keyword — get a default score of 0.3. This is deliberately moderate: an unknown variable name is mild evidence that the string might not be sensitive (if it were, it would likely have a recognisable name), but it's not strong evidence either way.
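One property of first-match substring lookup is worth knowing: dictionary order decides which keyword wins. With the vocabulary shown above, client_secret contains secret, which is listed earlier, so the lookup returns 1.0 and the 0.85 entry is never reached. A possible variant, sketched here as an alternative and not as the tool's behaviour, prefers the longest matching keyword. The function name and the reduced vocabulary are assumptions for illustration.

```python
# Reduced vocabulary for illustration; scores are taken from the article's table.
KEY_NAME_RISK = {
    "secret": 1.0,
    "client_secret": 0.85,
    "token": 0.9,
    "key": 0.7,
    "id": 0.0,
}


def key_name_risk_longest(key_name: str, vocab=KEY_NAME_RISK, default: float = 0.3) -> float:
    """Score a key name by its longest (most specific) matching keyword."""
    normalised = key_name.lower().strip("_")
    matches = [keyword for keyword in vocab if keyword in normalised]
    if not matches:
        return default
    return vocab[max(matches, key=len)]


print(key_name_risk_longest("CLIENT_SECRET"))  # 0.85
print(key_name_risk_longest("app_secret"))     # 1.0
print(key_name_risk_longest("colour"))         # 0.3
```

Whether the more specific score is the better one is a judgment call; the point is only that the outcome no longer depends on the order in which entries were typed into the dictionary.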

The impact of this feature on classification decisions is substantial:

# Same value, wildly different classifications
password = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # → flagged at 94% confidence
checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # → passed at 8% confidence

Without the key name feature, these two lines are identical to the classifier. With it, they're completely distinguishable.

Group 4: Pattern Match Flags (16 features)

These are binary features — 0 or 1 — indicating whether the value matches any of 16 known secret format patterns.

Flag	Pattern
`pattern_aws_access_key`	`AKIA[0-9A-Z]{16}`
`pattern_github_pat`	`gh[pousr]_[A-Za-z0-9]{36}`
`pattern_github_fine_grained`	`github_pat_[A-Za-z0-9]{82}`
`pattern_jwt`	`eyJ[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+\.[A-Za-z0-9-_]+`
`pattern_openai_key`	`sk-[A-Za-z0-9]{48}`
`pattern_slack_token`	`xox[baprs]-[A-Za-z0-9-]+`
`pattern_stripe_secret`	`sk_live_[A-Za-z0-9]{24}`
`pattern_stripe_publishable`	`pk_live_[A-Za-z0-9]{24}`
`pattern_google_api`	`AIza[0-9A-Za-z-_]{35}`
`pattern_heroku_api`	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
`pattern_private_key_header`	`-----BEGIN (RSA\
`pattern_db_connection`	`(postgresql\
`pattern_basic_auth`	`[A-Za-z0-9+/]{20,}={0,2}` (base64 basic auth)
`pattern_bearer_token`	`Bearer [A-Za-z0-9-._~+/]+=*`
`pattern_hex_key_32`	`[0-9a-f]{32}` (32-char hex — common key length)
`pattern_hex_key_64`	`[0-9a-f]{64}` (64-char hex — SHA-256 length)

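When one of the service-specific flags fires, the classifier has strong prior evidence that the value is a known secret format. A value that matches pattern_aws_access_key will typically be classified as a secret at very high confidence, almost regardless of the other features. The broader patterns (UUID-shaped Heroku keys, generic base64, and the two hex-length flags) are weaker evidence and rely on the remaining features for disambiguation.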

The last two flags — pattern_hex_key_32 and pattern_hex_key_64 — deserve special mention. These match the lengths of common cryptographic keys but also match MD5 and SHA-256 hashes, which are not secrets. This is where the key name feature does critical disambiguation work: a 32-character hex string with key name checksum has pattern_hex_key_32 = 1 but key_name_risk = 0.0, and the classifier correctly passes it. The same string with key name encryption_key gets flagged.
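A compact sketch of how such flags can be produced, using a subset of the regexes from the table. Two things here are assumptions rather than confirmed behaviour of `pattern_match_flags`: the dict return type, and the use of `fullmatch`, which I chose so that a 64-character digest does not also set the 32-character flag.

```python
import re

# Subset of the patterns listed in the table above.
PATTERNS = {
    "pattern_aws_access_key": r"AKIA[0-9A-Z]{16}",
    "pattern_github_pat": r"gh[pousr]_[A-Za-z0-9]{36}",
    "pattern_stripe_secret": r"sk_live_[A-Za-z0-9]{24}",
    "pattern_heroku_api": r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    "pattern_hex_key_32": r"[0-9a-f]{32}",
    "pattern_hex_key_64": r"[0-9a-f]{64}",
}
COMPILED = {name: re.compile(regex) for name, regex in PATTERNS.items()}


def pattern_match_flags(value: str) -> dict[str, int]:
    """Return a 0/1 flag per pattern; the whole value must match (assumption)."""
    return {
        name: int(regex.fullmatch(value) is not None)
        for name, regex in COMPILED.items()
    }


flags = pattern_match_flags("d8e8fca2dc0f896fd7cb4cb0031ba249")
print(flags["pattern_hex_key_32"], flags["pattern_hex_key_64"])  # 1 0
```

With `search` instead of `fullmatch`, the flags would also fire on secrets embedded in longer strings, at the cost of more overlap between patterns.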

How the Features Interact: Three Case Studies

Understanding individual features is useful. Understanding how they interact is where the real insight lives.

Case Study 1: The Human-Chosen Password

SMTP_PASSWORD = "Winter2019!"

Feature	Value	Signal
`shannon_entropy`	3.4	Weak — below threshold for "looks random"
`repetition_ratio`	1.0	Neutral
`special_ratio`	0.09	Slightly elevated
`key_name_risk`	1.0	Very strong — "password" scores maximum
`pattern_*` flags	All 0	No known format match
Classification	Secret — 91% confidence

The entropy would cause a pure entropy scanner to miss this. The key name saves it.

Case Study 2: The UUID False Positive

session_correlation_id = "550e8400-e29b-41d4-a716-446655440000"

Feature	Value	Signal
`shannon_entropy`	3.4	Moderate: looks somewhat random
`hex_ratio`	0.89	Very high — almost all hex characters
`special_ratio`	0.11	Low: only hyphens
`key_name_risk`	0.0	Minimal — "id" scores 0.0
`pattern_heroku_api`	1	Fires — Heroku API keys are UUID-format
Classification	Benign — 23% confidence

The pattern flag fires (Heroku API keys look like UUIDs), but the key name score is 0.0 and the classifier correctly suppresses the finding. A regex scanner using only the Heroku pattern would flag this. The ML classifier does not.

Case Study 3: The Ambiguous High-Entropy String

encryption_key = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6"

Feature	Value	Signal
`shannon_entropy`	3.9	Moderate
`hex_ratio`	1.0	Maximum — pure hex
`key_name_risk`	0.9	Very high — "key" scores 0.7, "encryption" modifier pushes it higher
`pattern_hex_key_32`	1	Fires — 32-char hex matches
`repetition_ratio`	0.50	Low: repeating pattern visible
Classification	Secret — 78% confidence

The repetition ratio is low (the value has a repeating a1b2c3... pattern that reduces uniqueness), which pulls the confidence down from what it would be for a truly random key. But the key name and pattern flag are strong enough to push it above the reporting threshold. The finding would be reported at MEDIUM confidence — a prompt for human review rather than a guaranteed finding.

What the Feature Vector Cannot See

Intellectual honesty requires being clear about the limits.

Cross-variable context. The feature vector sees one value at a time. It can't see that key = config["encryption_key"] is loading the key from a config object rather than hardcoding it. A human engineer would immediately see that's not a hardcoded secret; the feature vector has no way to represent that.

File context. The feature vector doesn't know it's in a test file, a mock object, or a README code example. TEST_API_KEY = "fake-key-for-testing" might have a high key name risk score despite being explicitly for testing. The inline suppression annotation (# secrets-ignore) is the escape hatch for this case.

Semantic intent. version = "1.0.0" will correctly score a low key name risk. But release_token = "1.0.0-beta" might score higher because "token" is a high-risk keyword — even though in context this is clearly a version string. The feature vector sees the word "token" in the variable name without understanding that it's used semantically differently here.

These limitations are why the classifier is a signal generator rather than an oracle. Every finding above the confidence threshold warrants a human review. The classifier reduces the review burden dramatically — from "look at every high-entropy string in the codebase" to "look at these 20 high-confidence findings" — but it doesn't eliminate the need for human judgment.
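The inline suppression annotation mentioned under file context is simple to honour before any features are extracted. This is a naive sketch under stated assumptions: the helper names are hypothetical, and the check is purely textual, so it would also treat the marker as active if it appeared inside a string literal.

```python
import re

# Matches the "# secrets-ignore" annotation anywhere on a line.
SUPPRESS = re.compile(r"#\s*secrets-ignore\b")


def is_suppressed(line: str) -> bool:
    """True when the line carries the inline suppression annotation."""
    return bool(SUPPRESS.search(line))


def unsuppressed_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, text) pairs that should still be scanned."""
    return [
        (number, line)
        for number, line in enumerate(source.splitlines(), start=1)
        if not is_suppressed(line)
    ]


sample = 'TEST_API_KEY = "fake-key-for-testing"  # secrets-ignore\nDB_PASSWORD = "Winter2019!"'
print(unsuppressed_lines(sample))  # [(2, 'DB_PASSWORD = "Winter2019!"')]
```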

Retraining on Your Own Data

The feature vector approach makes retraining practical in a way that deep learning approaches don't. Because the features are hand-engineered and interpretable, adding new training samples has predictable effects.

If your codebase has a pattern of false positives — say, your internal logging library uses variable names like log_token that consistently score high key name risk despite being benign — you can add synthetic examples of that pattern to the benign training set and retrain in seconds:

# Add your custom generators to trainer.py, then:
python main.py train --samples 5000

The retrained model immediately incorporates your organisation-specific context. That's a capability that's practically unavailable with regex-based tools (you'd have to modify pattern files and accept increased miss rates) and theoretically possible but operationally impractical with deep learning (retraining takes hours and requires ML expertise).

The complete feature extraction code is in secrets_detector/features.py at github.com/pgmpofu/secrets-detector.

Next up: why the variable name is the single most important feature in secrets detection — and what that tells us about how developers accidentally expose credentials.

Why I Built an ML-Powered Secrets Detector Instead of Just Using Regex

Patience Mpofu — Sun, 10 May 2026 15:40:08 +0000

Most secrets scanners work the same way.

They maintain a list of regex patterns — one for AWS access keys, one for GitHub personal access tokens, one for Stripe keys, one for JWT headers — and they scan your code looking for matches. When a pattern fires, they report a finding. When it doesn't, they stay silent.

This works well for secrets that have distinctive, consistent formats. An AWS access key always starts with AKIA followed by 16 uppercase alphanumeric characters. A GitHub PAT has a recognisable prefix. A private key has a PEM header. Regex catches these reliably.

But it's only part of the problem. And the part it misses is exactly where real breaches happen.

This is the story of why I built a machine learning secrets detector — what the existing approaches get wrong, what ML adds, and what the combined system catches that neither approach catches alone.

The Two Failure Modes of Existing Tools

Before building anything, I spent time understanding where the leading tools fail. TruffleHog, detect-secrets, and Gitleaks are all excellent tools. They're also all vulnerable to the same two failure modes in different proportions.

Failure Mode 1: The Regex Gap

Regex-only scanners miss secrets that don't match a known pattern.

The most dangerous class of missed secrets is the generic hardcoded credential — a password, database URL, or internal API key that doesn't follow any publicly documented format because it was generated internally.

# No regex pattern catches this reliably
DB_PASSWORD = "Tr0ub4dor&3"
INTERNAL_API_KEY = "prod-backend-service-key-2019"
SMTP_PASSWORD = "companyname_mail_2018!"

These are real secrets. They're low entropy by the standards of a cryptographically random key. They don't match any known service's key format. A regex scanner walks past them silently.

This is not a theoretical concern. A significant proportion of credential exposures in real breaches involve exactly this type of secret — human-chosen passwords and internal tokens that were never designed to be detected by pattern matching.

Failure Mode 2: The Entropy False Positive Flood

Some tools compensate by flagging anything with high Shannon entropy — the reasoning being that secrets are random, and random strings have high entropy.

This is directionally correct and practically unusable in many codebases.

High-entropy strings that are not secrets appear constantly in normal code:

# UUID — high entropy, not a secret
session_id = "550e8400-e29b-41d4-a716-446655440000"

# MD5 hash (32 hex characters), high entropy, not a secret
expected_checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"

# Base64-encoded image data — extremely high entropy, not a secret
avatar_placeholder = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ..."

# Package integrity hash — high entropy, not a secret
integrity = "sha512-abc123def456..."

A pure entropy scanner flags all of these. In a Node.js project with a package-lock.json, an entropy scanner generates thousands of findings from integrity hashes alone. Engineers learn to ignore it within a week.

What ML Adds: Context-Aware Classification

The insight that drove the ML approach is that whether a string is a secret depends on context, not just the string itself.

d8e8fca2dc0f896fd7cb4cb0031ba249 is either a secret or a benign hash depending on what variable contains it. A human security engineer can tell these apart instantly by reading the surrounding code. A regex scanner and an entropy scanner cannot.

The question I asked was: can I teach a classifier to do what a human engineer does — look at the full context of a string and make a judgment about whether it's a secret?

The answer turned out to be yes, with a 26-dimensional feature vector that captures what a human eye actually processes when making that judgment.

Here's the comparison that drove the design:

Approach	Catches High-Entropy Secrets	Catches Low-Entropy Secrets	False Positive Rate
Regex only	Yes (known formats)	No	Low
Entropy only	Yes	No	Very high
ML classifier	Yes	Yes	Significantly reduced

The ML classifier doesn't replace regex — it adds a second layer. Known-format secrets (AWS keys, GitHub PATs, JWTs) are still caught by pattern flags that are part of the feature vector. Generic hardcoded credentials that no regex would catch are caught by the combination of entropy, character distribution, and — most importantly — the variable name context.

The Feature That Changed Everything: Key Name Risk

When I looked at feature importances after training the initial model, one feature stood above all others: key_name_risk, with an importance score of 0.28 out of 1.0.

That's the variable name. Not the value — the name of the variable holding the value.

This makes intuitive sense once you see it. These two lines of code contain the same string value:

checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"
password = "d8e8fca2dc0f896fd7cb4cb0031ba249"

A human engineer looks at these and immediately knows: the first is almost certainly a hash, the second is almost certainly a secret. The string itself carries no information about its purpose. The variable name carries everything.

I built a risk scoring function that assigns numerical scores to variable names based on their semantic association with sensitive data:

password, passwd, secret, private_key → score 1.0
api_key, token, credential, auth_token → score 0.9
access_key, client_secret, bearer → score 0.85
config, setting, value → score 0.1–0.2
checksum, hash, version, id → score 0.0

The classifier learns to combine this score with the entropy and character distribution features to make decisions that mirror what a human reviewer would make.

The result: password = "abc123" gets flagged despite low entropy. checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249" gets passed despite high entropy. Neither outcome is achievable with regex or entropy alone.

Why Random Forest, Not a Neural Network

When people hear "ML classifier," they often assume deep learning. I chose Random Forest deliberately, and it's worth explaining why.

Interpretability. A Random Forest tells you exactly why it made a decision — which features contributed how much to a particular classification. When an engineer asks "why did the scanner flag this?", I can show them the feature breakdown: high entropy (0.82), key name risk (0.95), matches JWT pattern (true). A neural network produces a probability with no explanation.

Size. The trained model is approximately 1MB as a pickle file. It ships with the tool, requires no internet connection, and adds negligible overhead to a scan. A neural network of sufficient sophistication would be orders of magnitude larger.

Training speed. The model trains on 6,000 labeled samples in seconds on a standard laptop CPU. No GPU required. This matters enormously for the retraining feature — teams can add their own training samples and retrain in their local environment without specialist infrastructure.

No overfitting on small data. With 6,000 training samples — which is small by deep learning standards — Random Forest generalises better than a neural network would. The structured feature engineering does the heavy lifting; the model itself doesn't need to be sophisticated.

The tradeoff is ceiling accuracy. A neural network operating on raw token sequences would likely achieve higher peak accuracy given sufficient data. But for a tool that needs to be deployable, explainable, and retrainable by a team without ML expertise, Random Forest is the right choice.

Synthetic Training Data: The Ethical Constraint

One early design decision shaped everything else: I would not train on real leaked secrets from public repositories.

The alternative — scraping GitHub for accidentally committed credentials and using them as positive training examples — is technically straightforward and has been done. It's also legally and ethically problematic. Those credentials belong to real people and organisations. Even if the data is technically public, using it to train a commercial tool raises questions I didn't want to answer.

Instead, I built a synthetic data generator that produces realistic examples of both secrets and benign high-entropy strings:

Secrets (label=1): Algorithmically generated AWS access keys, GitHub PAT formats, JWT structures, OpenAI key formats, Slack tokens, database connection strings, and — critically — synthetically generated "human-chosen" passwords that follow common patterns without being anyone's real password.

Benign (label=0): UUIDs, MD5 and SHA-256 hashes, version strings, base64-encoded image data fragments, color hex codes, package integrity hashes, lorem ipsum text fragments.

The synthetic approach has one significant advantage beyond ethics: I can generate unlimited training data and precisely control the class distribution. The 6,000 sample baseline can be scaled to 50,000 samples with a single command, which meaningfully improves model accuracy on edge cases.
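To give a flavour of what such a generator involves, here is a small sketch that produces four of the sample types named above. It is not the generator in trainer.py: the function name, the key names attached to each sample, the season-plus-year password template, and the tuple layout are all assumptions for illustration.

```python
import hashlib
import random
import string
import uuid


def synthetic_samples(n: int, seed: int = 0) -> list[tuple[str, str, int]]:
    """Generate (value, key_name, label) tuples; label 1 = secret, 0 = benign."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    seasons = ["Winter", "Spring", "Summer", "Autumn"]
    samples = []
    for _ in range(n):
        kind = rng.choice(["aws", "password", "uuid", "hash"])
        if kind == "aws":
            # AWS-format access key: AKIA followed by 16 uppercase alphanumerics
            body = "".join(rng.choices(string.ascii_uppercase + string.digits, k=16))
            samples.append(("AKIA" + body, "aws_access_key", 1))
        elif kind == "password":
            # Human-style password following a common season + year pattern
            value = f"{rng.choice(seasons)}{rng.randint(2015, 2025)}!"
            samples.append((value, "db_password", 1))
        elif kind == "uuid":
            value = str(uuid.UUID(int=rng.getrandbits(128), version=4))
            samples.append((value, "request_id", 0))
        else:
            # Benign SHA-256 digest of random bytes
            raw = rng.getrandbits(128).to_bytes(16, "big")
            samples.append((hashlib.sha256(raw).hexdigest(), "checksum", 0))
    return samples


for value, key_name, label in synthetic_samples(3, seed=42):
    print(label, key_name, value)
```

Because every value comes from a seeded generator, the class balance and the mix of formats can be tuned by changing the sampling weights rather than by collecting more data.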

The Three-Layer Detection Architecture

The final tool combines three detection mechanisms, each compensating for the others' weaknesses:

Layer 1 — Pattern matching flags. Sixteen binary features in the feature vector correspond to known secret formats (AWS, GitHub, JWT, OpenAI, Slack, database URLs, private key headers, and so on). These fire on known formats with near-zero false positives and form the backbone of high-confidence detections.

Layer 2 — Entropy and character analysis. Shannon entropy, character class ratios, repetition ratio, longest run of repeated characters — these features capture the statistical "shape" of a secret without requiring a specific format match. High entropy combined with a high-risk key name is a strong signal even when no pattern matches.

Layer 3 — Key name risk scoring. The variable name context that neither regex nor entropy captures. This is what allows the classifier to catch password = "simple123" despite its low entropy and lack of a recognisable format.
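To make the three layers concrete, here is a minimal sketch of how a candidate might be turned into features. It is an assumption-laden illustration: only two of the sixteen pattern flags are shown, and the `RISKY_NAME_PARTS` table and its scores are hypothetical, not the values the tool uses.

```python
import math
import re
from collections import Counter

# Layer 1: two illustrative format flags (the real vector has sixteen)
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# Layer 3: hypothetical risk scores for substrings of the variable name
RISKY_NAME_PARTS = {"password": 1.0, "pass": 1.0, "secret": 1.0, "token": 0.8, "key": 0.8}

def shannon_entropy(value: str) -> float:
    """Bits per character, based on character frequencies in the value."""
    if not value:
        return 0.0
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in Counter(value).values())

def longest_run(value: str) -> int:
    """Length of the longest run of one repeated character."""
    best = run = 0
    prev = None
    for ch in value:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best

def key_name_risk(name: str) -> float:
    lowered = name.lower()
    return max((s for part, s in RISKY_NAME_PARTS.items() if part in lowered), default=0.0)

def extract_features(name: str, value: str) -> dict:
    n = max(len(value), 1)
    features = {f"flag_{k}": int(bool(p.search(value))) for k, p in PATTERNS.items()}
    features.update({
        # Layer 2: statistical shape of the value
        "entropy": shannon_entropy(value),
        "digit_ratio": sum(ch.isdigit() for ch in value) / n,
        "upper_ratio": sum(ch.isupper() for ch in value) / n,
        "longest_run": longest_run(value),
        # Layer 3: context from the variable name
        "key_name_risk": key_name_risk(name),
    })
    return features
```

The point of keeping the layers as separate features is that the classifier, not a hand-written rule, decides how much weight a risky key name carries when no format flag fires.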

A finding is reported when the classifier's confidence exceeds a configurable threshold (default: 0.7). Findings include the confidence score, the matched pattern if any, and — for CI/CD integration — an exit code that can gate builds.
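The reporting rule and the build gate are simple enough to sketch directly. The `Finding` fields and function names below are illustrative assumptions; only the 0.7 default and the idea of a non-zero exit code come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Finding:
    file: str
    line: int
    confidence: float                      # classifier probability for "secret"
    matched_pattern: Optional[str] = None  # set when a known format fired

def reportable(candidates: List[Finding], threshold: float = 0.7) -> List[Finding]:
    """Keep only candidates whose confidence exceeds the threshold."""
    return [c for c in candidates if c.confidence > threshold]

def exit_code(findings: List[Finding]) -> int:
    """Non-zero when anything is reported, so a CI step can fail the build."""
    return 1 if findings else 0
```

In a command-line entry point the result would typically be passed to `sys.exit`, which is what lets a pre-commit hook or pipeline step block on it.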

What This Actually Catches

I ran the tool against a collection of test cases designed to stress each approach. Results that illustrate the gap:

Caught by all approaches: AWS_KEY = "AKIAIOSFODNN7EXAMPLE" — known format, high entropy, high-risk key name. Every tool gets this.

Caught only by ML: DB_PASS = "Winter2019!" — low entropy, no known format, but the key name DB_PASS scores 1.0 and the classifier flags it at 89% confidence. Regex misses it. Entropy misses it.

False positive in entropy tools, not in ML: expected_hash = "d8e8fca2dc0f896fd7cb4cb0031ba249" — high entropy, but key name scores 0.0 and the ML classifier correctly passes it. A pure entropy scanner flags it; the ML classifier does not.

False positive in regex tools, not in ML: An internal test file contains TEST_TOKEN = "fake-token-for-testing", annotated with # secrets-ignore. The suppression annotation is respected, and even without it the low-entropy value combined with the test-file context (another feature) keeps the confidence below threshold.

Where This Fits in a Security Programme

A secrets detector — even an ML-powered one — is one layer of a defence-in-depth approach, not a complete solution.

It catches secrets at the point of scanning. It doesn't prevent secrets from being created in the first place (that's developer education and code review). It doesn't rotate compromised credentials (that's incident response). It doesn't enforce secrets management policies (that's your secrets manager — Vault, AWS Secrets Manager, Azure Key Vault).

What it does well: systematically surface secret exposure across a codebase and git history, prevent new secrets from reaching the repository via pre-commit hooks, and provide a measurable baseline for "how many secret exposures exist in our codebase right now."

That baseline matters more than most teams realise — you can't improve what you can't measure.

The full source, including the feature extractor, trainer, and pre-commit hook, is at github.com/pgmpofu/secrets-detector.

Next up: a deep dive into the 26-dimensional feature vector — exactly what the model sees when it evaluates a candidate secret, and how each feature contributes to the final decision.

What Building a SAST Tool Taught Me About AppSec That 13 Years of Software Engineering Didn't

Patience Mpofu — Sat, 09 May 2026 23:16:23 +0000

I've been writing software professionally since 2011.

Java, C#, Kotlin, Node.js. Enterprise backends, microservices, APIs, data pipelines. I've shipped production code that millions of people have used without knowing it. I've led teams, reviewed architectures, mentored junior engineers, and done all the things that accumulate into what people call "senior software engineer."

And yet, when I decided to transition into application security, I realised I had significant blind spots — not about how software works, but about how software fails. Specifically, how it fails in ways that attackers can exploit.

This is the final article in a series about building a SAST scanner from scratch, embedding it in CI/CD pipelines, writing custom detection rules, and managing false positives. But it's really about what that whole process taught me about application security as a discipline — and what I wish I'd understood earlier.

I Knew How to Write Secure Code. I Didn't Know Why It Was Secure.

Here's an embarrassing admission: I've been using parameterised queries for SQL for at least a decade. I knew you were supposed to use them. I used them every time. I would have told you confidently that they prevent SQL injection.

But if you'd asked me, before I started studying AppSec seriously, to explain why they prevent SQL injection — the actual mechanism — I would have given you a hand-wavy answer about "the database handling it separately."

Building the SQL injection detection rule forced me to get precise. I had to understand exactly what makes "SELECT * FROM users WHERE id = " + userId dangerous, what makes SELECT * FROM users WHERE id = ? with a bound parameter safe, and why the difference matters at the level of how the database parses and executes the statement.

The answer — that parameterised queries send the query structure and the data in separate messages, so the database never attempts to parse the data as SQL syntax — is not complicated. But I didn't actually know it at that level of precision until I had to write a rule that distinguishes between the two patterns.
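The difference is easy to see in a runnable sketch. This uses Python's built-in sqlite3 module with an in-memory database purely as an illustration. SQLite is embedded, so there are no network messages involved, but the principle is the same: the statement is compiled first and the value is bound afterwards, so the value is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob")],
)

user_id = "0 OR 1=1"  # attacker-controlled input

# Vulnerable: the input is spliced into the SQL text, so "OR 1=1"
# becomes part of the query structure and matches every row.
unsafe_rows = conn.execute(
    "SELECT * FROM users WHERE id = " + user_id
).fetchall()

# Safe: the query structure is fixed; the input is bound as a single value
# and compared against id as data, so nothing matches.
safe_rows = conn.execute(
    "SELECT * FROM users WHERE id = ?", (user_id,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

A detection rule has to tell these two call shapes apart, which is exactly the distinction between string concatenation into the query and a placeholder with a separately supplied argument.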

This was a theme throughout the project. I knew the what of secure coding from years of following conventions and best practices. Building detection rules forced me to learn the why — the actual attack mechanics that the conventions are defending against.

The lesson: Knowing the secure pattern is not the same as understanding the vulnerability. For a software engineer, the secure pattern is enough to write safe code. For an AppSec engineer, you need to understand the attack, because your job is to find it when someone else didn't write the safe pattern.

Security Is an Adversarial Discipline

Software engineering is largely a collaborative discipline. You're building something. The goal is for it to work. Your mental model of the system is oriented around the happy path — the flow where inputs are valid, networks are reliable, and users do what you expect.

AppSec is adversarial. The mental shift required is genuinely disorienting at first.

When I was building the JWT algorithm none rule, I had to think like someone who wants to forge authentication tokens. Not because I want to do that, but because unless I understand exactly how the attack works — what the attacker controls, what assumptions the vulnerable code makes, what the exploit chain looks like — I can't write a rule that reliably detects it.
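For readers who have not seen the attack, here is a self-contained sketch using only the standard library. The `naive_verify` function is a deliberately vulnerable, hypothetical verifier written for this example; it is not taken from any real library. It shows the core mistake: letting the token's own header choose how the token is checked.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

def sign_hs256(signing_input: str, secret: bytes) -> str:
    return b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())

def make_token(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{sign_hs256(f'{header}.{payload}', secret)}"

def forge_none_token(claims: dict) -> str:
    # Attacker needs no secret: declare alg "none" and leave the signature empty
    header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."

def naive_verify(token: str, secret: bytes) -> dict:
    header_b64, payload_b64, signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header["alg"] == "none":
        # BUG: trusts the attacker-supplied header and skips verification
        return json.loads(b64url_decode(payload_b64))
    expected = sign_hs256(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))

def strict_verify(token: str, secret: bytes) -> dict:
    header_b64, payload_b64, signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    if header.get("alg") != "HS256":  # the server, not the token, picks the algorithm
        raise ValueError("unexpected algorithm")
    expected = sign_hs256(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

What the attacker controls is the entire token, header included. The vulnerable code assumes the header is trustworthy metadata; the strict version treats it as untrusted input.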

This is the skill that 13 years of software engineering didn't develop: adversarial thinking. The question isn't "does this code do what it's supposed to do?" It's "how could someone make this code do something it's not supposed to do?"

The OWASP Top 10 is, at its core, a catalogue of the assumptions developers make that attackers exploit. A03 (Injection, in the 2021 numbering) exploits the assumption that input is data, not instructions. A07 (Authentication Failures) exploits the assumption that the code correctly validates identity. A02 (Cryptographic Failures) exploits the assumption that encryption means the data is protected.

Every category is a place where the developer's mental model of the system diverges from what an attacker can actually do to it. Understanding OWASP deeply means understanding those divergences — not as a checklist, but as a way of thinking.

The lesson: You can't find vulnerabilities you can't imagine. Developing adversarial thinking — the habit of asking "how could this go wrong for someone who wants it to go wrong" — is the most important cognitive shift in the AppSec transition.

Tools Are Amplifiers, Not Answers

Before I built my own SAST tool, I used SAST tools. And I treated them roughly like a compiler warning: something fires, I look at it, I decide whether to fix it or ignore it.

Building one changed how I think about what a SAST tool actually is.

A SAST tool is a codified set of heuristics about what vulnerable code looks like. Those heuristics are written by humans, based on human understanding of vulnerability patterns, with human decisions about confidence levels and severity ratings. The tool doesn't know your codebase. It doesn't know your threat model. It doesn't know whether the finding it just generated is actually exploitable in your specific deployment context.

This sounds like a criticism. It isn't. It's a description of a tool's appropriate role.

When I run Snyk or Semgrep now, I engage with the results differently than I did before. I ask: what pattern is this rule trying to catch? Is that pattern present in my code for the reason the rule assumes? Does the vulnerability the rule targets actually apply in my context? What would an attacker need to control to exploit this?

Those are AppSec questions, not DevOps questions. A DevOps mindset treats SAST output as a compliance gate. An AppSec mindset treats it as a starting point for analysis.

The lesson: A SAST scanner is a signal generator, not an oracle. The value it provides is proportional to the quality of thinking applied to its output — not to the number of findings it generates or suppresses.

False Positives Taught Me About Risk Tolerance

Every time I suppressed a finding in my own scanner, I had to make a decision: is this actually safe, and how confident am I?

That turns out to be the central skill of AppSec: structured risk assessment under uncertainty.

You almost never have complete information. You can't always trace every data flow through a complex system. You can't always know whether a finding is exploitable without building a proof of concept. You have to make a judgment call about whether the risk is acceptable given what you know.

What I learned from managing false positives is that risk tolerance is not a feeling — it's a position that needs to be documented and defensible. "I suppressed this because it looked fine" is not a risk assessment. "I suppressed this because the data being processed is always from our internal configuration system and never from user input, as confirmed by tracing the call stack in lines 42–67" is a risk assessment.

The difference matters when something goes wrong. And in security, things go wrong.

The lesson: Risk assessment is a core AppSec competency, not a soft skill. Developing a structured, documented approach to risk decisions — even informal ones — is more valuable than any specific technical knowledge.

The Gap Between Writing Secure Code and Finding Insecure Code

These are related skills. They are not the same skill.

Writing secure code is a constructive activity. You know what you're building. You apply secure patterns. You follow established conventions. The feedback loop is relatively tight — if you use parameterised queries, you know you're not vulnerable to SQL injection there.

Finding insecure code is a forensic activity. You're examining code you didn't write, often without full context, looking for patterns that indicate vulnerability. The feedback loop is loose — you might flag something, triage it, determine it's a false positive, and never know whether your triage was correct.

The cognitive skills are different. Construction requires knowing the secure pattern. Detection requires knowing the vulnerable pattern and all its variations. It requires understanding which variations are genuinely dangerous and which are contextually safe. It requires maintaining a mental model of an attacker's perspective while reading code that was written from a developer's perspective.

I've spent 13 years getting good at construction. Building this scanner was the first systematic exercise I did in detection. It was harder than I expected — not technically, but cognitively. Shifting from "I'm building this thing to work" to "I'm looking for ways this thing could be exploited" is a genuine gear change.

The lesson: AppSec is not "software engineering plus security knowledge." It's a different cognitive discipline that happens to use the same raw material. Senior software engineers making this transition should expect a genuine learning curve, not just a knowledge gap.

What I'd Tell Someone Starting This Transition

If you're a software engineer moving into AppSec — or considering it — here's what I'd tell you based on this project and the broader transition.

Build something. Reading about OWASP is useful. Reading CVE writeups is useful. Neither teaches you what building a detection rule teaches you. The act of translating "this is a vulnerability" into "this is what the vulnerable code looks like in text" forces a precision of understanding that passive learning doesn't produce.

Study the attacks, not just the defences. Most of your software engineering career was spent learning defences — secure patterns, safe APIs, frameworks that handle the dangerous parts for you. AppSec requires understanding the attacks those defences are designed against. Read exploit writeups. Understand how CVEs actually work. Build your own vulnerable applications and attack them.

Get comfortable with ambiguity. Software engineering has right answers. Does this code compile? Does this test pass? Does this function return the correct value? AppSec often doesn't. Is this finding exploitable? Is this suppression justified? Is this risk acceptable? These questions frequently don't have clean answers, and developing comfort with that ambiguity is part of the transition.

Use your engineering background as a superpower, not a crutch. The thing that makes engineers valuable in AppSec is the ability to read code at scale, understand system architecture, and reason about data flows — skills most pure security professionals develop slowly. Use that. But don't assume that understanding how the code is supposed to work means you understand how it can be broken.

Write about what you're learning. This series started as a way to document my own thinking. Every article forced me to be more precise about something I thought I understood. The act of explaining something to someone else reveals the gaps in your own understanding faster than almost anything else.

Where This Goes Next

Building this scanner and writing this series was one project. The transition is ongoing.

The next project is taking an old Java service and doing something I haven't done yet in this series: running Snyk against a real dependency tree on real legacy code, remediating real CVEs, and measuring the before-and-after security posture with actual metrics.

That's a different kind of AppSec work — Software Composition Analysis rather than static analysis, dependency vulnerabilities rather than code vulnerabilities, Snyk's recommendations rather than my own rules. But the underlying skills are the same: understand the attack, assess the risk, make a defensible decision, measure the outcome.

The transition from software engineer to AppSec engineer is not a destination. It's an ongoing process of developing adversarial thinking, structured risk assessment, and the forensic discipline of finding what's broken rather than building what works.

Thirteen years in, I'm still learning. That's the right state to be in.

The full SAST tool that this series was built around is at github.com/pgmpofu/sast-tool.

If this series was useful to you — or if you're making a similar transition and want to compare notes — I'd genuinely like to hear from you. Find me here on dev.to or connect on LinkedIn.

False Positives in SAST — How I Built Suppression Into My Scanner and Why It Matters

Patience Mpofu — Sat, 09 May 2026 05:02:58 +0000

There's a failure mode that kills security tooling programmes quietly, without drama, and it's not a technical failure.

It's a trust failure.

It goes like this: a team enables a SAST scanner. The scanner fires on 200 things. Engineers triage 40 of them and discover that 25 are false positives. They fix the 15 real findings, suppress the 25 false positives, and then face another 160 findings they haven't looked at yet. Two sprints later, nobody is triaging anymore. The scanner still runs. The reports still generate. Nobody reads them. The security programme is theatre.

False positives are the mechanism by which this happens. Not because developers are lazy — because time is finite and trust is fragile. If a scanner cries wolf enough times, engineers stop listening. That's rational behaviour, not negligence.

This article is about how I thought about false positives when building my SAST tool, what I built to manage them, and why the suppression system design matters as much as the detection rules themselves.

What a False Positive Actually Costs

Before getting into solutions, it's worth being precise about the cost.

A false positive in a SAST scanner costs:

Triage time — an engineer has to read the finding, understand the rule, examine the code in context, and reach a conclusion. Even for an experienced engineer, that's 5–15 minutes per finding for anything non-trivial.
Trust capital — every false positive is a small withdrawal from the trust account between the security team and the engineering team. Trust capital is finite and slow to rebuild.
Attention budget — the more false positives exist, the less attention real findings receive. This is the most dangerous cost. Security is fundamentally an attention allocation problem. A scanner with a 40% false positive rate isn't 40% less useful. It's potentially useless, because the signal-to-noise ratio has collapsed to the point where engineers can't efficiently find real findings among the noise.

The Three Sources of False Positives

Not all false positives are the same. Understanding where they come from determines how to address them.

1. Context-Blind Pattern Matching

This is the most common source in regex-based scanners. The pattern matches the text but doesn't understand what the code is doing.

The MD5 example I've used throughout this series is the canonical case:

# False positive — MD5 for file integrity, not passwords
file_hash = hashlib.md5(file_content).hexdigest()

# True positive — MD5 for password storage
stored_password = hashlib.md5(user_password).hexdigest()

Both lines match the pattern \bmd5\s*\(. Only the second is a vulnerability. A regex scanner cannot tell them apart without understanding the semantic context — what type of data is being hashed.
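This is quick to confirm with Python's re module. The snippet below only checks the pattern quoted above against the two example lines; it is not the scanner's matching code.

```python
import re

MD5_CALL = re.compile(r"\bmd5\s*\(")

lines = [
    "file_hash = hashlib.md5(file_content).hexdigest()",        # benign
    "stored_password = hashlib.md5(user_password).hexdigest()",  # vulnerable
]

# The regex sees only the call syntax, so both lines look identical to it
matches = [bool(MD5_CALL.search(line)) for line in lines]
print(matches)
```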

2. Safe Framework Usage That Looks Dangerous

Some frameworks make inherently dangerous operations safe through abstraction. The dangerous-looking code is actually fine because the framework handles the dangerous part.

// Looks like SQL injection — it's not
// Spring Data JPA with @Query annotation handles parameterisation
@Query("SELECT u FROM User u WHERE u.email = :email")
User findByEmail(@Param("email") String email);

A naive injection rule that flags anything resembling a SQL query with a variable near it would fire here. The JPA annotation system makes this perfectly safe — but the scanner doesn't know that.

3. Test and Configuration Code

Test files are full of patterns that would be alarming in production code:

# test_auth.py
def test_jwt_none_algorithm_rejected():
    # Testing that we correctly REJECT the none algorithm
    malicious_token = jwt.encode({"user": "admin"}, "", algorithm="none")
    response = client.post("/auth", json={"token": malicious_token})
    assert response.status_code == 401  # Should be rejected

This test is doing exactly the right thing — verifying that the application rejects the none algorithm attack. But a scanner looking for algorithm="none" will flag it as AUTHN-001 without understanding that this is a negative test case.

What I Built: The Suppression System

My scanner supports two suppression mechanisms, each designed for different scenarios.

Inline Suppression Annotations

The simplest mechanism: a comment on the same line as the finding tells the scanner to skip it.

file_hash = hashlib.md5(file_content).hexdigest()  # sast-ignore

I support two annotation formats — # sast-ignore and # nosec — because nosec is the Bandit convention and teams coming from Bandit shouldn't have to change their existing annotations.

The scanner checks for these annotations before reporting a finding. If either is present on the matched line, the finding is suppressed silently.
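A check like this can be very small. The following is a minimal sketch under the assumption that annotations are detected with a regex on the matched line; the function name and pattern are illustrative, not the scanner's actual implementation.

```python
import re

# Matches "# sast-ignore" or "# nosec", with optional space after the "#"
SUPPRESSION = re.compile(r"#\s*(sast-ignore|nosec)\b")

def is_suppressed(line: str) -> bool:
    """True when the line carries one of the two suppression annotations."""
    return bool(SUPPRESSION.search(line))
```

A regex this simple will also match the annotation text inside a string literal, which is one reason a production implementation might prefer to inspect real comment tokens instead.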

The problem with silent suppression: It's invisible. If every suppression silently disappears from the report, there's no way to audit whether suppressions are legitimate or whether engineers are using them to hide real findings.

Suppression With Justification

The better pattern — and what I recommend teams enforce in code review — is annotating why the suppression is valid:

# MD5 used for file integrity checking only, not credential storage
# Tracked in SEC-REVIEW-2024-041 — confirmed non-sensitive context
file_hash = hashlib.md5(file_content).hexdigest()  # sast-ignore

The annotation still suppresses the finding, but the comment creates a paper trail. When a security audit happens — and it will — every suppression has a documented rationale that a reviewer can evaluate. "We reviewed this and it's fine because X" is defensible. A bare # sast-ignore with no context is not.

The Suppression Inventory in JSON Output

Here's a design decision I'm particularly pleased with: suppressed findings don't disappear from the JSON report. They appear in a separate suppressed_findings array:

{
  "findings": [
    {
      "id": "CRYPTO-002",
      "title": "SHA-1 Usage Detected",
      "severity": "HIGH",
      "file": "src/utils/crypto.py",
      "line": 47
    }
  ],
  "suppressed_findings": [
    {
      "id": "CRYPTO-001",
      "title": "Weak Hashing — MD5",
      "severity": "HIGH",
      "file": "src/utils/file_integrity.py",
      "line": 23,
      "suppression_reason": "MD5 used for file integrity only — sast-ignore"
    }
  ],
  "summary": {
    "total_findings": 1,
    "suppressed": 1,
    "by_severity": { "HIGH": 1 }
  }
}

This means:

The pipeline counts only active findings when deciding whether to fail
The full report shows both active and suppressed findings
Security reviewers can audit suppressions without looking at individual source files
Trend analysis can track suppression rates over time alongside finding rates

That last point matters for measuring programme health. If your suppression count is growing faster than your finding count, something is wrong: either your rules are too noisy, or engineers are gaming the system.
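The split between active and suppressed findings can be sketched as a small report builder. This is an illustration that reproduces the JSON shape shown above; the `build_report` name and the convention that a result is suppressed when it carries a `suppression_reason` key are assumptions for the example.

```python
import json

def build_report(results: list) -> dict:
    """Partition scan results into active and suppressed findings."""
    active = [r for r in results if "suppression_reason" not in r]
    suppressed = [r for r in results if "suppression_reason" in r]

    # Severity counts cover active findings only, since those drive the gate
    by_severity = {}
    for r in active:
        by_severity[r["severity"]] = by_severity.get(r["severity"], 0) + 1

    return {
        "findings": active,
        "suppressed_findings": suppressed,
        "summary": {
            "total_findings": len(active),
            "suppressed": len(suppressed),
            "by_severity": by_severity,
        },
    }

if __name__ == "__main__":
    demo = [
        {"id": "CRYPTO-002", "title": "SHA-1 Usage Detected", "severity": "HIGH",
         "file": "src/utils/crypto.py", "line": 47},
        {"id": "CRYPTO-001", "title": "Weak Hashing (MD5)", "severity": "HIGH",
         "file": "src/utils/file_integrity.py", "line": 23,
         "suppression_reason": "MD5 used for file integrity only"},
    ]
    print(json.dumps(build_report(demo), indent=2))
```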

Confidence Levels as Pre-Emptive Noise Reduction

The suppression system deals with false positives after they appear. Confidence levels deal with them before.

Every pattern in my rule engine declares a confidence level:

patterns:
  - regex: 'pickle\.loads?\s*\('
    confidence: HIGH     # Almost always a real finding
  - regex: 'unserialize\s*\('
    confidence: MEDIUM   # Real finding in PHP web context, benign in CLI context
  - regex: 'request\.headers\.get\(["\']Origin["\']\)'
    confidence: LOW      # Could be proper allowlist implementation

Confidence levels serve two purposes.

For engineers reading findings: Confidence communicates how much manual review a finding deserves. A HIGH confidence finding deserves immediate attention. A LOW confidence finding is a prompt to look at the code and make a judgment call. Without this signal, every finding looks equally important — which means either everything gets treated as urgent (unsustainable) or everything gets triaged with the same low attention (misses real issues).

For pipeline configuration: Teams can configure their build gate to fail only on findings above a confidence threshold:

# Fail on HIGH severity + HIGH confidence only
python main.py ./src --fail-on HIGH --min-confidence HIGH

# See everything including LOW confidence findings in audit mode
python main.py ./src --fail-on none --min-confidence LOW

This is a more nuanced gate than severity alone. A MEDIUM severity finding with HIGH confidence (this is almost certainly real, and it's moderately serious) might warrant blocking. A HIGH severity finding with LOW confidence (this is probably bad, but it might be fine) might not. The two dimensions together give you much more precise control over your signal-to-noise ratio.
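One plausible reading of how the two flags combine is sketched below. The semantics here are an assumption: a build fails when at least one finding meets both the severity floor and the confidence floor. The level names and the `should_fail` function are illustrative and not taken from the tool's source.

```python
LEVELS = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}

def should_fail(findings: list, fail_on: str, min_confidence: str) -> bool:
    """Fail when any finding is at or above both thresholds."""
    if fail_on.lower() == "none":  # audit mode: report everything, never block
        return False
    return any(
        LEVELS[f["severity"]] >= LEVELS[fail_on]
        and LEVELS[f["confidence"]] >= LEVELS[min_confidence]
        for f in findings
    )
```

Under this reading, `--fail-on MEDIUM --min-confidence HIGH` blocks on the "almost certainly real, moderately serious" case while letting a HIGH severity, LOW confidence finding through to manual review.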

The Suppression Review Process

The suppression mechanism is only as good as the governance around it. A suppression system without a review process is just a way to silence the scanner faster.

Here's the process I'd implement in a team setting:

Step 1 — Developer identifies a finding they believe is a false positive.
They don't suppress it immediately. They raise it in the PR for discussion.

Step 2 — The team reviews the claim.
Is the developer's reasoning sound? Is the code actually safe in context? Does anyone have concerns? This is a two-minute conversation in most cases, not a security committee meeting.

Step 3 — If accepted, the suppression is added with justification.
The # sast-ignore goes in with a comment explaining why. The suppression is visible in the PR diff — it can't be hidden.

Step 4 — The suppression is tracked.
In the JSON report, in a suppression registry spreadsheet, or in a dedicated Notion page — wherever works for your team. What matters is that someone periodically reviews the suppression inventory and asks: are these still valid?

Step 5 — Periodic suppression review.
Suppressions rot. Code changes. The context that made a suppression valid six months ago may no longer apply. A quarterly review of active suppressions — not of the whole codebase, just the suppression inventory — keeps the list honest.
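If the suppression inventory records when each entry was last reviewed, the quarterly check can be automated. The sketch below assumes a hypothetical inventory format with a `last_reviewed` date per entry; the scanner's JSON output as described in this article does not include that field.

```python
from datetime import date, timedelta

def stale_suppressions(inventory: list, today: date, max_age_days: int = 90) -> list:
    """Return entries whose last review is older than the review interval."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry in inventory if entry["last_reviewed"] < cutoff]

if __name__ == "__main__":
    inventory = [
        {"id": "CRYPTO-001", "file": "src/utils/file_integrity.py",
         "last_reviewed": date(2025, 1, 10)},
        {"id": "AUTHN-001", "file": "tests/test_auth.py",
         "last_reviewed": date(2025, 5, 2)},
    ]
    for entry in stale_suppressions(inventory, today=date(2025, 6, 1)):
        print(f"{entry['id']} in {entry['file']} is due for review")
```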

Tuning Rules to Reduce Systemic False Positives

When a specific rule consistently generates false positives across the codebase, the right answer isn't to suppress every instance — it's to tune the rule.

The MD5 rule is a good example. Rather than flagging every md5( call at HIGH confidence, I could tighten the pattern to focus on contexts that suggest credential handling:

Before (noisy):

patterns:
  - regex: '\bmd5\s*\('
    confidence: HIGH

After (tighter):

patterns:
  - regex: 'md5\s*\(\s*(password|passwd|pwd|secret|credential|token)'
    confidence: HIGH
  - regex: '(password|passwd|pwd)\s*=\s*.*md5\s*\('
    confidence: HIGH
  - regex: '\bmd5\s*\('
    confidence: LOW   # Generic usage — review context

Now the rule distinguishes between MD5 in credential contexts (HIGH confidence, almost certainly a problem) and generic MD5 usage (LOW confidence, warrants a look but probably fine). The total finding count might be the same, but the HIGH-confidence findings are now the ones that genuinely require a fix, so a much larger share of what reaches a confidence-gated pipeline is actionable.
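The effect of the tightened rule can be checked against the earlier examples. This sketch assumes that when several patterns match a line, the highest confidence wins; that resolution strategy and the `confidence_for` helper are assumptions for illustration.

```python
import re

# The three patterns from the tightened rule, in order
RULES = [
    (re.compile(r"md5\s*\(\s*(password|passwd|pwd|secret|credential|token)"), "HIGH"),
    (re.compile(r"(password|passwd|pwd)\s*=\s*.*md5\s*\("), "HIGH"),
    (re.compile(r"\bmd5\s*\("), "LOW"),
]
RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}

def confidence_for(line: str):
    """Highest confidence among matching patterns, or None if nothing matches."""
    hits = [conf for pattern, conf in RULES if pattern.search(line)]
    return max(hits, key=RANK.get, default=None)

if __name__ == "__main__":
    for line in [
        "file_hash = hashlib.md5(file_content).hexdigest()",
        "stored_password = hashlib.md5(user_password).hexdigest()",
        "digest = hashlib.md5(password).hexdigest()",
    ]:
        print(confidence_for(line), "|", line)
```

Note that the keyword patterns are positional: a line only reaches HIGH when a keyword sits directly after the opening parenthesis or directly before the assignment. Anything else falls through to the generic LOW pattern.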

This is the most sustainable way to reduce false positives: better rules, not more suppressions.

The False Negative Trade-off

Every time you tune a rule to reduce false positives, you risk introducing false negatives — real vulnerabilities the scanner no longer catches.

This is the fundamental tension in SAST tool design. It has no clean resolution. It only has a deliberate choice.

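If you tighten the MD5 rule to only flag credential contexts (by dropping the generic LOW-confidence pattern, or by gating the pipeline on HIGH confidence), you'll miss the case where neither the assigned variable name nor the start of the argument matches one of the keywords: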

# Now invisible to the tightened rule
user_auth_hash = hashlib.md5(user_password).hexdigest()

The question is: which failure mode is more expensive for your specific context?

If your team is diligent about triage and the cost of a false negative (missed vulnerability) is high — financial services, healthcare, anything with regulatory consequences — keep rules broader and invest in the triage process.

If your team is drowning in noise and findings aren't getting triaged at all — the scanner has already effectively failed — tighten the rules to rebuild trust, accept the trade-off, and plan to layer in additional controls elsewhere.

There's no universally correct answer. There's only an honest assessment of your specific situation.

What a Healthy Suppression Profile Looks Like

After a few months of running the scanner with a consistent process, here's what healthy metrics look like:

Suppression rate below 20%. If more than 1 in 5 findings is being suppressed, your rules are too noisy for your codebase. Tune the rules rather than suppressing everything.

No suppressions without justification comments. Bare # sast-ignore annotations with no explanation are a red flag. Make justification comments a code review requirement.

Suppression inventory reviewed quarterly. Old suppressions that are no longer valid are silent technical debt. A quarterly review catches them.

False positive rate declining over time. As you tune rules based on real-world results, your false positive rate should go down. If it's stable or increasing, you're not learning from your suppression data.

New findings triaged within one sprint. If findings from a scan are still unreviewed after two weeks, your triage process isn't keeping up. Either reduce the finding volume (tune rules) or increase triage capacity.
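The first metric can be computed straight from the JSON report. This sketch assumes the report shape shown earlier, where `findings` holds active findings only, so the suppression rate is suppressed divided by active plus suppressed. The function names are illustrative.

```python
def suppression_rate(report: dict) -> float:
    """Share of all detections (active + suppressed) that were suppressed."""
    active = len(report["findings"])
    suppressed = len(report["suppressed_findings"])
    total = active + suppressed
    return suppressed / total if total else 0.0

def is_healthy(report: dict, max_rate: float = 0.20) -> bool:
    """True when the suppression rate is below the 20% guideline."""
    return suppression_rate(report) < max_rate
```

Tracking this value per scan, rather than as a one-off, is what makes the "declining over time" check possible.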

The Bigger Point

False positive management is not a technical problem. It's a trust and process problem that has technical levers.

The suppression system in my scanner — inline annotations, justification comments, suppressed findings in the JSON output, confidence levels on patterns — these are all technical levers. But they only work in the context of a team that has agreed on how to use them.

The best SAST implementation I can imagine is one where:

Engineers trust the scanner because it has a low false positive rate
The scanner trusts engineers because suppressions are reviewed and justified
Security teams trust both because the suppression inventory is auditable and periodically reviewed

That's not a configuration. That's a culture. The configuration just makes the culture possible.

Full source and suppression documentation at github.com/pgmpofu/sast-tool.

Next up — the final article in this series: what building all of this taught me about application security that 13 years of software engineering didn't.

The Adoption Trap to Avoid

Patience Mpofu — Thu, 07 May 2026 18:41:12 +0000

The single biggest mistake teams make with CI/CD-integrated security tooling is treating it as a one-time setup rather than an ongoing programme.

The scanner is not the security programme. The scanner is a signal generator. The security programme is the process by which signals become fixes, fixes become patterns, and patterns become rules that prevent the same issue from appearing again.

Configurable thresholds give you the controls to introduce that programme without breaking your team's deployment workflow. Use them gradually, communicate the reasoning at each phase, and invest as much in the suppression review process as you do in the initial setup.

A scanner your team trusts and engages with is worth ten scanners that get bypassed.

Full source and GitHub Actions workflow examples at github.com/pgmpofu/sast-tool.

Next up: the one everyone's been asking about — false positives in SAST, how I built suppression into the scanner, and why managing false positives is as important as finding real vulnerabilities.