DEV Community

Patience Mpofu
I Ran My ML Secrets Detector Against My Own Repositories — Here's What It Found

There's a moment every security tool builder eventually faces.

You've built the scanner. You've written the rules. You've validated it against synthetic test cases and contrived examples. And then you point it at your own code — the repositories you've actually written, committed, and pushed over years of real development work.

That moment is humbling.

I ran my ML secrets detector against every personal repository I own — 11 repositories across Python, Java, Node.js, and Kotlin projects accumulated over several years of portfolio building and side projects. I'm documenting the results honestly: what it found, what was real, what was a false positive, and what the numbers actually looked like.


The Setup

Before running, I configured the scan for comprehensive coverage:

# Full repository scan including git history
python main.py scan ./repos/ \
  --include-history \
  --threshold 0.65 \
  --format all \
  --output ./scan-results/

I set the threshold to 0.65 rather than the default 0.70 because I wanted to see more findings, including ones that would normally sit just below the reporting threshold. For an audit of your own code, more signal is better than less.

The --include-history flag scans not just the current working tree but every commit in git history. This is the mode that makes people nervous. Whatever got committed and "fixed" later is still in the history. It's still accessible. It still needs to be addressed.
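Under the hood, a history scan of this kind typically enumerates every commit and then examines the file contents as they existed at each one. A minimal sketch of the enumeration step using plain `git rev-list` via `subprocess` (illustrative only — not the tool's actual implementation):

```python
import subprocess

def all_commits(repo_path: str) -> list[str]:
    """Return every commit hash reachable from any ref, oldest first."""
    result = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--reverse", "--all"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()
```

Each commit can then be walked with `git show` or `git ls-tree` to retrieve the file versions that existed at that point — which is exactly why secrets that were "fixed later" still surface.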

Repositories scanned: 11

Total commits scanned: 847

Total files scanned: 2,341

Scan duration: 4 minutes 23 seconds


The Raw Numbers

Severity   Findings   Confirmed Real   False Positives   False Positive Rate
CRITICAL          7                6                 1                   14%
HIGH             19               11                 8                   42%
MEDIUM           31                9                22                   71%
Total            57               26                31                   54%

A few things to unpack here.

The CRITICAL findings had a 14% false positive rate — one in seven was benign. That's roughly what I expected based on the test set results. The one false positive was a 32-character hex string in a variable named encryption_mode — the word "encryption" pushed the key name score high, but the value was actually a configuration mode identifier, not a key.

The HIGH findings had a 42% false positive rate. Higher than I'd like, but consistent with the nature of HIGH confidence findings — they're cases where the evidence is strong but not overwhelming. Most of the false positives in this tier were package integrity hashes in older package-lock.json files that hadn't been added to the skip list yet.

The MEDIUM findings had a 71% false positive rate. This is expected and by design. MEDIUM findings are prompts for human review, not automatic defects. Most were generic high-entropy strings in configuration files where the variable names were moderately suspicious but the values were benign.

The overall 54% false positive rate sounds alarming until you account for the lower threshold (0.65 vs. default 0.70) and the MEDIUM tier. At the default threshold, the false positive rate drops to approximately 28% — closer to the test set results.


The Real Findings: What Was Actually There

Of the 26 confirmed real findings, here's what they were. I've anonymised the specific values but documented the pattern honestly.

Finding 1–3: Test Credentials That Never Left Test Files (But Were Still Committed)

Three findings were test database credentials in integration test configuration files:

# tests/integration/test_database.py (2021 commit)
TEST_DB_PASSWORD = "integration_test_password_2021"
TEST_DB_URL = "postgresql://testuser:local_test_pass@localhost/testdb"

These were intentionally "fake" credentials — values I created specifically for local testing. But they were committed to a public repository. The classifier flagged them at 87% and 91% confidence respectively.

Are these real vulnerabilities? Technically no — a local test database password with no external access isn't a secret in the traditional sense. But they taught me something: even intentional test credentials get flagged, which means either the suppression annotation should have been there from the start, or the test configuration should have used environment variables even for local test values.

The lesson isn't that the scanner was wrong. It's that "this is only for testing" is not a reason to skip secure credential handling.
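A hedged sketch of what that looks like in practice — reading even local test credentials from the environment, with an obviously-placeholder fallback (the variable names mirror the example above; the defaults are illustrative):

```python
import os

# Even "only for testing" values come from the environment. The fallback is
# an obvious placeholder, never something that looks like a real credential.
TEST_DB_PASSWORD = os.environ.get("TEST_DB_PASSWORD", "placeholder-local-only")
TEST_DB_URL = os.environ.get(
    "TEST_DB_URL",
    f"postgresql://testuser:{TEST_DB_PASSWORD}@localhost/testdb",
)
```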

Finding 4: An Actual JWT Secret (History)

This one made my stomach drop.

CRITICAL (97%) · src/auth/config.py:23
jwt_secret = "my-jwt-signing-secret-change-this"
 History: commit a3f8b2c · 2020-03-14

I found a hardcoded JWT signing secret in a 2020 commit to a project that I had since "fixed" by moving to environment variables. The fix was in the current code. The secret was still in git history.

The value itself — "my-jwt-signing-secret-change-this" — is one of those values that developers write with the intention of replacing it before going anywhere near production. The comment is literally in the name. But it got committed, and committed things live in git history forever unless you rewrite it.

The project was never deployed to production with this value. But it was a public repository. Anyone who cloned it at any point in 2020 has this value. The theoretical attack surface was real even if the practical exploitation probability was low.

What I did: Rewrote the commit history using git filter-branch to remove the file containing the secret, then force-pushed. I also added a .gitignore entry for config.py files and a pre-commit hook (obviously) to catch this pattern in future.

Finding 5–8: API Keys in Old Test Scripts

Four findings were API keys in utility scripts I'd written to test integrations:

# scripts/test_sendgrid.py (2019 commit)
SENDGRID_API_KEY = "SG.abc123...xyz789"  # key has been rotated

These were real API keys at the time of commit. I confirmed with the respective providers that all four had been rotated or the accounts had been closed — so the operational risk was zero. But they were real keys that were real secrets when committed.

This is the most common pattern in real credential exposure incidents: keys that were live at the time of commit, rotated after discovery, but remain in history as evidence of the exposure. The key rotation closes the operational risk but doesn't erase the fact of the exposure.

What I did: Rotated anything still active (none were), documented the historical exposure, and rewrote history for the two repositories where the keys were in active-looking scripts. For older repositories where the scripts were clearly abandoned, I left the history intact and noted the exposure in the repository README.

Finding 9–11: Internal Service URLs With Embedded Credentials

Three findings were database and service connection strings:

# config/database.py (2022 commit)
DATABASE_URL = "postgresql://admin:password123@internal-host:5432/appdb"

None of these were production credentials — they were development environment connection strings pointing to local or development hosts. But the pattern is exactly what you see in production credential exposures, and the scanner correctly identified them as high confidence.

Two were for hosts that no longer exist. One was for a development Postgres instance that still exists but has no external network access. The operational risk was low; the pattern risk was real.

Finding 12: A Private Key Fragment in a README

The most surprising finding:

CRITICAL (99%) · README.md:47
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA...

A README containing an example private key that I'd generated specifically to demonstrate what a private key looks like in documentation. It was a real RSA private key — not a truncated fake — but generated purely for documentation purposes and never associated with any system.

The scanner correctly flagged it. The private key has never been used for anything. But it's a valid RSA private key that anyone could theoretically use to claim they found something in my repository.

What I did: Replaced the real private key in the README with a clearly truncated fake:

-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEA[EXAMPLE - NOT A REAL KEY]...
-----END RSA PRIVATE KEY-----

If you're writing documentation that shows what a private key looks like, never use a real generated key. Generate a fake-looking placeholder instead.

Findings 13–26: Various Confirmed Vulnerabilities

The remaining 14 confirmed findings were a mix of:

  • Hardcoded passwords in older Java projects using Spring with properties files committed directly
  • OAuth client secrets in mobile app prototype code from 2018–2019
  • Slack webhook URLs (which are effectively secrets — anyone with the URL can post to your channel)
  • Internal service tokens from a project that has since been decommissioned

All were historical, and all have been rotated or decommissioned. All are now either suppressed with justification or removed from history.

The False Positives: What Triggered Them

The 31 false positives clustered into four categories:

Category 1: Package Lock File Hashes (12 findings)

The most numerous false positive source. package-lock.json files contain SHA-512 integrity hashes for every dependency:

"integrity": "sha512-abc123def456..."

These are high-entropy strings in a file that often has keys named integrity. The key name risk for "integrity" is 0.0 in my vocabulary, which should push these below threshold — and at the default 0.70 threshold, most don't appear. At 0.65, several edge cases squeaked through.

Fix: Added package-lock.json, yarn.lock, and *.lock to the global skip list.
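The skip list can be as simple as glob matching on filenames. A hypothetical sketch (the tool's real configuration format may differ):

```python
from fnmatch import fnmatch
import os

# Globs matched against the basename of each scanned file.
SKIP_PATTERNS = ["package-lock.json", "yarn.lock", "*.lock"]

def should_skip(path: str) -> bool:
    """True if the file should be excluded from scanning entirely."""
    name = os.path.basename(path)
    return any(fnmatch(name, pattern) for pattern in SKIP_PATTERNS)
```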

Category 2: UUID Values With Moderately Sensitive Variable Names (8 findings)

session_token = "550e8400-e29b-41d4-a716-446655440000"
auth_correlation_id = "7c9b2de1-3f4a-8b5c-2d1e-9f8a7b6c5d4e"

"Session token" and "auth correlation ID" score moderately high on key name risk. UUIDs have moderate entropy. The combination pushed these above 0.65.

Fix: Added correlation_id, session_id, request_id, and similar terms to the explicitly benign vocabulary with a score of 0.0.
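The "moderate entropy" intuition can be made concrete. Shannon entropy over the character distribution is one common proxy (an illustrative sketch, not the tool's actual feature extraction):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Average bits per character over the string's character distribution."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A UUID draws from a small hex-plus-dash alphabet with repeated characters,
# so it scores well below a 32-character mixed-case random key.
uuid_entropy = shannon_entropy("550e8400-e29b-41d4-a716-446655440000")
key_entropy = shannon_entropy("aK9mP2xL8vR3qT7nY5wZ1bJ4cH6dF0eI")
```

That gap alone isn't decisive, which is why a moderately suspicious variable name was enough to push these UUIDs over a 0.65 threshold.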

Category 3: Example Values in Documentation (7 findings)

Markdown files and READMEs containing example code snippets:

Set your API key:

API_KEY = "your-api-key-here"
"your-api-key-here" is low entropy and obviously a placeholder. The scanner correctly passes it. But other examples used more realistic-looking values:

API_KEY = "aK9mP2xL8vR3qT7nY5wZ1bJ4cH6dF0eI"

The variable name is high risk, the entropy is high, and no pattern matches — 78% confidence. False positive, but an understandable one.

Fix: Added .md and .rst files to a lower-confidence mode (threshold raised to 0.90 for documentation files) rather than skipping them entirely — real secrets do appear in committed documentation.
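A hedged sketch of that per-file-type policy (illustrative; the tool's real configuration may express this differently):

```python
import os

# Documentation files get a stricter reporting bar instead of a blanket skip,
# because real secrets do land in committed docs.
DOC_EXTENSIONS = {".md", ".rst"}

def effective_threshold(path: str, base: float = 0.70, docs: float = 0.90) -> float:
    """Confidence threshold to apply when reporting findings in this file."""
    _, ext = os.path.splitext(path)
    return docs if ext.lower() in DOC_EXTENSIONS else base
```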

Category 4: High-Entropy Configuration Values (4 findings)

Configuration values that are long and random-looking but aren't secrets:

CACHE_KEY_PREFIX = "app_v2_prod_cache_2024_r3f8b2"
CORRELATION_HEADER = "X-Request-ID-v2-production-shard-3"

These are deterministic, human-readable configuration values that happen to be long and alphanumeric. In most codebases this category is a rare false positive source, but these appeared in mine.

Fix: These are the hardest category to address systematically. The suppression annotation is the right tool — add # secrets-ignore with a note that the value is a configuration constant.
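In source, that looks like an inline marker the scanner checks before reporting. A minimal sketch of both sides (the annotation name matches the one above; the matching logic is illustrative):

```python
SUPPRESS_MARKER = "secrets-ignore"

def is_suppressed(line: str) -> bool:
    """True if the line carries an inline suppression comment."""
    _, sep, comment = line.partition("#")
    return bool(sep) and SUPPRESS_MARKER in comment

# An annotated configuration constant: flagged value, justified suppression.
line = 'CACHE_KEY_PREFIX = "app_v2_prod_cache_2024_r3f8b2"  # secrets-ignore: config constant'
```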


What the History Scan Revealed That the Current Scan Didn't

Scanning history found 9 findings that don't appear in the current codebase — secrets that have been "fixed" but remain in git history. This is the most important capability of the history scanner and the most overlooked.

Findings in current code: 17
Findings only in history: 9
Total unique findings: 26

The 9 historical-only findings represent credentials that a developer committed, noticed (or was told about), and removed from the current code — but never removed from history. From a security perspective, these are live exposures. The credential exists in a public repository's history. Anyone who cloned the repository at any point has it.

The remediation for historical findings is harder than current findings:

Option 1: Rotate the credential. If the credential is still active, rotate it immediately. The historical exposure is already done — rotation closes the operational risk.

Option 2: Rewrite git history. Using git filter-branch or the newer git filter-repo, you can rewrite history to remove the file or commit containing the secret. This requires force-pushing, which is disruptive if other people have cloned the repository.
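For example, removing a single file from all of history with git filter-repo might look like this (a sketch — the file path and remote URL are illustrative; note that filter-repo refuses to run on anything but a fresh clone unless forced):

```shell
# Install the recommended history-rewriting tool (filter-branch is deprecated)
pip install git-filter-repo

# Drop the offending file from every commit; run this on a fresh clone
git filter-repo --invert-paths --path src/auth/config.py

# filter-repo removes remotes as a safety measure; re-add one before pushing
git remote add origin git@github.com:pgmpofu/example-repo.git
git push --force origin --all
```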

Option 3: Make the repository private. If the repository is public and the historical exposure is significant, making it private while history is cleaned up is a reasonable interim step.

Option 4: Document and accept. For decommissioned systems and rotated credentials with no active risk, documenting the historical exposure in the repository README and marking the findings as suppressed is acceptable. Not ideal, but pragmatic for old secrets with no active attack surface.


The Honest Assessment

Running the scanner against my own repositories was a genuinely useful exercise that I'd recommend to anyone building security tooling.

What worked well:

  • CRITICAL findings were high precision — 6 out of 7 were real
  • The history scanner found things I'd genuinely forgotten about
  • The scan was fast enough that 11 repositories in 4 minutes felt reasonable
  • The output was actionable — I knew exactly what to fix and where

What needs improvement:

  • The HIGH finding false positive rate of 42% is too high for a production tool targeting real organisations. It would erode trust in a team context
  • The package-lock.json skip list should have been in place from the start — that's a known false positive source that I didn't anticipate fully
  • The threshold calibration needs work — 0.70 feels too conservative for CRITICAL findings and not conservative enough for HIGH findings

The finding that most surprised me: the JWT secret in history. Not because the scanner catching it was unexpected — that's exactly what the history scanner is for — but because I had genuinely forgotten it was there. I "fixed" the issue in 2020 by moving to environment variables and closed the mental file. The history scanner reopened it.

That's the value proposition of history scanning in one sentence: it finds the things you fixed but didn't actually fix.


What to Do If You Want to Run This Against Your Own Repos

Start with current code only, at the default threshold:

python main.py scan ./your-repo --threshold 0.70 --format terminal

Triage every CRITICAL finding before looking at anything else. Then work through HIGH. Treat MEDIUM as informational unless something catches your eye.

Once you've cleaned up the current state, run the history scan:

python main.py scan ./your-repo --include-history --threshold 0.70

Be prepared for findings you've forgotten about. Have a decision framework ready for each one: rotate, rewrite history, or document and accept.

The scan itself is the easy part. The remediation decisions are where the real work is.


The full tool, including the history scanner and all configuration options, is at github.com/pgmpofu/secrets-detector.

If you run it against your own repositories and find something interesting — or find a false positive pattern I haven't handled — open an issue. The tool gets better from real-world feedback, and real-world feedback only comes from people running it on real code.
