Vasishta Nandipati

Posted on May 29

I Built a Secret Scanner That Checks Your Git History, Not Just Your Code

#security #python #opensource #devops

Most developers know they shouldn't commit API keys. Most secret scanners will catch an AWS key sitting in your current codebase. What a scan of the current tree won't catch is the key you deleted three commits ago, which is still fully recoverable by anyone who clones your repo and runs git log -p.

That gap is what I built leakscan to address.

The Problem With Current-State-Only Scanners

When you delete a secret from a file and commit, the removal is recorded in git history. But the original commit that introduced the secret is still there. Every clone of your repository carries that history. Anyone -- a future contributor, a malicious actor, a job applicant reviewing your public code -- can recover those secrets.

# This recovers secrets you "deleted" months ago
git log -p | grep -A2 "AKIA\|sk-\|ghp_"

A scan that only looks at your working tree misses all of this. leakscan traverses every commit.
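As a rough illustration of what traversing history involves, here is a minimal Python sketch that walks the output of git log -p and reports added lines containing the same crude prefixes as the grep above. This is not leakscan's implementation; the function names find_in_patch_log and scan_history are made up for this example.

```python
import re
import subprocess

# Same crude prefixes as the grep example; real patterns would be stricter.
PREFIX_RE = re.compile(r"AKIA|sk-|ghp_")


def find_in_patch_log(log_text):
    """Yield (commit_sha, added_line) for added diff lines matching PREFIX_RE."""
    commit = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            # Header line of the form "commit <sha>"
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            # "+" marks an added line; "+++" is the diff file header
            if PREFIX_RE.search(line):
                yield commit, line[1:].strip()


def scan_history(repo_path="."):
    """Run `git log -p --all` in repo_path and scan every patch (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "-p", "--all"],
        capture_output=True, text=True, errors="replace", check=True,
    )
    return list(find_in_patch_log(result.stdout))
```

Only added lines are checked, so a later commit that removes the secret does not hide the commit that introduced it.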

What leakscan Does

leakscan is a Python CLI that scans for leaked secrets across:

Local file trees (parallel, 8 threads)
Full git history across any branch
Public GitHub repos by URL
All repos and gists for a GitHub user or org
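For the local file tree case, a parallel scan with 8 threads could look roughly like the sketch below. It is an assumption about the general shape, not leakscan's engine: iter_files, scan_tree, and the scan_file callback are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def iter_files(root):
    """Yield every file path under root, skipping the .git directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            yield os.path.join(dirpath, name)


def scan_tree(root, scan_file, workers=8):
    """Apply scan_file(path) to every file using a thread pool.

    Returns {path: findings} for files where scan_file returned something truthy.
    """
    paths = list(iter_files(root))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(scan_file, paths))
    return {path: found for path, found in zip(paths, results) if found}
```

Threads are a reasonable fit here because most of the time goes to reading files rather than to CPU-bound work.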

It ships with 55+ regex patterns covering AWS, GitHub, GitLab, Stripe, OpenAI, Anthropic, Slack, Twilio, Discord, Telegram, npm, PyPI, and more. On top of regex, it runs Shannon entropy scoring on .env, YAML, and INI files to catch high-entropy values that don't match a known pattern.
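To give a sense of what a pattern table looks like, here is a small sketch with two widely documented token formats (AWS access key IDs and classic GitHub personal access tokens). The rule names and the scan_text function are illustrative assumptions, not entries copied from leakscan's patterns.py.

```python
import re

# Illustrative rules only; a real pattern set is much larger.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github-pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def scan_text(text):
    """Return a list of (rule_name, line_number, matched_text) tuples."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                findings.append((rule, lineno, match.group(0)))
    return findings
```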

Shannon Entropy: Catching the Unknowns

Not every leaked secret follows a known format. A randomly generated 32-character database password won't match any regex. Shannon entropy measures the randomness of a string -- secrets tend to have high entropy because they're generated to be unpredictable.
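For reference, Shannon entropy over the characters of a string is H = -sum(p_i * log2(p_i)), where p_i is the relative frequency of each distinct character. A minimal sketch of the calculation follows; the function name is my own, and any cutoff for "high" entropy would be a tuning choice that is not specified here.

```python
import math
from collections import Counter


def shannon_entropy(value):
    """Return the Shannon entropy of a string in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    # Sum p * log2(p) over each distinct character, then negate
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string made of one repeated character scores 0 bits, while a string whose characters are all distinct and equally frequent scores log2 of its length, which is why randomly generated values stand out from ordinary config values.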

The entropy scorer in leakscan is scoped to value-bearing lines in config files, not general source code, to keep the false positive rate low. You can disable it with --no-entropy if you're scanning code that has intentionally high-entropy strings (e.g., compiled output).

Live Verification

Finding a secret is only half the picture. leakscan can verify whether a found secret is still active by making a live API call:

secrets scan . --verify

Currently supports: GitHub, GitLab, Stripe, OpenAI, Anthropic, HuggingFace, SendGrid, Slack, npm, Replicate.

A revoked or rotated secret shows as INACTIVE in the output. This matters in triage -- you want to know if you have an active exposure or just a historical artifact.
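As a sketch of what such a check can look like for one service: GitHub's REST API returns 200 from GET /user for a valid token and 401 for a bad one. The code below is an assumption about the general approach, not leakscan's verifier; the function names and the ACTIVE and UNKNOWN labels are mine (only INACTIVE appears in the tool's output as described above).

```python
import urllib.error
import urllib.request


def http_status(url, headers):
    """Perform a GET request and return the HTTP status code."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want
        return err.code


def verify_github_token(token, fetch_status=http_status):
    """Classify a GitHub token as ACTIVE, INACTIVE, or UNKNOWN."""
    status = fetch_status(
        "https://api.github.com/user",
        {"Authorization": f"Bearer {token}", "User-Agent": "verify-sketch"},
    )
    if status == 200:
        return "ACTIVE"
    if status == 401:
        return "INACTIVE"
    # Rate limits, outages, etc. say nothing about the token itself
    return "UNKNOWN"
```

Passing the fetcher in as a parameter keeps the classification logic testable without touching the network.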

CI/CD Integration

The tool is built to run in pipelines without manual configuration.

# GitHub Actions
- name: Scan for secrets
  run: secrets scan . --severity HIGH --no-entropy --format sarif --output results.sarif

- name: Upload SARIF
  # Run even when the scan step fails, so findings still reach the Security tab
  if: always()
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif

leakscan exits with code 1 on any CRITICAL or HIGH finding, so the build fails automatically. SARIF output integrates with the GitHub Security tab and GitLab SAST.

Baseline mode handles the "known findings" problem in CI:

# First run: save current state
secrets scan . --save-baseline .secrets.baseline

# Subsequent runs: only alert on NEW secrets
secrets scan . --baseline .secrets.baseline

This stops CI from constantly alerting on findings you've already triaged and accepted (test fixtures, example configs with placeholder values, etc.).
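Conceptually, a baseline is a set of fingerprints for findings you have already accepted. The sketch below shows one way to do that, assuming a finding is a (path, rule, secret) tuple and the baseline is a JSON list of SHA-256 hashes. The file format and function names are assumptions for illustration, not leakscan's actual baseline format.

```python
import hashlib
import json


def fingerprint(path, rule, secret):
    """Hash a finding so the baseline never stores the secret in plain text."""
    raw = "\x00".join((path, rule, secret)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def save_baseline(findings, baseline_path):
    """Write the fingerprints of the current findings to baseline_path."""
    prints = sorted({fingerprint(*finding) for finding in findings})
    with open(baseline_path, "w", encoding="utf-8") as fh:
        json.dump(prints, fh, indent=2)


def new_findings(findings, baseline_path):
    """Return only the findings whose fingerprint is not in the baseline."""
    with open(baseline_path, encoding="utf-8") as fh:
        known = set(json.load(fh))
    return [f for f in findings if fingerprint(*f) not in known]
```

Leaving the line number out of the fingerprint means a triaged finding stays suppressed when unrelated edits shift it up or down in the file.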

Pre-commit Hook

cd your-git-repo
secrets install-hook

The hook runs on every commit and uses the baseline automatically if present. Inline suppression is supported: add # nosec, # gitleaks:allow, or # secretscanner:allow to any line to skip it.
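Inline suppression comes down to checking each line for one of those markers before reporting a match. A minimal sketch, with hypothetical function names and a plain substring check that may differ from how leakscan parses the comments:

```python
SUPPRESS_MARKERS = ("# nosec", "# gitleaks:allow", "# secretscanner:allow")


def is_suppressed(line):
    """Return True if the line carries any inline suppression marker."""
    return any(marker in line for marker in SUPPRESS_MARKERS)


def lines_to_scan(lines):
    """Return (line_number, line) pairs for lines that are not suppressed."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not is_suppressed(line)
    ]
```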

Output Formats

Format	Use case
Terminal (default)	Interactive review with severity colors
JSON	Programmatic consumption, SIEM ingestion
CSV	Spreadsheet review, audit exports
SARIF 2.1.0	GitHub Security tab, GitLab SAST
Markdown	Disclosure reports to security teams

Architecture

The codebase is intentionally modular:

scanner/
  cli.py        entry point (click)
  engine.py     file walker, parallel scanner, git history
  patterns.py   55+ regex patterns
  entropy.py    Shannon entropy scorer
  verifier.py   live API verification (10 services)
  baseline.py   save/load/compare baseline fingerprints
  reporter.py   terminal/JSON/CSV/SARIF/disclosure output
  ignorefile.py .secretignore parser with ** glob support
  github/
    fetcher.py  GitHub API client: repos, gists, commit history

Each module is independently testable. The full pytest suite is in /tests.

Installation

pip install leakscan

Where This Fits vs. Existing Tools

Tools like Gitleaks, Detect-Secrets, and TruffleHog are excellent. leakscan is a Python-native alternative with a focus on Git history scanning, live verification, and baseline-aware CI. If your team is already Python-heavy, a pip install is a lower-friction entry point than distributing a Go binary.

What's Next

Expanded verifier coverage (Twilio, Mailchimp, Shopify)
GitHub Actions marketplace action
PyPI download metrics and badge

The repo is at github.com/Vasishta03/secret-scanner. Contributions, pattern additions, and feedback welcome.
