Patience Mpofu

Posted on May 16

Blocking Secrets Before They Hit the Repository: Building a Pre-Commit Hook With ML

#security #git #devops #python

There are two places you can catch an exposed secret.

After it's in the repository — in a CI/CD pipeline scan, a periodic audit, or a breach notification from a security researcher who found it in your public history. Or before it ever gets there — at the moment of git commit, when the developer is still at their keyboard and the fix takes thirty seconds.

The second option is better in every dimension. Earlier detection means lower remediation cost. A blocked commit means no credential rotation required, no incident response, no git history rewriting. The developer who gets stopped at commit understands immediately what they did and why — the context is fresh, the fix is obvious.

The challenge is UX.

A pre-commit hook that's too slow gets disabled. A hook that generates too many false positives gets disabled. A hook that doesn't explain itself gets disabled and complained about on Slack. A hook that developers trust — that's fast, precise, and tells them exactly what it found and why — stays enabled and actually prevents exposures.

This article is about building a pre-commit hook that developers will actually leave on.

What the Hook Needs to Do

Before writing a line of code, I defined what a good pre-commit secrets hook looks like from the developer's perspective.

Speed. The hook runs on every commit. If it adds more than two or three seconds, developers will notice and resent it. On a typical feature branch with a handful of changed files, the scan needs to complete in under two seconds.

Scope. The hook should scan staged content — only the files about to be committed — not the entire repository. Scanning everything on every commit is unnecessary and slow.

Signal clarity. When the hook blocks a commit, the developer needs to know immediately: which file, which line, what variable, why it was flagged. "Secret detected" with no context is useless. "HIGH confidence (94%): api_key = "sk-proj-abc123..." in config/settings.py line 47 — matches OpenAI key format" is actionable.

Suppression path. Developers need a documented, low-friction way to handle false positives. The hook can't be a hard wall with no escape — that's how hooks get disabled entirely.

Non-destructive. The hook never modifies files. It either passes silently or blocks and explains. That's it.

Architecture: Scanning Staged Content

The first architectural decision is what to scan. There are two options:

Option A: Scan the working tree — the files as they currently exist on disk, including unstaged changes.

Option B: Scan the staged content — exactly what git diff --cached shows, which is what will actually be committed.

Option B is correct. Scanning the working tree means flagging things the developer hasn't committed and may never intend to commit. That's noise. Scanning staged content means flagging exactly what's about to enter the repository — which is the precise intervention point.

import subprocess


def get_staged_content() -> dict[str, str]:
    """Get the staged content for all modified/added files."""
    staged_files = {}

    # Get list of staged files
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True
    )

    filenames = result.stdout.strip().split('\n')

    for filename in filenames:
        if not filename:
            continue

        # Get staged content (not working tree content)
        content_result = subprocess.run(
            ["git", "show", f":{filename}"],
            capture_output=True, text=True
        )

        if content_result.returncode == 0:
            staged_files[filename] = content_result.stdout

    return staged_files

The --diff-filter=ACM flag limits to Added, Copied, and Modified files — not deletions. Scanning deleted file content would generate findings for secrets that are being removed, which is the wrong direction.

The Scan Loop: From Staged Content to Findings

The hook extracts string literal assignments from each staged file and passes them through the ML classifier:

def scan_staged_files(staged_content: dict[str, str], threshold: float = 0.7):
    findings = []

    for filepath, content in staged_content.items():
        # Skip binary files, lock files, and known safe extensions
        if should_skip_file(filepath):
            continue

        lines = content.split('\n')

        for line_num, line in enumerate(lines, 1):
            # Skip lines with suppression annotation
            if '# secrets-ignore' in line or '# nosec' in line:
                continue

            # Extract (key_name, value) pairs from string assignments
            assignments = extract_string_assignments(line)

            for key_name, value in assignments:
                if len(value) < 8:  # Skip very short strings
                    continue

                features = extract_features(value, key_name)
                confidence = model.predict_proba([features])[0][1]

                if confidence >= threshold:
                    findings.append({
                        "file": filepath,
                        "line": line_num,
                        "key_name": key_name,
                        "value_preview": value[:20] + "..." if len(value) > 20 else value,
                        "confidence": confidence,
                        "severity": confidence_to_severity(confidence)
                    })

    return findings

A few implementation details worth highlighting:

should_skip_file() excludes file types that generate systematic false positives: package-lock.json, yarn.lock, *.sum (Go module checksums), *.min.js (minified JavaScript), binary file extensions, and image files. These are maintained in a skip list rather than being hardcoded into the scan logic, so teams can extend it for their specific false positive patterns.

Value preview truncation. The finding reports only the first 20 characters of the flagged value, with ... truncation. Showing the full value in terminal output creates a secondary exposure — if someone is screen sharing when the hook fires, the secret shouldn't appear in full in the terminal.

Minimum length of 8. Strings shorter than 8 characters are almost never secrets. This eliminates a class of false positives from short configuration values and reduces scan time on files with many string literals.
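The scan loop calls should_skip_file() and extract_string_assignments() without showing them. What follows is a minimal sketch of how the two helpers might look, not the repository's actual code: the SKIP_PATTERNS list and the ASSIGNMENT_RE regular expression are assumptions made for illustration, and the regex only handles simple name = "value" or name: 'value' forms on a single line.

```python
import fnmatch
import os
import re

# Hypothetical skip list; the patterns mirror the file types named above.
# Teams would extend this list rather than edit the scan loop.
SKIP_PATTERNS = [
    "package-lock.json", "yarn.lock", "*.sum", "*.min.js",
    "*.png", "*.jpg", "*.jpeg", "*.gif", "*.ico", "*.pdf", "*.zip",
]


def should_skip_file(filepath: str) -> bool:
    """Return True if the file's base name matches any skip pattern."""
    name = os.path.basename(filepath)
    return any(fnmatch.fnmatch(name, pattern) for pattern in SKIP_PATTERNS)


# Assumed pattern: an identifier, then = or :, then a quoted string literal.
ASSIGNMENT_RE = re.compile(
    r"""(?P<key>[A-Za-z_][A-Za-z0-9_]*)\s*[:=]\s*(?P<quote>["'])(?P<value>.*?)(?P=quote)"""
)


def extract_string_assignments(line: str) -> list[tuple[str, str]]:
    """Return (key_name, value) pairs for string assignments found on one line."""
    return [(m.group("key"), m.group("value")) for m in ASSIGNMENT_RE.finditer(line)]
```

A regex this simple misses quoted JSON keys, multi-line strings, and escaped quotes inside values, so a real implementation would likely need per-language handling.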

The Output: Making Findings Actionable

The most important UX decision in the hook is what to show when a commit is blocked. I went through four iterations of the output format before settling on one that developers responded well to.

Iteration 1 (too terse):

BLOCKED: Secret detected in config/settings.py

Developers immediately asked: "What secret? Where exactly? What should I do?"

Iteration 2 (better but still vague):

BLOCKED: Possible secret at config/settings.py:47

Still not enough context. Developers had to open the file and count to line 47 to understand what was flagged.

Iteration 3 (too verbose):

[SECRETS DETECTOR] 
==========================================
COMMIT BLOCKED — POTENTIAL SECRET DETECTED
==========================================
File: config/settings.py
Line: 47
Variable: api_key
Value (truncated): sk-proj-abc123...
Confidence: 94%
Severity: CRITICAL
Matched Pattern: OpenAI API key format (sk-proj-*)
Feature contributions:
  - key_name_risk: 0.90 (HIGH)
  - shannon_entropy: 5.82 (HIGH)
  - pattern_openai_key: 1.00 (MATCH)
  - repetition_ratio: 0.94 (HIGH)

To suppress this finding, add '# secrets-ignore' to line 47
To bypass this check entirely (NOT RECOMMENDED): git commit --no-verify
==========================================

This is technically complete but overwhelming. Developers in flow state don't want to read a report. They want to know: what, where, what to do.

Final version (what shipped):

🔴 Secrets Detector — Commit Blocked

  CRITICAL (94%) · config/settings.py:47
  api_key = "sk-proj-abc123..."
  ↳ Matches OpenAI key format · High entropy · Sensitive variable name

  To suppress false positive: add  # secrets-ignore  to line 47
  To use env vars instead:    export API_KEY="your-key"
                              then  api_key = os.environ["API_KEY"]

1 finding blocked this commit. Fix the issue or suppress with justification.

The final format answers the three questions developers actually have in two seconds of reading: what is it (OpenAI key), where is it (file and line), what do I do (env var example or suppression). The feature contributions are available in verbose mode (--verbose) but don't appear by default.
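As a rough illustration of the final layout, the sketch below renders one finding dictionary (the same keys the scan loop produces) into the three-line block. The format_finding name and the separate reasons argument are assumptions for this example; the article does not show how the explanation line is generated.

```python
def format_finding(finding: dict, reasons: list[str]) -> str:
    """Render one finding as: header line, flagged assignment, explanation."""
    percent = round(finding["confidence"] * 100)
    header = f'  {finding["severity"]} ({percent}%) · {finding["file"]}:{finding["line"]}'
    assignment = f'  {finding["key_name"]} = "{finding["value_preview"]}"'
    # reasons is a hypothetical list of short human-readable explanations
    explanation = "  ↳ " + " · ".join(reasons)
    return "\n".join([header, assignment, explanation])


if __name__ == "__main__":
    example = {
        "file": "config/settings.py",
        "line": 47,
        "key_name": "api_key",
        "value_preview": "sk-proj-abc123...",
        "confidence": 0.94,
        "severity": "CRITICAL",
    }
    print(format_finding(example, ["Matches OpenAI key format", "High entropy"]))
```

Keeping the renderer a pure function of the finding makes it easy to add a verbose variant later without touching the scan loop.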

The emoji is intentional. 🔴 provides an immediate visual signal in terminals that support it, and degrades gracefully to plain text in terminals that don't.

Handling Multiple Findings

When multiple findings exist, the output stacks them:

🔴 Secrets Detector — Commit Blocked

  CRITICAL (96%) · src/database.py:12
  DB_PASSWORD = "Tr0ub4dor&3"
  ↳ High-risk variable name · Matches human-chosen password pattern

  HIGH (78%) · src/config.py:34
  internal_token = "prod-service-backend-2019"
  ↳ Moderate-risk variable name · Low entropy but sensitive context

2 findings blocked this commit. Fix all issues before committing.

Findings are sorted by confidence descending — the most certain findings appear first, which is where the developer's attention should go.

The commit is blocked if any finding exceeds the threshold, not just the highest-confidence one. A batch of MEDIUM confidence findings is still a blocked commit. If all findings are genuine false positives, they should all be suppressed with justification — not just the top one.
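The ordering and blocking rules above can be sketched in a few lines. The severity cut-offs here are assumptions chosen only to agree with the examples in this article (94% and 96% shown as CRITICAL, 78% as HIGH); the real confidence_to_severity() may use different boundaries. The findings passed in are assumed to be those the scan loop already kept, meaning at or above the threshold.

```python
def confidence_to_severity(confidence: float) -> str:
    """Map a model confidence to a label (assumed cut-offs, not the author's)."""
    if confidence >= 0.9:
        return "CRITICAL"
    if confidence >= 0.75:
        return "HIGH"
    return "MEDIUM"


def order_findings(findings: list[dict]) -> list[dict]:
    """Sort findings so the most confident ones are printed first."""
    return sorted(findings, key=lambda f: f["confidence"], reverse=True)


def exit_code(findings: list[dict]) -> int:
    """Any finding blocks the commit: git aborts when the hook exits non-zero."""
    return 1 if findings else 0
```

Under these assumptions a single MEDIUM finding still yields exit code 1, which matches the rule that every finding must be fixed or suppressed.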

The Suppression UX

The suppression path needs to be low-friction but not invisible. If suppressing a false positive is too hard, developers will use git commit --no-verify to bypass the hook entirely — which defeats the purpose.

The designed flow:

# Developer encounters a false positive:
# file_integrity_hash = "d8e8fca2dc0f896fd7cb4cb0031ba249"  ← flagged

# They add the annotation with a justification:
# MD5 hash for file integrity check only — not a credential
file_integrity_hash = "d8e8fca2dc0f896fd7cb4cb0031ba249"  # secrets-ignore

# Commit proceeds normally on next attempt

The # secrets-ignore annotation is visible in code review. A reviewer can see that a suppression was added and evaluate whether the justification is reasonable. This is the governance layer — suppressions can't happen silently.

The hook also respects the SECRETS_DETECTOR_THRESHOLD environment variable, which allows individual developers to adjust their personal threshold without modifying shared configuration:

# Developer who wants to see more findings (lower threshold)
SECRETS_DETECTOR_THRESHOLD=0.55 git commit -m "wip"

# Developer who wants fewer false positives (higher threshold)
SECRETS_DETECTOR_THRESHOLD=0.85 git commit -m "feature: payment flow"

This flexibility matters for adoption. Some developers will want to see everything; others will want a tighter filter. Forcing everyone to the same threshold is a source of friction.
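One way the override could be resolved is sketched below. The resolve_threshold name, the fallback behaviour for invalid values, and the choice to let the environment variable win over the configured value are all assumptions; only the SECRETS_DETECTOR_THRESHOLD variable name and the 0.7 default come from the article.

```python
import os
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 0.7


def resolve_threshold(
    configured: float = DEFAULT_THRESHOLD,
    environ: Optional[Mapping[str, str]] = None,
) -> float:
    """Return the env override if it is a valid probability, else the configured value."""
    env = os.environ if environ is None else environ
    raw = env.get("SECRETS_DETECTOR_THRESHOLD")
    if raw is None:
        return configured
    try:
        value = float(raw)
    except ValueError:
        # Unparseable override: keep the shared setting rather than fail the commit
        return configured
    return value if 0.0 <= value <= 1.0 else configured
```

Falling back silently is a design choice made for this sketch; a hook could equally print a warning when the override is ignored.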

Installation: Making Setup Frictionless

A hook that's hard to install never gets installed. The setup needs to take a few commands at most:

# Using pre-commit framework (recommended)
pip install pre-commit
echo "repos:
- repo: https://github.com/pgmpofu/secrets-detector
  rev: v1.0.0
  hooks:
  - id: secrets-detector
    args: [--threshold, '0.7']" > .pre-commit-config.yaml
pre-commit install

Or manual installation for teams not using the pre-commit framework:

# Copy hook to git hooks directory
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

The pre-commit framework approach is preferable for teams because it version-pins the hook and makes it part of the repository configuration (.pre-commit-config.yaml is committed). New team members still need to run pre-commit install once after git clone, because Git does not install hooks automatically. The manual approach works for individual use.

What Happens at `git commit --no-verify`

This is the escape hatch that can't be removed. Git's --no-verify flag bypasses all hooks, and there's nothing a hook can do to prevent it.

The right response to this is not technical — it's cultural and procedural.

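In a team setting, git commit --no-verify should require a comment in the commit message explaining why the hook was bypassed. Git does not record whether --no-verify was used, so CI cannot detect the flag directly. What CI can do is re-run the scan on the commits in a PR and, where it finds something the hook would have blocked, require a justification marker in the commit message. The step below covers the reporting half: it lists the commits that carry the marker.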

# In GitHub Actions
- name: Check for hook bypasses
  run: |
    # Walk every commit on this branch that is not yet on main
    git log --format=%H origin/main..HEAD | while read -r hash; do
      # Report commits whose message documents a bypass
      if git log --format=%B -n 1 "$hash" | grep -q "no-verify bypass"; then
        echo "Documented bypass found in $hash"
      fi
    done

The goal is to make --no-verify traceable, not to make it impossible. A developer in a genuine emergency who needs to commit right now and deal with the secret later should be able to do that — but there should be a record of the decision.
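The enforcement half could look something like the sketch below. It assumes CI has already re-scanned each commit and knows how many findings each one introduced; the undocumented_bypasses function and the (sha, message, finding_count) tuple shape are hypothetical, and the marker text is the one the pipeline step greps for.

```python
BYPASS_MARKER = "no-verify bypass"


def undocumented_bypasses(commits: list[tuple[str, str, int]]) -> list[str]:
    """Return the hashes of commits that contain findings but no bypass marker.

    Each commit is (sha, message, finding_count), where finding_count is
    assumed to come from re-running the scanner in CI.
    """
    return [
        sha
        for sha, message, finding_count in commits
        if finding_count > 0 and BYPASS_MARKER not in message
    ]
```

A pipeline could fail the build when this list is non-empty, which keeps the bypass available while making sure each use leaves a record.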

Measuring Hook Effectiveness

After the hook has been running for a few weeks, three metrics tell you whether it's working:

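Bypass rate. What percentage of commits bypass the hook with --no-verify? Git does not record the flag, so in practice this means counting documented bypasses in commit messages plus commits that the CI scan flags. A bypass rate above 10% suggests the hook is generating too many false positives or too much friction. Investigate which developers are bypassing most frequently and why.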

Suppression rate. What percentage of findings are suppressed rather than fixed? High suppression rates indicate either noisy rules or developers treating suppression as the default response. Review suppressions in code review and push back on suppression-without-justification.

Secrets found in CI despite the hook. If your CI pipeline also runs a secrets scan and finds things the pre-commit hook didn't catch, those are false negatives worth understanding. Each one is an opportunity to improve the hook's coverage.
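The arithmetic behind these three metrics is simple enough to sketch. The hook_metrics function and its argument names are assumptions for illustration; the counts themselves would have to come from commit history, review data, and CI scan results.

```python
def rate(part: int, total: int) -> float:
    """Fraction of total, or 0.0 when there is nothing to divide by."""
    return part / total if total else 0.0


def hook_metrics(
    commits: int,
    documented_bypasses: int,
    findings: int,
    suppressed: int,
    ci_only_findings: int,
) -> dict:
    """Summarise bypass rate, suppression rate, and CI-only findings."""
    bypass_rate = rate(documented_bypasses, commits)
    return {
        "bypass_rate": bypass_rate,
        "suppression_rate": rate(suppressed, findings),
        # Secrets CI caught that the hook missed (false negatives)
        "ci_only_findings": ci_only_findings,
        # 10% is the warning level suggested above
        "bypass_rate_too_high": bypass_rate > 0.10,
    }
```

For example, 30 documented bypasses across 200 commits gives a bypass rate of 15%, which is above the 10% level and worth investigating.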

The hook is not a complete solution — it's the first line of defence. CI scanning is the second. Periodic full history scanning is the third. Each layer catches what the previous one misses.

The Broader Point: Shift Left Has a UX Requirement

"Shift left", meaning catching security issues earlier in the development lifecycle, is the right strategy. The usual argument from the economics of security defects is that earlier detection means lower remediation cost.

But shift left only works if the shifted controls are actually used. A pre-commit hook that developers disable after the first false positive has shifted nothing. A CI gate that gets bypassed in every release has shifted nothing.

The investment in UX — the careful output format, the clear suppression path, the fast scan, the explainable findings — is not cosmetic. It's what determines whether the security control actually operates or sits dormant in the repository while credentials quietly accumulate in git history.

Security controls that developers trust are security controls that get used. That's the only metric that matters.

The pre-commit hook implementation is in hooks/pre-commit at github.com/pgmpofu/secrets-detector.

Last article in the series: I ran the secrets detector against my own repositories — here's what it actually found, the false positives I encountered, and what the real-world numbers looked like.
