How I built a tool that uses git history to find the files most likely to break your codebase

Every codebase has files that look fine but will take down prod if you breathe on them wrong. The problem is there's no obvious marker on them. No comment that says "this couples to six other things." No warning in the code review checklist. You only find out after the incident.

I wanted to build something that surfaces these files automatically. The result is fearmap, which mines git history to classify every file as LOAD-BEARING, RISKY, DEAD, or SAFE. This post is about the methodology and some things I learned building it.


Why git history instead of static analysis

The obvious approach is static analysis. Parse the imports, build a dependency graph, flag the highly connected nodes. It works, but it misses a lot.

Static analysis tells you about declared dependencies. It doesn't tell you about the hidden ones. Two files that always get edited together in the same commit are coupled, even if nothing in the code makes that explicit. That coupling lives in developer behaviour, not in the import graph. Configuration files, shared test fixtures, files that have to stay in sync manually -- these show up clearly in git history and are invisible to static analysis.

There's also the human signal. A file that 25 different developers have touched over three years is scarier than one maintained by a single person who understands it deeply. That's not about code quality; it's about accumulated assumptions and the risk of changes conflicting with things nobody remembers anymore.


The heat formula

Each file gets a score from 0 to 100:

heat = churn * 0.40 + coupling * 0.35 + authors * 0.15 + size * 0.10

All four components are min-max normalised across the repo before weighting, so the score is always relative to that specific codebase. A file with churn of 50 commits means something different in a repo with 100 total commits versus one with 10,000.

The weights came from thinking about which signals are actually predictive. Churn is the strongest signal -- a file that changes constantly is either doing too much or is genuinely central to the system. Coupling is almost as strong, and is often the most surprising signal because it's invisible without tooling. Authors contributes less but captures knowledge diffusion risk. Size is the weakest predictor and mostly acts as a tiebreaker.
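The normalise-then-weight step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not fearmap's actual implementation; the function names and sample numbers are made up:

```python
def minmax(values):
    """Min-max normalise raw values to 0..1 across the repo."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def heat_scores(metrics):
    """metrics: one (churn, coupling, authors, size) tuple per file.
    Returns 0-100 heat scores using the weights from the formula above."""
    columns = [minmax(list(col)) for col in zip(*metrics)]
    weights = (0.40, 0.35, 0.15, 0.10)
    return [
        100 * sum(w * columns[k][i] for k, w in enumerate(weights))
        for i in range(len(metrics))
    ]

# Three hypothetical files: a hot one, a cold one, a middling one.
scores = heat_scores([(50, 12, 25, 900), (2, 0, 1, 100), (10, 3, 4, 300)])
```

Because everything is normalised per repo, the hottest file on every axis lands at 100 and the coldest at 0, regardless of the repo's absolute commit counts.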


How coupling detection works

This was the hardest part to get right.

The basic idea is a co-occurrence matrix. For each commit, collect the set of files changed. For every pair of files in that set, increment their co-occurrence count. After processing all commits, you have a count for every pair: how many times did these two files change in the same commit.

Coupling strength for a pair is normalised by the churn of the file you're looking at:

coupling_strength = co_change_count / file_churn_count

So if file A has changed 10 times and was in the same commit as file B 8 of those times, the coupling strength is 80%. That's a strong signal.
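In pure Python, the whole pipeline is a couple of Counters. This is a sketch of the co-occurrence idea as described above (the function names are mine, and real input would come from parsing git log):

```python
from collections import Counter
from itertools import combinations

def coupling_counts(commits):
    """commits: an iterable of per-commit file sets.
    Returns (per-file churn counts, per-pair co-change counts)."""
    churn, pairs = Counter(), Counter()
    for files in commits:
        churn.update(files)
        # increment every unordered pair of files touched in the same commit
        for a, b in combinations(sorted(files), 2):
            pairs[(a, b)] += 1
    return churn, pairs

# The A/B example from above: 10 commits touch app.py, 8 of them also touch ctx.py.
commits = [{"app.py", "ctx.py"}] * 8 + [{"app.py"}] * 2
churn, pairs = coupling_counts(commits)
strength = pairs[("app.py", "ctx.py")] / churn["app.py"]  # 8 / 10 = 0.8
```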

The raw shell version is a git log piped into an awk script that groups file lists by commit and emits pairs:

git log --since="12 months ago" --name-only --format="%n---SEP---" | awk '
/^---SEP---/ { if (n>1) for(i=0;i<n;i++) for(j=i+1;j<n;j++) print files[i]"|"files[j]; n=0; next }
NF>0 && !/^---SEP---/ { files[n++]=$0 }
END { if (n>1) for(i=0;i<n;i++) for(j=i+1;j<n;j++) print files[i]"|"files[j] }
' | sort | uniq -c | sort -rn

The Python CLI version uses pydriller instead, which gives cleaner access to commit metadata and lets you filter out formatting-only commits and dependency bumps before they pollute the coupling signal.


Classification thresholds

Once every file has a heat score and a coupling partner count, classification is straightforward:

  • LOAD-BEARING: heat >= 70 AND coupling partners >= 3
  • RISKY: heat >= 40 OR coupling partners >= 2
  • DEAD: no changes in 18+ months AND churn = 0 AND no detected callers
  • SAFE: everything else
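The rules above translate directly into a cascade of conditionals. One assumption in this sketch: the checks are evaluated top-down, so DEAD and LOAD-BEARING win over the broader RISKY rule (the source lists the categories but not the evaluation order):

```python
def classify(heat, partners, churn, months_idle, has_callers):
    """Classify a file from its heat score, coupling partner count,
    churn, months since last change, and detected-caller flag."""
    if churn == 0 and months_idle >= 18 and not has_callers:
        return "DEAD"
    if heat >= 70 and partners >= 3:
        return "LOAD-BEARING"
    if heat >= 40 or partners >= 2:
        return "RISKY"
    return "SAFE"
```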

The DEAD classification has a secondary check. Before marking a file as dead, the tool greps for its name in CI configs, Dockerfiles, shell scripts, and YAML files. Files can be referenced dynamically or called from outside the repo -- git history alone can't see that.
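A liveness check like that can be approximated with a recursive glob over the config-like files and a substring search for the candidate's name. The glob patterns here are my assumptions, not fearmap's exact list:

```python
from pathlib import Path

# Assumed set of "files that might reference things dynamically".
CHECK_GLOBS = ("*.yml", "*.yaml", "Dockerfile", "*.sh")

def referenced_elsewhere(repo_root, candidate):
    """Return True if the candidate file's name appears in any config or script."""
    name = Path(candidate).name
    for pattern in CHECK_GLOBS:
        for path in Path(repo_root).rglob(pattern):
            if path.name == name:
                continue  # don't count the file itself
            try:
                if name in path.read_text(errors="ignore"):
                    return True
            except OSError:
                continue
    return False
```

A plain substring match will produce some false positives (a mention in a comment keeps a file "alive"), but for this check a false positive is the safe direction: it only prevents a DEAD label.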


What the Flask dogfood run showed

I ran it against the Flask source as a real-world test. The top results made sense:

src/flask/app.py scored 90/100 and came out LOAD-BEARING. 27 different authors, 10 commits in the last year, and coupling to ctx.py, templating.py, sessions.py, helpers.py and the sansio base class. That's the central WSGI application object, so the classification is correct.

src/flask/sansio/app.py scored 84/100. This is the I/O-agnostic base class that was split out to support async runtimes. 13 coupling partners. LOAD-BEARING.

The more interesting signal was sansio/scaffold.py showing 100% co-change rate with testing.py. They've changed in the same commit every single time in the last year. That's a coupling that's totally invisible from the import graph but immediately obvious from the history.


Two delivery modes

The tool ships in two ways and they work differently.

The Python CLI uses pydriller for git data and the Anthropic SDK for explanations. It's for CI pipelines and automation. fearmap run --local gives you the scores and classifications without any API calls. fearmap run --yes adds plain-English intent/danger/ripple explanations for the high-heat files.

The Claude Code version is a slash command. Drop one markdown file into .claude/commands/ and type /fearmap. Claude Code runs the git commands with its bash tool, applies the scoring formula, reads the dangerous files, and writes FEARMAP.md. No Python, no API key, no external calls -- Claude Code is already the reasoning engine so there's nothing to call.

The interesting thing about the second approach is that the "slash command" is really just a structured task description. The instructions tell Claude Code what to do step by step. It turns out Claude Code is quite good at following a precise multi-step analysis algorithm when the steps are written clearly enough.


What I'd do differently

The 12-month history window is a default that works for active repos but is too short for stable libraries that change rarely. Repos like Flask have files that haven't been touched in two years but are absolutely load-bearing. A better default would adapt to the repo's commit frequency.

The coupling detection also treats all commits equally. A massive bootstrap commit that adds 40 files at once will create false coupling pairs between all of them. Filtering out initial commits and large batch additions would clean up the signal considerably.
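The fix is a one-line filter in front of the co-occurrence step. The cutoff of 30 files is an arbitrary placeholder, not a tested value:

```python
def drop_batch_commits(commits, max_files=30):
    """Skip commits that touch more than max_files files (bootstrap imports,
    vendored-dependency drops) so they don't create false coupling pairs."""
    return [files for files in commits if len(files) <= max_files]

cleaned = drop_batch_commits([{f"f{i}.py" for i in range(40)}, {"a.py", "b.py"}])
```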


GitHub: Fearmap Repo