DEV Community

Juan Torchia

Posted on • Originally published at juanchi.dev

I Dumped Linux's Entire Git History Into a Database — and What I Found Felt Like Archaeology

I made a mistake that took me years to recognize as a mistake: I treated git history as something you scroll through when something breaks, then close.

I'm not saying this to beat myself up. I'm saying it because most devs do exactly the same thing. And when you finally treat it as data — as rows in a table you can query, filter, aggregate — you end up looking at your own work like it belongs to a stranger. And that's unsettling in the best possible way.

The whole thing was triggered by a post about pgit, a project that loads the entire Linux kernel history into PostgreSQL. The guy queried 1.2 million commits with SQL. Found authorship patterns, merge velocity by subsystem, who commits at what time of day. Software archaeology in real time.

I read that on a Saturday afternoon and three hours later I was doing the same thing with my own repos.

Linux kernel git history with pgit: what it is and why it matters

pgit is conceptually simple: it takes the output of git log — with all its fields: author, timestamp, modified files, diff size, message — and inserts it into relational tables. Then you write SQL on top.

What sounds obvious when you explain it that way is revolutionary in practice. Because git log --oneline gives you a list. PostgreSQL gives you a model.

The difference is the difference between reading a book and being able to grep across every book you've ever read.

The Linux kernel has data going back to 1991. Linus Torvalds has commits from before I even knew computers existed. There are commits from people who have since died. There are technical decisions you can trace back to specific conversations in a specific week of a specific year. It's digital archaeology with perfect stratigraphy.

But the kernel belongs to someone else. My own stuff interested me more.

How I built my own version with personal repos

I didn't use pgit directly — I adapted the idea. Same concept: git log with a custom format, piped to a script that parses and inserts into PostgreSQL.

Here's the schema I put together:

-- Main commits table
CREATE TABLE commits (
  hash        CHAR(40) PRIMARY KEY,
  repo        TEXT NOT NULL,           -- which repo this came from
  author      TEXT NOT NULL,
  email       TEXT NOT NULL,
  date        TIMESTAMPTZ NOT NULL,
  message     TEXT NOT NULL,
  files       INTEGER DEFAULT 0,       -- how many files were touched
  insertions  INTEGER DEFAULT 0,
  deletions   INTEGER DEFAULT 0
);

-- Indexes for the queries I'll run often
CREATE INDEX idx_commits_date   ON commits(date);
CREATE INDEX idx_commits_repo   ON commits(repo);
CREATE INDEX idx_commits_author ON commits(author);

And the ingestion script:

#!/bin/bash
# ingest_repo.sh — loads a repo's history into postgres

REPO_PATH=$1
REPO_NAME=$2
DB_URL=${DATABASE_URL:-"postgresql://localhost/gitarchive"}

if [ -z "$REPO_PATH" ] || [ -z "$REPO_NAME" ]; then
  echo "Usage: ./ingest_repo.sh /path/to/repo repo-name"
  exit 1
fi

cd "$REPO_PATH" || exit 1

# Header format: hash|author|email|iso-date|subject
# --numstat appends one "insertions<TAB>deletions<TAB>path" line per file
git log \
  --format="%H|%an|%ae|%aI|%s" \
  --numstat \
  | awk -v repo="$REPO_NAME" '
    # Header line: a 40-hex hash followed by the pipe-separated fields
    /^[0-9a-f]{40}\|/ {
      # Flush the previous commit before starting the next one
      if (hash != "") print hash"|"repo"|"author"|"email"|"date"|"msg"|"files"|"ins"|"del
      split($0, a, "|")
      hash=a[1]; author=a[2]; email=a[3]; date=a[4]; msg=a[5]
      files=0; ins=0; del=0
      next
    }
    # numstat line: insertions, deletions, path ("-" for binary files)
    /^[0-9-]+\t[0-9-]+\t/ {
      if ($1 != "-") ins += $1
      if ($2 != "-") del += $2
      files++
      next
    }
    END {
      # Do not drop the last commit in the log
      if (hash != "") print hash"|"repo"|"author"|"email"|"date"|"msg"|"files"|"ins"|"del
    }
  ' \
  | psql "$DB_URL" -c "
    COPY commits(hash,repo,author,email,date,message,files,insertions,deletions)
    FROM STDIN
    WITH (FORMAT csv, DELIMITER '|')
  "

echo "Done: $REPO_NAME ingested."

It's not perfect — commit messages with pipes in them will break it, I know. But for exploratory analysis it works fine.
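If you ever want to make it robust, one cheap fix is a delimiter that can't appear in a subject line: git's %x1f placeholder expands to the ASCII unit separator (octal 037). A minimal sketch of the idea; the sample line below is synthetic, standing in for real git log output:

```shell
#!/bin/bash
# In the real pipeline the input would come from:
#   git log --format="%H%x1f%an%x1f%ae%x1f%aI%x1f%s"
# Here we fake one such line, with a pipe inside the subject,
# and show that splitting on the unit separator keeps it intact.
US=$(printf '\037')   # ASCII 0x1f, what %x1f expands to
line="abc123${US}Juan Torchia${US}juanchi@gmail.com${US}2024-01-01T10:00:00-03:00${US}fix | wip"
printf '%s\n' "$line" | awk -F '\037' '{ print $5 }'
# → fix | wip
```

On the PostgreSQL side you'd match it with DELIMITER E'\x1f' in the COPY options.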

I ingested nine of my repos. Freelance projects, experiments, the monorepo from my current job. Total: 4,847 commits between 2020 and 2024.

What I found: the uncomfortable parts

I started with innocent queries:

-- What time of day do I commit most?
SELECT
  EXTRACT(HOUR FROM date) AS hour,
  COUNT(*) AS count,
  ROUND(AVG(insertions + deletions)) AS avg_lines
FROM commits
WHERE author LIKE '%Torchia%'
GROUP BY hour
ORDER BY count DESC;

Results: my peaks are at 11am and 10pm. Fine so far. But the average lines per commit at 10pm is double what it is at 11am. I commit more at night with bigger changes. That sounds productive until you look at the message quality:

-- Late-night commit messages, most recent first
SELECT
  EXTRACT(HOUR FROM date) AS hour,
  message,
  insertions + deletions AS lines_changed
FROM commits
WHERE
  author LIKE '%Torchia%'
  AND EXTRACT(HOUR FROM date) BETWEEN 21 AND 23
ORDER BY date DESC
LIMIT 20;

The results were embarrassing enough that I'm not pasting them here. "fix", "wip", "no idea what happened but it works", "fix from before". I commit more at night, with more changes, and with less care about communicating what I actually did. Perfect correlation with my worst self as a programmer.

Then I looked for file patterns:

-- What categories do I touch most?
-- (A per-file table would do this properly; this approximates from messages)
SELECT
  CASE
    WHEN message ILIKE '%.tsx%' OR message ILIKE '%component%' THEN 'frontend'
    WHEN message ILIKE '%.sql%' OR message ILIKE '%migration%' THEN 'database'
    WHEN message ILIKE '%docker%' OR message ILIKE '%deploy%' THEN 'infra'
    WHEN message ILIKE '%test%' OR message ILIKE '%.spec%' THEN 'tests'
    ELSE 'other'
  END AS category,
  COUNT(*) AS commits,
  SUM(insertions) AS lines_added
FROM commits
WHERE author LIKE '%Torchia%'
GROUP BY category
ORDER BY commits DESC;

Result: the "tests" category is 3% of the total. The team talks about testing in every retro. I commit tests 3% of the time. The data doesn't lie the way you'd want it to.

The most revealing one was this:

-- Commit velocity by project — where did I lose my rhythm?
SELECT
  repo,
  DATE_TRUNC('month', date) AS month,
  COUNT(*) AS commits_that_month,
  MAX(date) - MIN(date) AS month_span
FROM commits
GROUP BY repo, month
ORDER BY repo, month;

There's a project where I committed 180 times in November 2022 and zero in December. Literally zero. What the data doesn't tell me is why — but I know: that project burned me out. Seeing that cutoff so sharply defined in a SQL query is different from vaguely remembering it. It's like seeing a scar show up on an X-ray.

The gotchas I didn't anticipate

Encoding will break your ingestion. Old repos have messages in latin-1, badly declared UTF-8, weird characters in author names. Add iconv -f UTF-8 -t UTF-8 -c to the pipeline to sanitize before inserting.
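Concretely, that's one more stage in the pipe, between git log and psql. A quick way to see what the -c flag does; the stray byte below is a simulated latin-1 "é" injected for illustration:

```shell
# iconv -c silently drops byte sequences that are not valid UTF-8
# instead of aborting the whole ingestion. 0xE9 is "é" in latin-1
# but an invalid sequence in UTF-8, so it disappears from the output.
printf 'caf\351 con leche\n' | iconv -f UTF-8 -t UTF-8 -c
```

Dropping bytes loses data, obviously. If you know the real source encoding, convert it instead of discarding: -f ISO-8859-1 -t UTF-8.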

Merges will inflate your numbers. A merge commit can carry thousands of changed lines that actually belong to another branch. Filter with --no-merges if you want to analyze real work, or keep them separate with an is_merge BOOLEAN column.
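For the is_merge column you don't need anything beyond git log itself: the %P placeholder prints the parent hashes, and any commit with more than one parent is a merge. A sketch on synthetic output of git log --format="%H %P":

```shell
# Each line is "<hash> <parent> [<parent>...]"; a root commit has no
# parent at all. More than two fields means multiple parents: a merge.
printf 'aaa111 p1 p2\nbbb222 p1\nccc333\n' | awk 'NF > 2 { print $1 }'
# → aaa111
```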

Timestamps lie if your team is remote. Commits carry the committer's timezone. Someone in UTC-3 committing at 11pm shows up as 2am UTC. For time-of-day analysis, normalize everything to one timezone before aggregating.
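You can see the distortion with nothing more than GNU date (assumed here; BSD/macOS date spells this differently). In the SQL itself, the equivalent lever is date AT TIME ZONE 'your/zone' before the EXTRACT.

```shell
# An 11pm commit made at UTC-3, re-rendered in UTC: the hour bucket
# moves from 23 to 02, a different day and a different "time of day".
TZ=UTC date -d '2022-11-15T23:10:00-03:00' '+%H'
# → 02
```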

Author identity is a mess. I have commits as "Juan Torchia", "juanchi", "jtorchia", "Juan T.", plus my work email versus my personal one. Without normalization, SQL will tell you four different people worked on the same repo. I built an aliases table:

-- Table to normalize author identities
CREATE TABLE author_aliases (
  original_email  TEXT PRIMARY KEY,
  canonical_name  TEXT NOT NULL
);

INSERT INTO author_aliases VALUES
  ('juanchi@gmail.com',      'Juan Torchia'),
  ('juan@work.com',          'Juan Torchia'),
  ('jtorchia@client.com',    'Juan Torchia');

-- Query with join to normalize
SELECT
  COALESCE(aa.canonical_name, c.author) AS real_author,
  COUNT(*) AS total_commits
FROM commits c
LEFT JOIN author_aliases aa ON c.email = aa.original_email
GROUP BY real_author
ORDER BY total_commits DESC;

This normalization work reminded me of what I went through migrating a monorepo from npm to pnpm — the tedious part is always cleaning up historical data, not the migration itself. The install time dropping from 14 minutes to 90 seconds was the sexy part; the hours before that deduplicating dependencies were not.

FAQ — Frequently asked questions about analyzing git history with SQL

Do I need pgit specifically, or can I do this with any database?

You don't need pgit. It's an inspiration, not a requirement. With a custom git log --format and any parsing script you can fill a table in PostgreSQL, SQLite, or even DuckDB (which is ideal here because you can query CSV files directly without even creating tables). pgit is an opinionated Perl implementation; the idea is completely portable.

How much space does the Linux kernel history take in PostgreSQL?

The full kernel history with basic metadata (no full diffs) is around 2–4 GB. If you include the content of every patch, you're talking terabytes. For normal personal repos — thousands of commits, not millions — expect a 50–200 MB database. Totally manageable on any small Railway or Supabase instance.

Does this work for analyzing team work or is it just for personal projects?

Works great for teams, but you have to be careful with context. A low commit count doesn't mean someone is doing less work — it can mean they make larger commits, work on long-lived branches, or are in a role that doesn't require frequent commits (code review, architecture, documentation). Data is data; interpretation requires human context. Using this for individual performance metrics without that context is a terrible idea and a reliable way to destroy team trust.

What about the content of commits, not just the metadata?

If you want to analyze the actual diff content — what changed inside the files — the volume explodes fast. The most practical approach is to store only metadata in SQL and use git show <hash> on demand to retrieve the diff when you need it. Alternatively, you can store the full diff in a TEXT column or JSONB field, but for large repos it'll be slow and expensive on storage. For full-text search over diffs, something like Elasticsearch or even PostgreSQL's FTS can help.

Are there already-built tools for this without having to wire up the pipeline yourself?

Yes. git-quick-stats gives you fast analysis without a database. Hercules is more sophisticated and analyzes code burndown by author. gitinspector is another classic. The difference between those and building your own PostgreSQL pipeline is flexibility: with SQL you can answer any question that occurs to you, not just the ones the tool anticipated. If you're exploring, SQL wins. If you want a standard report, use the tools.

Does this relate to how LLMs analyze code?

Conceptually yes, and it's a genuinely interesting area. Some agent orchestration pipelines — like what we looked at with Scion — use git history as context so agents can understand how a codebase evolved. When an agent can query "what files were historically modified alongside this module," it's doing git archaeology in exactly the same way. The difference is that instead of SQL they're using embeddings and semantic search, but the data source is the same: git history treated as data.

What I took away from that afternoon

There's something unsettling about querying yourself with SQL. Not in an anxious way — in the sense that it gives you information your memory doesn't. I remembered the project that burned me out. I didn't remember the surgical precision with which I stopped committing. That's in the data. Data doesn't have memory bias.

The same thing I apply when analyzing real accessibility versus Lighthouse scores applies here: the numbers tell you something, but not everything. A low commit count can be discipline or it can be creative block. You know which one it is — the query doesn't.

What the history does tell you with certainty is where you put your attention. And that, over enough time, is a pretty honest portrait of who you are as a programmer.

My repos told me I commit better in the morning, that I barely write tests, and that when a project excites me the rhythm is impossible to miss in the data. Three things I "already knew" — but seeing them in a GROUP BY makes them hard to rationalize away.

If you have repos with any real history — even two or three years — I'd recommend spending an afternoon on this. Not to optimize yourself. To understand yourself.

The pipeline I built is on GitHub — if you want the full script, shoot me a message. And if you do this with your own repos and find something interesting (or uncomfortable), I genuinely want to know what came up.


If you want more software archaeology: I brought the same exploratory spirit to building the x509 certificate viewer extension, where instead of parsing openssl output by hand I treated it as a data problem. And when I got tired of waiting for someone to maintain the HAProxy extension, the story of how I got there also has its share of embarrassing 11pm commit messages. The data doesn't lie.
