How to Track Legal Document Changes with Git (And Why It Breaks)

#git #tutorial #productivity #opensource

If you've ever tried to diff two versions of a legal document, you know the pain. Word's "Track Changes" is a nightmare, PDF diffs are basically useless, and by the time you figure out what actually changed between v3 and v7 of a contract, you've gone through a pot of coffee and lost the will to live.

So when I saw the legalize-es project on Hacker News — Spanish legislation tracked as a Git repository — my first thought was: "Obviously. Why isn't everything done this way?" My second thought, after actually trying to build something similar for a client's compliance docs, was: "Oh. That's why."

Let me walk you through the real problems you'll hit when storing legal or regulatory text in Git, and how to solve them.

The Core Problem: Legal Text Isn't Code

Git was designed for source code — short lines, clear structure, mostly ASCII. Legal documents are long paragraphs, deeply nested numbering schemes, and full of special characters. When you naively dump a law into a .txt file and start committing amendments, you get diffs that look like this:

- Artículo 15. Las entidades a que se refiere el artículo anterior deberán
- cumplir con las obligaciones establecidas en el presente título, sin
- perjuicio de las responsabilidades que pudieran corresponderles.
+ Artículo 15. Las entidades a que se refiere el artículo anterior, así
+ como aquellas contempladas en el artículo 12 bis, deberán cumplir con
+ las obligaciones establecidas en el presente título y en la normativa
+ complementaria que se desarrolle, sin perjuicio de las responsabilidades
+ que pudieran corresponderles.

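That diff is technically correct but practically useless. The entire paragraph shows as changed when really only two clauses were inserted. Because the text is hard-wrapped, a few inserted words reflow every line after the insertion point, and Git's line-based diff reports all of them as changed.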

Solution 1: One Sentence Per Line

This is the single biggest win. Structure your legal text files with one sentence (or one clause) per line:

Artículo 15.
Las entidades a que se refiere el artículo anterior deberán cumplir con las obligaciones establecidas en el presente título,
sin perjuicio de las responsabilidades que pudieran corresponderles.

Only the line breaks move; the wording and punctuation of the law stay exactly as published. Now when an amendment adds a clause, your diff is clean:

 Artículo 15.
-Las entidades a que se refiere el artículo anterior deberán cumplir con las obligaciones establecidas en el presente título,
+Las entidades a que se refiere el artículo anterior, así como aquellas contempladas en el artículo 12 bis, deberán cumplir con las obligaciones establecidas en el presente título y en la normativa complementaria que se desarrolle,
 sin perjuicio de las responsabilidades que pudieran corresponderles.

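Much better. The change is confined to the one line that was actually amended, and the untouched text around it stays put as context. Pinpointing the exact words inside that line is a job for word-level diffing, covered below. This is the same principle behind Semantic Line Breaks, a convention the documentation community has used for years.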
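If you already have hard-wrapped text, converting it by hand gets old fast. Here is a minimal sketch of a converter, assuming that a sentence ends at `.`, `;`, `:`, `?` or `!` followed by whitespace and a capital letter. The function name `to_sentence_lines` and the boundary rule are my own assumptions, not part of any existing tool, and the rule will mis-split abbreviations followed by a capital (for example "Sr. Presidente"), so review the output before committing it.

```python
import re

# Split on whitespace that follows sentence-ending punctuation and precedes
# an uppercase letter (including accented Spanish capitals and ¿ / ¡).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.;:?!])\s+(?=[A-ZÁÉÍÓÚÑÜ¿¡])")


def to_sentence_lines(paragraph: str) -> str:
    """Re-wrap a hard-wrapped paragraph as one sentence per line."""
    # Collapse the existing line breaks and runs of spaces first.
    flat = " ".join(paragraph.split())
    return "\n".join(_SENTENCE_BOUNDARY.split(flat))


if __name__ == "__main__":
    wrapped = "Artículo 15. Las entidades deberán\ncumplir. Sin perjuicio."
    print(to_sentence_lines(wrapped))
```

It only handles sentence boundaries. Breaking a long sentence at clause level is still an editorial decision, because commas are too ambiguous to split on automatically.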

Solution 2: Structure Your Repo Like the Law Is Structured

Don't throw an entire legal code into one file. Mirror the actual hierarchy:

legislacion/
├── constitucion/
│   ├── titulo_preliminar.md
│   ├── titulo_I/
│   │   ├── capitulo_1.md
│   │   └── capitulo_2.md
│   └── ...
├── codigo_penal/
│   ├── libro_I/
│   │   ├── titulo_1.md
│   │   └── titulo_2.md
│   └── ...
└── ley_organica_3_2018/   # data protection
    └── ...

This way, git log -- codigo_penal/libro_I/titulo_1.md gives you the amendment history of that specific title (add --follow if the file was ever renamed). You can use git blame to see which reform last touched each line, and therefore each article. That's powerful.
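Splitting a consolidated text into that layout can be scripted. The sketch below assumes each title starts with a line of its own such as "TÍTULO I", which is my assumption about the source format rather than something every official database guarantees. The names `HEADING`, `split_by_title` and `preambulo.md` are hypothetical.

```python
import re

# Assumed heading format: a line such as "TÍTULO I" or "TÍTULO PRELIMINAR".
HEADING = re.compile(r"^TÍTULO[ \t]+(\S+)[ \t]*$", re.MULTILINE)


def split_by_title(text: str) -> dict[str, str]:
    """Map a file name like 'titulo_I.md' to the text of that title."""
    matches = list(HEADING.finditer(text))
    files: dict[str, str] = {}

    # Keep anything before the first heading (preamble, enacting formula).
    head = text[: matches[0].start()] if matches else text
    if head.strip():
        files["preambulo.md"] = head.strip() + "\n"

    # Each title runs from its heading to the start of the next one.
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        files[f"titulo_{match.group(1)}.md"] = text[match.start():end].strip() + "\n"
    return files
```

Title numbers restart inside each book, so in a multi-book code you would run this once per book and write the results into that book's directory to avoid name collisions.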

Solution 3: Automate the Ingestion Pipeline

Here's where most projects like this get stuck. Getting the text into the repo in the first place is the hardest part. Official legal databases (Spain's BOE, France's Légifrance, the US Code) publish in various formats — XML, HTML, PDF, or just plain web pages.

A basic scraper-to-git pipeline looks like this:

import os
import subprocess
from pathlib import Path

def commit_law_version(law_id: str, text: str, effective_date: str, summary: str) -> None:
    """Write a law version to disk and commit it with metadata."""
    path = Path(f"legislacion/{law_id}.md")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")

    subprocess.run(["git", "add", str(path)], check=True)

    # Use the effective date as the commit date, not today
    # This makes git log show the actual legislative timeline
    commit_msg = f"{summary}\n\nEffective date: {effective_date}\nSource: BOE-{law_id}"
    env = {
        **os.environ,  # keep PATH, HOME and any existing Git settings
        "GIT_AUTHOR_DATE": effective_date,
        "GIT_COMMITTER_DATE": effective_date,
    }
    # Note: git commit exits non-zero when the text is unchanged,
    # so check=True raises CalledProcessError in that case.
    subprocess.run(["git", "commit", "-m", commit_msg], env=env, check=True)

The key trick: set GIT_AUTHOR_DATE and GIT_COMMITTER_DATE to the law's effective date. This makes git log read like a legislative timeline instead of showing when you ran your scraper.
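One detail worth pinning down is the date format. A bare date leaves the time of day and timezone unspecified, so here is a minimal sketch that converts a `YYYY-MM-DD` string into Git's internal `<unix seconds> <utc offset>` form at midnight UTC. The helper name `git_date` is hypothetical. The pre-1970 check reflects my understanding that Git's date parser rejects dates before the Unix epoch; treat that as an assumption and test it against your Git version.

```python
from datetime import date, datetime, timezone


def git_date(effective_date: str) -> str:
    """Turn 'YYYY-MM-DD' into Git's internal '<unix seconds> <utc offset>' form."""
    day = date.fromisoformat(effective_date)  # raises ValueError on malformed input
    if day.year < 1970:
        # Assumption: Git does not accept pre-epoch dates, so fail loudly here
        # instead of letting `git commit` reject the value later.
        raise ValueError(f"{effective_date} is before 1970")
    midnight_utc = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return f"{int(midnight_utc.timestamp())} +0000"
```

You would pass `git_date(effective_date)` as the value of both environment variables. For laws older than 1970, one option is to keep the real date in the commit message only.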

Solution 4: Use Git's Word Diff for Prose

When you do need to review changes, standard git diff is noisy for prose. Use word-level diffing instead:

# Word-level diff — highlights changed words, not whole lines
git diff --word-diff

# Even better: color only, no brackets
git diff --word-diff=color

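# There is no .gitattributes switch for this: '*.md diff=word' would only
# select a diff driver named "word", which Git does not ship.
# To avoid retyping the flag, define an alias instead:
git config alias.wdiff "diff --word-diff=color"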

This is a game-changer for legal text. Instead of seeing entire paragraphs marked as changed, you see the individual words that were added or removed.
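To make the idea concrete outside of Git, here is a small stand-in built on Python's `difflib`. It is not Git's implementation, just a sketch of the same technique: split both versions into words, diff the word sequences, and mark removals and additions with the `[-...-]` and `{+...+}` markers that `git diff --word-diff` prints. The function name `word_diff` is my own.

```python
import difflib


def word_diff(old: str, new: str) -> str:
    """Return a word-level diff using [-removed-] and {+added+} markers."""
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("deberán cumplir con las obligaciones",
                    "deberán cumplir siempre con las obligaciones"))
```

This is handy in a scraper pipeline when you want a human-readable summary of an amendment for the commit message without shelling out to Git.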

The Deeper Problem: Branching Reality

Here's something I didn't anticipate when I first tried this approach. Laws aren't linear. A reform might amend five different codes simultaneously. An article might be "temporarily suspended" (not deleted, not modified — suspended). Some changes have future effective dates.

Git branches can model some of this:

main = currently effective law
Feature branches for pending reforms
Tags for specific reform milestones (ley-organica-3-2018, reforma-2024)

But suspended articles? Retroactive amendments? Those don't map cleanly to any Git concept. You'll need metadata — frontmatter in your markdown files, a separate JSON index, something. Git tracks what changed; you'll need to layer on why and when it takes effect.
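As a rough illustration of that metadata layer, here is a minimal sketch of a per-article status record that could be serialized to frontmatter or a JSON index. Every name in it (`ArticleStatus`, `Suspension`, `in_force`) and the rule that end dates are exclusive are assumptions of mine, not an established schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Suspension:
    start: date
    end: Optional[date] = None  # None means the suspension is open-ended


@dataclass
class ArticleStatus:
    article_id: str
    effective_from: date
    repealed_on: Optional[date] = None
    suspensions: list[Suspension] = field(default_factory=list)

    def in_force(self, on: date) -> bool:
        """True if the article applies on the given day."""
        if on < self.effective_from:
            return False  # not yet effective (future-dated reform)
        if self.repealed_on is not None and on >= self.repealed_on:
            return False
        # End dates are treated as exclusive: the article applies again on `end`.
        for s in self.suspensions:
            if s.start <= on and (s.end is None or on < s.end):
                return False
        return True
```

Git still answers "what did the text say in this commit", while a record like this answers "did that text actually apply on a given day".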

Prevention Tips (For Your Own Projects)

If you're thinking about tracking any kind of regulatory or legal text in Git:

Start with the sentence-per-line convention from day one. Reformatting later means a massive commit that ruins your git blame history.
Use UTF-8 everywhere. Legal text in non-English languages will bite you with encoding issues. Set it explicitly in your scripts.
Commit messages are your changelog. Write them like a legislative summary: what changed, which reform introduced it, the official reference number.
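Don't forget a word-diff shortcut. .gitattributes cannot switch word diff on by itself (*.md diff=word just names a diff driver that doesn't exist by default), so add a Git alias for diff --word-diff=color and use it whenever you review prose.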
Automate or don't bother. Manual transcription doesn't scale and introduces errors. If you can't scrape the official source reliably, the repo will drift from reality.
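The UTF-8 tip is easy to enforce mechanically. Here is a minimal sketch of a check you could run before committing; the function name `find_non_utf8` and the default file patterns are hypothetical, so adjust them to your repo.

```python
from pathlib import Path


def find_non_utf8(root: str, patterns=("*.md", "*.txt")) -> list[Path]:
    """Return the files under root that do not decode as UTF-8."""
    bad = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return bad


if __name__ == "__main__":
    for path in find_non_utf8("legislacion"):
        print(f"not valid UTF-8: {path}")
```

A file saved as Latin-1 usually fails this check as soon as it contains an accented character, which is exactly the case you want to catch before it lands in history.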

Why This Matters Beyond Law

The legalize-es project is a great example of using developer tools outside their original domain. The same patterns work for tracking changes to terms of service, compliance documentation, medical guidelines, building codes — any corpus of text that evolves through formal amendments.

Git gives you a free, battle-tested audit trail with built-in tools for diffing, blaming, and searching history. The hard part isn't Git — it's shaping the content so Git's tools actually help instead of producing noise.

If you want to explore this further, look into GitLaw as a conceptual framework, and check out how Germany's BundesGit project tackled similar challenges with federal law. The problems are universal — and so are the solutions.