Stop Using diff on CSV Files — Build a Semantic CSV Diff in 200 Lines of Python
diff file1.csv file2.csv is almost always useless. When rows are reordered, a column moves, or one file has CRLF line endings and the other LF, it produces a firehose of noise that hides the real changes. What you actually want is "which rows were added, removed, or modified, matched by a key column." That's a totally different tool, and it's about 200 lines of Python.
📦 GitHub: https://github.com/sen-ltd/csvdiff
I kept running into the same problem at 2am. A nightly job exports users from one system and imports them into another. Something is wrong with today's sync. I have yesterday's snapshot and today's. I need to know what changed. I type:
```shell
diff yesterday.csv today.csv | less
```
And I get a wall of garbage. Every row looks different, because yesterday someone added a last_login column in position 4 and shifted everything right. Or the export sorted by email today instead of by id. Or today's file has CRLF line endings because it came through a Windows machine. diff is a line-based tool. CSVs are tabular data. The abstractions don't line up.
So I built csvdiff, a small CLI that matches rows by a key column and tells you exactly which rows were added, removed, or modified — and for modified rows, which columns changed and what the before/after values are. Stdlib only. Ships in a ~60 MB Docker image. Four output formats, including GitHub Actions annotations you can drop straight into CI.
This article is about the design decisions, not the marketing pitch. I'll show you the algorithm, the data structure, and why the whole thing is ~200 lines even with four output formats and composite keys.
The problem, concretely
Imagine you have these two files:
```
# users-before.csv
id,name,email,role
42,Alice,alice@example.com,viewer
77,Carol,carol@old.example.com,editor
99,Legacy User,legacy@example.com,viewer
100,Bob,bob@example.com,admin
```

```
# users-after.csv (rows reordered, id=42 promoted, id=77 email changed, id=99 removed, two new users)
id,name,email,role
100,Bob,bob@example.com,admin
42,Alice,alice@example.com,admin
77,Carol Chen,carol@new.example.com,editor
103,Diana,diana@example.com,viewer
104,Eve,eve@example.com,admin
```
diff on these files reports that basically every line changed, because row 100 moved to the top. Here's what csvdiff --key id says:
```
+ Added (2)
    id=103 name=Diana email=diana@example.com role=viewer
    id=104 name=Eve email=eve@example.com role=admin
- Removed (1)
    id=99 name=Legacy User email=legacy@example.com role=viewer
~ Modified (2)
    id=42 role: viewer → admin
    id=77 name: Carol → Carol Chen, email: carol@old.example.com → carol@new.example.com
= Unchanged: 1 rows
```
That's what a human actually needs at 2am.
Why not an existing tool?
I did look. There are several options, and each one was wrong for me in a specific way:
- daff — solid, cross-language, does exactly this. But it's published as a Node package, and dragging a Node runtime into a Python shop just for diffs is awkward; the default output also isn't designed for CI annotations.
- csv-diff (Simon Willison's Python tool) — closest to what I wanted, but it emits a single human-readable text format and doesn't have first-class JSON or GitHub Actions output. If I want to pipe it into a bot, I'm parsing its prose.
- pandas DataFrame.compare() — works beautifully in a notebook. In a 60 MB Docker image for a CLI tool, pandas is immediately out: it pulls in numpy and balloons the image past 300 MB. For row-level diffing we don't need any of the machinery pandas brings.
The wedge is "CLI-first, stdlib-only, CI-friendly output formats." And I wanted to write it, because the algorithm is small and I'd rather own 200 lines of pure Python than a transitive dependency on 30 MB of numeric libraries I'm not using.
The design: a DiffResult data structure
The most important decision is the one that made everything else trivial: separate the diff algorithm from the output.
diff.py is a pure function. It takes two lists of row dicts and a key spec, and returns a DiffResult dataclass. No I/O. No printing. No argparse. Tested in isolation:
```python
from dataclasses import dataclass, field

Row = dict[str, str]        # one CSV row: column name -> cell value
KeyTuple = tuple[str, ...]  # the row's values at the key columns, in order

@dataclass
class ModifiedRow:
    key: KeyTuple
    before: Row
    after: Row
    changed_columns: list[str]  # ordered for deterministic output

@dataclass
class DiffResult:
    key_columns: list[str]
    added: list[Row] = field(default_factory=list)
    removed: list[Row] = field(default_factory=list)
    modified: list[ModifiedRow] = field(default_factory=list)
    unchanged: list[Row] = field(default_factory=list)

    @property
    def has_differences(self) -> bool:
        return bool(self.added or self.removed or self.modified)
```
Once I have DiffResult, every output format is a ten-to-thirty-line pure function: format_human(result) -> str, format_json(result) -> str, format_summary(result) -> str, format_github(result) -> str. They don't call each other. They don't share state. Adding a fifth format tomorrow is mechanical.
This is the whole reason the project is 200 lines instead of 500. The alternative — a function that reads files, diffs rows, and prints output in a big switch statement — would be impossible to test end-to-end without writing fixture files, and impossible to add a new format to without duplicating the diff logic or hitting the filesystem.
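To make the "ten-to-thirty-line pure function" claim concrete, here's a sketch of what a formatter like format_summary might look like. This is my guess at its shape, with a minimal stand-in DiffResult so it runs on its own; the repo's real function may differ in details.

```python
from dataclasses import dataclass, field

# Minimal stand-in for the DiffResult dataclass above, so this sketch
# is self-contained; field names mirror the article.
@dataclass
class DiffResult:
    key_columns: list
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    modified: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)

def format_summary(result: DiffResult) -> str:
    # One grep-able line with a stable field order -- easy for a CI step
    # or a bot to parse without touching the row-level detail.
    return (
        f"added={len(result.added)} removed={len(result.removed)} "
        f"modified={len(result.modified)} unchanged={len(result.unchanged)}\n"
    )
```

Because it only reads DiffResult and returns a string, testing it is a one-line assertion with no fixture files.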
The classification loop
Here's the core of diff.py. It's boring on purpose:
```python
from collections.abc import Sequence

def diff_rows(
    before: Sequence[Row],
    after: Sequence[Row],
    *,
    key_columns: Sequence[str],
    ignore_columns: Sequence[str] = (),
) -> DiffResult:
    key_columns = list(key_columns)
    skip = set(key_columns) | set(ignore_columns)
    before_idx = _index_by_key(before, key_columns, label="file1")
    after_idx = _index_by_key(after, key_columns, label="file2")
    result = DiffResult(key_columns=key_columns)

    for key, row in after_idx.items():
        if key not in before_idx:
            result.added.append(row)

    for key, row in before_idx.items():
        if key not in after_idx:
            result.removed.append(row)
            continue
        other = after_idx[key]
        # Column order from the "before" row, then any columns only in "after".
        cols = [c for c in row.keys() if c not in skip]
        for c in other.keys():
            if c not in skip and c not in cols:
                cols.append(c)
        changed = [c for c in cols if row.get(c, "") != other.get(c, "")]
        if changed:
            result.modified.append(
                ModifiedRow(key=key, before=row, after=other, changed_columns=changed)
            )
        else:
            result.unchanged.append(row)
    return result
```
_index_by_key builds a dict keyed by the tuple of values at the key columns. Lookups are O(1), the whole classification is O(n). The cols loop preserves column order from the before row, then appends any columns that only appear in after, so if someone added an email column between yesterday and today, it still gets reported as "changed" rather than silently dropped.
Ignore columns get handled in exactly one place (skip = ...) and that's it. Composite keys work out of the box because the key is always a tuple. --key region,user_id becomes ("us", "42") and Python dict lookup handles the rest.
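In miniature, the indexing step that makes composite keys "just work" looks like this (a toy version of the _index_by_key idea, minus the duplicate check):

```python
rows = [
    {"region": "us", "user_id": "42", "role": "admin"},
    {"region": "eu", "user_id": "42", "role": "viewer"},
]
key_columns = ["region", "user_id"]  # i.e. --key region,user_id

# Each row is addressed by the tuple of its key-column values, so the
# same user_id in two regions yields two distinct dictionary keys.
index = {tuple(row.get(k, "") for k in key_columns): row for row in rows}

assert index[("us", "42")]["role"] == "admin"
assert index[("eu", "42")]["role"] == "viewer"
```

Tuples are hashable and compare element-wise, which is exactly the semantics you want from a composite primary key.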
Duplicate keys: fail loudly
What if the same key appears twice in one file? This is a real concern: "primary keys" in CSVs aren't enforced, and a stale export can absolutely contain two rows with id=42. Two options:
1. Last-wins: silently overwrite. Fast. Wrong: the user has no idea their data is ambiguous.
2. Error out: refuse to run.
I picked (2). _index_by_key raises DuplicateKeyError the moment it sees a collision:
```python
def _index_by_key(rows, key_columns, label):
    out = {}
    for row in rows:
        key = tuple(row.get(k, "") for k in key_columns)
        if key in out:
            pretty = ",".join(f"{c}={v}" for c, v in zip(key_columns, key))
            raise DuplicateKeyError(
                f"{label}: duplicate key {pretty}; pick a different --key"
            )
        out[key] = row
    return out
```
The CLI catches this and exits with code 2 (config error). If you hit it, either your data is broken or you picked the wrong key column — both things you want to know about, not things you want the tool to silently paper over.
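The catch itself is small. Here's a sketch of the shape of that CLI layer — not the repo's main() verbatim; the diff function is injected as a parameter so the snippet stands alone:

```python
import sys

class DuplicateKeyError(Exception):
    """The chosen key column(s) don't uniquely identify rows in one input."""

def run(diff, before_rows, after_rows, key_columns) -> int:
    """Map outcomes to exit codes: 0 = no differences, 1 = differences,
    2 = config error. `diff` is a diff_rows-style pure function."""
    try:
        result = diff(before_rows, after_rows, key_columns=key_columns)
    except DuplicateKeyError as exc:
        print(f"csvdiff: {exc}", file=sys.stderr)
        return 2  # config error: wrong --key choice or broken data
    return 1 if result.has_differences else 0
```

The 0/1/2 split mirrors what diff itself does, so the tool composes with `if`/`||` in shell scripts and CI steps.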
The encoding-fallback reader
CSVs exported from Excel often start with a UTF-8 BOM (the famous \xef\xbb\xbf byte sequence). Plain utf-8 will decode those bytes as the unicode character U+FEFF and glue it to the first column name, so your header becomes \ufeffid,name,email and every lookup for column id fails. The fix is a small fallback loop:
```python
from pathlib import Path

def _decode(path: Path) -> str:
    """Try utf-8-sig (strips BOM if present) then plain utf-8."""
    raw = path.read_bytes()
    # utf-8-sig first: it strips a leading BOM if present, and otherwise
    # behaves exactly like utf-8. Pure utf-8 would happily decode the BOM
    # as U+FEFF and leave it glued to the first header name.
    for encoding in ("utf-8-sig", "utf-8"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ReaderError(f"{path}: could not decode as utf-8 or utf-8-sig")
```
I got this wrong on the first pass. I wrote ("utf-8", "utf-8-sig") because utf-8 seemed like the "base case" to try first. A test caught it: utf-8 succeeds on a BOM file; it just produces garbage. Always try the BOM-aware decoder first. No chardet, no autodetection magic. If it isn't utf-8 or utf-8-sig, it's a config error and the user should convert with iconv before piping into csvdiff.
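You can watch the failure mode with nothing but the stdlib:

```python
import csv
import io

# Excel-style export: BOM bytes in front of the header.
raw = "id,name\n42,Alice\n".encode("utf-8-sig")

# Plain utf-8 decodes "successfully" -- the BOM survives as U+FEFF glued
# to the first header name, so the "id" column silently disappears.
bad = raw.decode("utf-8")
assert bad.startswith("\ufeff")
assert next(csv.DictReader(io.StringIO(bad))).get("id") is None

# utf-8-sig strips the BOM; the same lookup now works.
good = raw.decode("utf-8-sig")
assert next(csv.DictReader(io.StringIO(good)))["id"] == "42"
```

No exception anywhere in the bad path, which is exactly why the ordering matters.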
The GitHub Actions formatter
This is the one where the data-structure-first design really paid off. GitHub Actions has a convention: lines starting with ::notice::, ::warning::, or ::error:: get parsed by the runner and turned into annotations on the job page. I wanted csvdiff to output those lines directly so a CI step could just be:
```yaml
- run: csvdiff --key id --format github reference.csv fresh.csv
```
Because DiffResult already knows everything, the formatter is twenty lines:
```python
def format_github(result: DiffResult, *, color: bool = False) -> str:
    del color  # GitHub's log renderer handles styling; the flag is ignored here
    key_cols = result.key_columns
    lines: list[str] = []
    for row in result.added:
        key_str = _format_key(tuple(row.get(k, "") for k in key_cols), key_cols)
        lines.append(f"::notice title=csvdiff::added {key_str}")
    for row in result.removed:
        key_str = _format_key(tuple(row.get(k, "") for k in key_cols), key_cols)
        lines.append(f"::notice title=csvdiff::removed {key_str}")
    for mod in result.modified:
        key_str = _format_key(mod.key, key_cols)
        changes = ", ".join(
            f"{c}: {mod.before.get(c, '')} -> {mod.after.get(c, '')}"
            for c in mod.changed_columns
        )
        lines.append(f"::warning title=csvdiff::modified {key_str} ({changes})")
    if not lines:
        lines.append("::notice title=csvdiff::no differences")
    return "\n".join(lines) + "\n"
```
Modified rows are ::warning:: and added/removed are ::notice:: because the reviewer usually wants their eye drawn to the changes. If the diff is empty, we emit a single informational notice so the run log isn't silent.
Tradeoffs I'm willing to defend
In-memory only. Both files are loaded into dicts. For 1M rows with a dozen columns this is a few hundred MB — fine. For a 50 GB CSV this is a disaster. The README says so explicitly. If you have those files, sort them by key and use comm. Streaming diff with unsorted input is fundamentally a different problem and I'm not going to pretend I'm solving both.
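For what it's worth, the sorted-input streaming version is a classic merge join. A sketch of my own (not part of csvdiff) of what that different problem looks like, assuming both inputs are already sorted ascending by the key column:

```python
def stream_diff(before, after, key):
    """Yield ('same'|'changed'|'removed'|'added', key_value, row) from two
    iterables of row dicts, BOTH pre-sorted ascending by row[key].
    Constant memory: only one row from each side is held at a time."""
    SENTINEL = object()
    a, b = iter(before), iter(after)
    ra, rb = next(a, SENTINEL), next(b, SENTINEL)
    while ra is not SENTINEL or rb is not SENTINEL:
        ka = ra[key] if ra is not SENTINEL else None
        kb = rb[key] if rb is not SENTINEL else None
        if rb is SENTINEL or (ra is not SENTINEL and ka < kb):
            yield ("removed", ka, ra)          # key only in "before"
            ra = next(a, SENTINEL)
        elif ra is SENTINEL or kb < ka:
            yield ("added", kb, rb)            # key only in "after"
            rb = next(b, SENTINEL)
        else:
            yield (("changed" if ra != rb else "same"), ka, ra)
            ra, rb = next(a, SENTINEL), next(b, SENTINEL)
```

Note it silently mis-classifies everything if the inputs aren't actually sorted, which is exactly the footgun the in-memory dict approach avoids.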
CSV schemas drift. If your columns rename themselves from week to week, csvdiff can't help you match name to full_name. The --case-insensitive-headers flag handles Name vs name, which is the 80% case. For actual column renames, you need a schema mapping layer, which is out of scope. CSVs are a problem domain that rewards boundaries.
No character-level diff inside modified cells. I show role: viewer → admin as whole before/after values, not as a character-by-character diff. Character diffs on arbitrary string values are noisy, and csvdiff is targeted at tabular data, not prose.
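If you want to see why, stdlib difflib on a single short cell change already produces roughly a line of output per character:

```python
import difflib

# Character-level diff of one cell change: role "viewer" -> "admin".
# The two values share almost no characters, so nearly every one is
# flagged -- pure noise at table scale.
out = list(difflib.ndiff("viewer", "admin"))
assert all(line[0] in "-+? " for line in out)
assert sum(1 for line in out if line[0] in "-+") >= 9
```

A whole-value before/after pair carries the same information in one line.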
Try it in 30 seconds
```shell
docker build -t csvdiff https://github.com/sen-ltd/csvdiff.git
mkdir -p /tmp/csvdemo && cd /tmp/csvdemo

cat > before.csv <<'EOF'
id,name,role
1,Alice,viewer
2,Bob,admin
3,Carol,editor
EOF

cat > after.csv <<'EOF'
id,name,role
1,Alice,admin
2,Bob,admin
4,Dave,viewer
EOF

docker run --rm -v "$PWD:/work" csvdiff --key id before.csv after.csv
docker run --rm -v "$PWD:/work" csvdiff --key id --format summary before.csv after.csv
docker run --rm -v "$PWD:/work" csvdiff --key id --format json before.csv after.csv
docker run --rm -v "$PWD:/work" csvdiff --key id --format github before.csv after.csv
```
You'll see the same row classified four different ways, which is exactly the point of building output formats as pure functions on top of a shared data structure.
Takeaway
When you reach for pandas to compare two CSVs, stop and ask whether you're paying for all of numpy to save 200 lines of Python. For row-level structural diffing — which is what 90% of "I need to compare these CSVs" tasks actually are — stdlib csv plus a dict keyed by a key-column tuple is enough. The whole tool is one data structure and four pure formatting functions on top of it. Separate the algorithm from the I/O and the rest writes itself.
Source, tests, and Dockerfile: https://github.com/sen-ltd/csvdiff
