SEN LLC

Posted on Apr 15

The similar crate is underrated: I wrote a colored diff CLI in a weekend

#rust #cli #diff #tutorial

A tiny Rust CLI — diff-rs — that takes two files and prints a readable
unified diff, with per-word inline highlights on modified lines and an
optional side-by-side view. Two dependencies (clap, similar), ~600 lines,
a 9 MB Docker image.

📦 GitHub: https://github.com/sen-ltd/diff-rs

The problem nobody actually likes admitting

I want a diff. Not a git diff. Not a review tool. Just: given two files that
happen to sit on disk, show me what's different in a way my eyes can parse in
under two seconds.

The tools that exist:

diff -u — the venerable POSIX one. Monochrome. You hunt for the - and + columns in a wall of text. It does the job if you're already reading carefully, but it's brutal at a glance.
git diff --no-index a b — nicer (color, pager, hunk headers) but it prints "fatal: not a git repository" unless you know to pass --no-index, and you still carry all of git along for the ride.
delta — beautiful, but it's a pager for git. Not the same shape.
difftastic — structural / AST-level diff, gorgeous output. Also ~20 MB and opinionated about what "different" means at the syntactic level.
colordiff — a Perl wrapper around diff that adds colors via regex substitution. Works, and I'm mildly impressed it still ships, but my muscle memory is tired of installing Perl on Alpine images in CI.

None of these are wrong. But none of them are "tiny standalone binary that
colors a plain file-to-file diff nicely." So I wrote one, mostly to see what
the floor looks like when you lean on a good diff crate. The floor turned
out to be closer than I expected, because the similar crate does 90% of
the work.

Why `similar` is the interesting part

There are a lot of diff libraries in the Rust ecosystem. Most of them give you
"LCS on &str" and leave you to figure out the rest. Mitsuhiko's
similar is different. It ships:

Myers, Patience, and LCS algorithms, all pure Rust, no C deps.
TextDiff::from_lines, from_words, from_chars, from_graphemes — which means you can diff at whatever granularity you want without writing the tokenizer yourself.
An iteration API (iter_all_changes) that yields individual changes, each carrying a ChangeTag (Equal, Delete, Insert) and its text slice. No abstract cursor, no callbacks, no trait objects to wrestle with.
Unicode-aware word splitting (from_unicode_words), available behind the crate's optional unicode feature.
A sensible default (Myers) plus an optional deadline, so pathologically different inputs can give up and approximate before burning your CPU.

The whole API surface that diff-rs uses is maybe four types. Here's the
core wrapper in src/diff.rs, simplified:

use similar::{ChangeTag, TextDiff};

pub fn compute(old: &str, new: &str, opts: &DiffOptions) -> Hunks {
    let a = prepare(old, opts); // normalized lines + line numbers
    let b = prepare(new, opts);

    let a_text = format!("{}\n", a.keys.join("\n"));
    let b_text = format!("{}\n", b.keys.join("\n"));
    let diff = TextDiff::from_lines(&a_text, &b_text);

    let mut flat: Vec<LineChange> = Vec::new();
    let (mut ai, mut bi) = (0usize, 0usize);

    for change in diff.iter_all_changes() {
        match change.tag() {
            ChangeTag::Equal => {
                flat.push(equal(&a, &b, ai, bi));
                ai += 1; bi += 1;
            }
            ChangeTag::Delete => {
                flat.push(delete(&a, ai));
                ai += 1;
            }
            ChangeTag::Insert => {
                flat.push(insert(&b, bi));
                bi += 1;
            }
        }
    }

    group_into_hunks(flat, opts.context)
}

That's the entire diff engine: build the text, call TextDiff::from_lines,
iterate, append to a flat vector. Everything after that is rendering.
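The listing ends in group_into_hunks, which the simplified version doesn't show. As a rough idea of what that kind of step involves, here is a minimal std-only sketch. It assumes the flat list has been reduced to one "changed?" flag per line and returns half-open index ranges; the name group_into_ranges and the bool input are hypothetical stand-ins, not diff-rs's actual types.

```rust
/// Group a flat per-line "changed?" list into hunks, keeping `context`
/// unchanged lines around every change. Returns half-open (start, end) ranges.
/// Hypothetical helper for illustration; not the diff-rs implementation.
fn group_into_ranges(changed: &[bool], context: usize) -> Vec<(usize, usize)> {
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    for (i, &is_changed) in changed.iter().enumerate() {
        if !is_changed {
            continue;
        }
        // Window of context around this change, clamped to the input.
        let start = i.saturating_sub(context);
        let end = (i + context + 1).min(changed.len());
        if let Some(last) = ranges.last_mut() {
            // Overlaps or touches the previous hunk: extend it instead.
            if start <= last.1 {
                last.1 = end;
                continue;
            }
        }
        ranges.push((start, end));
    }
    ranges
}

fn main() {
    let mut changed = vec![false; 30];
    changed[5] = true;
    changed[10] = true;
    changed[25] = true;
    // Lines 5 and 10 are close enough to share one hunk; 25 gets its own.
    println!("{:?}", group_into_ranges(&changed, 3)); // prints [(2, 14), (22, 29)]
}
```

Merging hunks whose context windows touch is what keeps two nearby edits from printing the same context lines twice.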

One real footgun I hit

from_lines splits on \n, which is fine — except for the very last line
in each file. If one file ends with a newline and the other doesn't, the
final line appears "different" even though it contains the same characters.
My first test run failed on "a\nb\nc\n" vs "a\nb\n" because similar saw
the last line of the old side as "c\n" and had no corresponding line on
the new side, which is correct but weird to reason about when you're pairing
lines up by position.

The fix: normalize both sides into a canonical form — split into logical
lines first, join back together with an explicit trailing newline on both.
Costs me one format!() per side and makes the cursor bookkeeping downstream
a lot simpler.
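A minimal sketch of that normalization, using only the standard library. The function name normalize is an assumption for illustration; the prepare function in diff-rs does more (it also tracks line numbers).

```rust
/// Re-join logical lines so every line, including the last, ends in '\n'.
/// Note: str::lines also drops a trailing '\r', so CRLF input is
/// normalized to LF as a side effect.
fn normalize(text: &str) -> String {
    let mut out = String::with_capacity(text.len() + 1);
    for line in text.lines() {
        out.push_str(line);
        out.push('\n');
    }
    out
}

fn main() {
    // With and without a final newline now produce identical text.
    assert_eq!(normalize("a\nb\nc"), normalize("a\nb\nc\n"));
    println!("{:?}", normalize("a\nb\nc")); // prints "a\nb\nc\n"
}
```

After this step, a missing final newline can no longer show up as a changed last line, at the price of not being able to report that difference at all.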

The "modified block" pairing problem

similar's raw output gives you a flat stream: Equal, Delete, Delete,
Insert, Insert, Equal, … A classic unified diff renders that stream
verbatim. A colored diff that wants to show per-word highlights has to
pair up the Deletes and Inserts to run a second word-level diff on each
pair. Which delete goes with which insert?

My first version was naive: "if line[i] is Delete and line[i+1] is
Insert, treat them as a pair." This is wrong the moment you have two
deletions followed by two insertions. The algorithm above would:

See line[0] = Delete, line[1] = Delete → no pair, emit -line0.
See line[1] = Delete, line[2] = Insert → pair! Emit -line1 / +line2 with word overlay.
See line[3] = Insert → no delete ahead, emit +line3.

So line[1] (the second deletion) gets paired with line[2] (the first
insertion), when the intended pairs are line[0]/line[2] and line[1]/line[3].
Everything is skewed by one position, so the output word-diffs unrelated
lines and the highlights are junk.

The fix is to pair runs, not individual lines:

LineKind::Delete if opts.word_diff => {
    let del_start = i;
    while i < n && hunk.lines[i].kind == LineKind::Delete { i += 1; }
    let ins_start = i;
    while i < n && hunk.lines[i].kind == LineKind::Insert { i += 1; }

    let del_count = ins_start - del_start;
    let ins_count = i - ins_start;
    let pairs = del_count.min(ins_count);

    for k in 0..pairs {
        let del = &hunk.lines[del_start + k];
        let ins = &hunk.lines[ins_start + k];
        let (dl, il) = word_highlight(&del.content, &ins.content, &p);
        writeln!(out, "{}-{}{}", p.deletion(), dl, p.reset())?;
        writeln!(out, "{}+{}{}", p.addition(), il, p.reset())?;
    }
    // …leftover unpaired deletes and inserts fall through to plain +/- lines
}

Scan forward to find the full run of Deletes, then the full run of Inserts,
then pair them one-to-one by position. Any leftover (unpaired deletes or
inserts) gets rendered as plain - / + lines without the word overlay.
This is similar in spirit to git diff --word-diff, which also confines its
word-level comparison to runs of changed lines.
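The run-pairing rule is easier to check in isolation. Below is a minimal std-only sketch that works on a slice of line kinds and returns index pairs; the Kind enum and pair_runs are hypothetical stand-ins for LineKind and the rendering loop, and Equal lines are simply skipped.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Equal,
    Delete,
    Insert,
}

/// For each run of Deletes followed by a run of Inserts, pair them by
/// position. Returns (delete_index, insert_index); None marks the side
/// that ran out. Equal lines produce no row in this sketch.
fn pair_runs(kinds: &[Kind]) -> Vec<(Option<usize>, Option<usize>)> {
    let mut rows = Vec::new();
    let mut i = 0;
    while i < kinds.len() {
        if kinds[i] == Kind::Equal {
            i += 1;
            continue;
        }
        // Consume the whole run of deletes, then the whole run of inserts.
        let del_start = i;
        while i < kinds.len() && kinds[i] == Kind::Delete {
            i += 1;
        }
        let ins_start = i;
        while i < kinds.len() && kinds[i] == Kind::Insert {
            i += 1;
        }
        let (dels, inss) = (ins_start - del_start, i - ins_start);
        for k in 0..dels.max(inss) {
            let d = if k < dels { Some(del_start + k) } else { None };
            let n = if k < inss { Some(ins_start + k) } else { None };
            rows.push((d, n));
        }
    }
    rows
}

fn main() {
    use Kind::*;
    // Two deletions followed by two insertions: the case the naive version skews.
    let rows = pair_runs(&[Equal, Delete, Delete, Insert, Insert, Equal]);
    println!("{:?}", rows); // prints [(Some(1), Some(3)), (Some(2), Some(4))]
}
```

Rows with a None on one side correspond to the leftover lines that get rendered without the word overlay.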

The word overlay

Once you have a (delete_line, insert_line) pair, the word highlight is
almost trivial because similar will do it again for you, this time on
words:

use std::fmt::Write; // lets write! target a String

fn word_highlight(old: &str, new: &str, p: &Palette) -> (String, String) {
    let diff = TextDiff::from_words(old, new);
    let (mut del, mut ins) = (String::new(), String::new());
    for change in diff.iter_all_changes() {
        let text = change.value();
        match change.tag() {
            ChangeTag::Equal => {
                del.push_str(text);
                ins.push_str(text);
            }
            ChangeTag::Delete => {
                write!(del, "{}{}{}", p.word_deletion(), text, p.reset()).unwrap();
                del.push_str(p.deletion()); // re-apply line color after reset
            }
            ChangeTag::Insert => {
                write!(ins, "{}{}{}", p.word_addition(), text, p.reset()).unwrap();
                ins.push_str(p.addition());
            }
        }
    }
    (del, ins)
}

Two things worth calling out:

We iterate the word-level diff once, writing into two separate buffers. Equal words go into both; deletions go only into del; insertions only into ins. This is how one call to TextDiff::from_words produces the two sides of the rendered output.
Re-applying the line color after every reset. ANSI SGR codes don't nest, and \x1b[0m turns everything off, so after the inner highlight escape sequence ends we have to re-emit the outer red / green so the rest of the line keeps its color. A small gotcha, and one you won't notice unless you test with a real terminal emulator.
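To make the reset gotcha concrete, here is a small std-only sketch that builds one deleted line with a highlighted word. The bold-red code used for the inner highlight is an assumption; the article doesn't say which escape word_deletion() actually returns.

```rust
const RESET: &str = "\x1b[0m";
const LINE_RED: &str = "\x1b[31m";
// Assumption for illustration: bold red marks the changed word.
const WORD_RED: &str = "\x1b[1;31m";

/// Render "-{before}{changed}{after}" as a red line whose `changed` part
/// is highlighted. The reset after the word clears all attributes,
/// so the line color has to be emitted again before `after`.
fn deleted_line(before: &str, changed: &str, after: &str) -> String {
    [
        LINE_RED, "-", before,
        WORD_RED, changed, RESET,
        LINE_RED, after, RESET,
    ]
    .concat()
}

fn main() {
    let line = deleted_line("let x = ", "1", ";");
    // Debug formatting shows the escape bytes instead of coloring the output.
    println!("{:?}", line);
    println!("{}", line);
}
```

Dropping the second LINE_RED leaves the trailing ";" in the terminal's default color, which is exactly the bug described above.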

The Palette pattern

Color handling is the kind of thing that accidentally grows into a crate. I
wanted zero deps for this, so I reused the Palette pattern from a previous
project (hexview):

#[derive(Clone, Copy)]
pub struct Palette { enabled: bool }

impl Palette {
    pub fn addition(&self) -> &'static str {
        if self.enabled { "\x1b[32m" } else { "" }
    }
    pub fn deletion(&self) -> &'static str {
        if self.enabled { "\x1b[31m" } else { "" }
    }
    // … etc
}

The trick is that every method returns a &'static str, not an Option or
a Cow or a newtype. When color is disabled the returned string is literally
empty, so call sites like write!(out, "{}-{}{}", p.deletion(), text, p.reset())
work unconditionally. There is no allocation and no special-casing at the
call site; the only cost is one branch on enabled per call.

Detection follows the convention the ecosystem has converged on:

--no-color CLI flag wins.
NO_COLOR env var (per no-color.org) forces off.
FORCE_COLOR / CLICOLOR_FORCE force on (useful for CI, freeze, script, pipelines).
Otherwise: check stdout().is_terminal().

I ran into the need for FORCE_COLOR about fifteen minutes after shipping
the first build, while trying to take the screenshot at the top of this
article. freeze --execute runs the CLI with a piped stdout, so without
the override everything was uncolored.
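The detection order can be sketched as one pure function plus a thin layer that reads the environment. This is a std-only approximation, not the diff-rs source: the real tool parses --no-color with clap, and the helper names should_color and env_is_set are assumptions. Treating an empty variable as unset follows the no-color.org wording for NO_COLOR; applying the same rule to the force variables is a simplification.

```rust
use std::env;
use std::io::{self, IsTerminal};

/// Precedence: --no-color, then NO_COLOR, then the force variables,
/// then whether stdout is a terminal.
fn should_color(no_color_flag: bool, no_color_env: bool, force_env: bool, is_tty: bool) -> bool {
    if no_color_flag || no_color_env {
        return false;
    }
    force_env || is_tty
}

/// True if the variable exists and is not empty.
fn env_is_set(name: &str) -> bool {
    env::var_os(name).map_or(false, |v| !v.is_empty())
}

fn main() {
    // Hand-rolled flag scan to keep the sketch dependency-free.
    let no_color_flag = env::args().any(|a| a == "--no-color");
    let enabled = should_color(
        no_color_flag,
        env_is_set("NO_COLOR"),
        env_is_set("FORCE_COLOR") || env_is_set("CLICOLOR_FORCE"),
        io::stdout().is_terminal(),
    );
    println!("color enabled: {}", enabled);
}
```

Keeping the decision in a pure function makes the precedence testable without touching the process environment or needing a real TTY.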

Side-by-side: it's mostly alignment math

let col = opts.width.saturating_sub(3) / 2;  // width - " | " gutter
let col = col.max(10);                        // don't render uselessly narrow

for row in pair_rows(hunk) {
    writeln!(
        out,
        "{left_color}{:<width$}{reset} | {right_color}{:<width$}{reset}",
        left_text, right_text, width = col, …
    )?;
}

pair_rows walks the hunk and produces parallel rows: an Equal line
becomes (equal, equal), a Delete immediately followed by an Insert
becomes (delete, insert) on the same row (the classic "modified" case),
an unmatched delete becomes (delete, None) and an unmatched insert becomes
(None, insert). Long content gets truncated at char boundaries with a
trailing ….

That's genuinely the whole feature. The code reads like what the output
looks like.
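The two pieces of arithmetic involved, column width and truncation, fit in a few lines. A std-only sketch follows; column_width mirrors the expression in the snippet above, while fit is a hypothetical helper name. It counts chars rather than terminal display width, so wide glyphs (CJK, emoji) would still misalign.

```rust
/// Width of each column: total width minus the " | " gutter, halved,
/// with a floor of 10 so columns never get uselessly narrow.
fn column_width(total: usize) -> usize {
    (total.saturating_sub(3) / 2).max(10)
}

/// Pad `text` to `width` chars, or cut it at a char boundary and end
/// with '…' when it is too long. Assumes width >= 1.
fn fit(text: &str, width: usize) -> String {
    if text.chars().count() <= width {
        return format!("{:<width$}", text, width = width);
    }
    // Keep width - 1 chars so the ellipsis fits inside the column.
    let mut out: String = text.chars().take(width.saturating_sub(1)).collect();
    out.push('…');
    out
}

fn main() {
    let col = column_width(120);
    println!("{}", col); // prints 58
    println!("{} | {}", fit("héllo wörld", 6), fit("abc", 6));
}
```

Iterating over chars instead of slicing bytes is what keeps the cut from landing in the middle of a multi-byte UTF-8 sequence.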

What I didn't build

A portfolio piece isn't a product, and diff-rs has hard edges on purpose:

Whole files are loaded into memory. Myers diffing a 4 GB log file is never going to be fun; diff-rs just doesn't pretend to support it.
No --patience / --histogram. similar actually ships Patience, I just didn't wire it up. Adding a flag is ten lines.
No directory diff. diff-rs a b expects two files. Walking a tree is a different project.
No structural diff. If you want "if block moved, field renamed, bodies identical," use difftastic. diff-rs is a textual diff.
No pager integration. Output goes to stdout; pipe into less -R if you want paging. The tool is small enough that you can and should compose it.

These are not TODOs. They're the reasons the whole thing fits in 600 lines
and a 9 MB image.

Try it in 30 seconds

git clone https://github.com/sen-ltd/diff-rs
cd diff-rs
docker build -t diff-rs .
docker run --rm -v "$PWD:/work" diff-rs /work/src/diff.rs /work/src/render.rs \
  --word-diff

Or, if you have a Rust toolchain:

cargo install --path .
diff-rs old.txt new.txt --word-diff
diff-rs old.txt new.txt --format side-by-side --width 120
diff-rs old.txt new.txt --format json | jq '.hunks | length'

33 tests (cargo test), two real dependencies, release profile is
strip + lto + codegen-units=1 + opt-level="z" + panic="abort". The
stripped binary is about 600 KB; the Alpine runtime image is 9 MB.

Closing

Entry #162 in a 100+ portfolio series by
SEN LLC. Previous Rust CLIs in this series that
reused the Palette pattern:

hexview — the colored hex dump that the Palette struct in diff-rs is lifted from.
sqlite-stats — a zero-config SQLite inspector CLI.

If you've been reaching for diff -u out of habit, try swapping it for a
tool whose default output you don't have to squint at. Feedback welcome.
