DEV Community

Cover image for How Diff Algorithms Actually Work (And Why Yours Might Be Wrong)
Michael Lip
Michael Lip

Posted on • Edited on • Originally published at zovo.one

How Diff Algorithms Actually Work (And Why Yours Might Be Wrong)

Every developer uses diff daily through git. But few understand the algorithm that powers it, and even fewer know why different diff tools produce different outputs for the same input.

The diff problem is formally stated as: given two sequences, find the longest common subsequence (LCS). The diff is everything not in the LCS. This is an NP-hard problem in the general case, but practical algorithms solve it efficiently for text.

The Myers diff algorithm

Git uses a variation of the Myers diff algorithm (1986). The core idea is to find the shortest edit script -- the minimum number of insertions and deletions to transform sequence A into sequence B.

Myers models this as a graph search problem. Imagine a grid where the X-axis represents elements of sequence A and the Y-axis represents elements of sequence B. Moving right is a deletion (remove from A). Moving down is an insertion (add from B). Moving diagonally means the elements match (no edit needed).

The shortest edit script is the path from (0,0) to (N,M) that maximizes diagonal moves (matches) and minimizes horizontal and vertical moves (edits).

    c  a  t
  +--------
d | \
o |   \
g |     \
Enter fullscreen mode Exit fullscreen mode

The algorithm uses a breadth-first approach, exploring paths with 0 edits, then 1 edit, then 2, until it finds a complete path. This guarantees minimality.

Why diffs differ

The LCS is not always unique. Two texts might have multiple valid LCSs of the same length, producing different (but equally valid) diffs. Different tools make different choices about which LCS to prefer.

Consider:

Old: A B C D
New: B C D A
Enter fullscreen mode Exit fullscreen mode

One valid diff: delete A from position 0, insert A at position 4 (1 deletion + 1 insertion).
Another valid diff: delete A from position 0, insert B C D before A (1 deletion + 3 insertions).

Both are correct. The first is shorter. But sometimes the longer diff is more readable because it preserves the visual structure that humans expect.

Line-level vs. word-level vs. character-level

Most diff tools operate at the line level. Two lines are either identical or different. This works well for code but poorly for prose.

If you change one word in a long paragraph, a line-level diff shows the entire paragraph as changed. A word-level diff highlights just the changed word. A character-level diff highlights the specific characters.

Implementing word-level diff means tokenizing lines into words and running the diff algorithm on word tokens instead of line strings. Character-level diff is the same concept at a finer granularity. The trade-off is precision vs. noise -- character-level diffs on code can be visually overwhelming.

The best approach is hierarchical: line-level diff first, then word-level diff within changed lines. This is what GitHub's pull request view does.

Practical diff uses beyond git

Contract review. Comparing two versions of a legal document to find every change. Word-level diff is essential here because line boundaries are meaningless in flowing prose.

Configuration comparison. Comparing two config files to find discrepancies. This is common in debugging environment-specific issues.

Database migration verification. Diffing the expected schema against the actual schema to catch drift.

API response comparison. Comparing expected vs. actual API responses in tests. JSON diff tools that understand structure (ignoring key ordering, for example) are more useful than raw text diff for this.

Whitespace and normalization

Should foo bar (two spaces) diff against foo bar (one space)? In code, whitespace is often significant. In prose, it usually is not. Diff tools need configurable whitespace handling:

  • Strict: all whitespace differences are shown
  • Ignore trailing: trailing whitespace changes are hidden
  • Ignore all whitespace: any whitespace change is hidden
  • Normalize: collapse all whitespace to single spaces before comparing

Git's --ignore-space-change and --ignore-all-space flags handle this, but web-based diff tools often lack these options.

The tool

I built a text diff checker at zovo.one/free-tools/text-diff-checker that provides side-by-side and inline diff views with line-level and word-level highlighting. It handles whitespace normalization options and supports large texts without performance issues. Paste two texts, see every difference highlighted.

I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.

Top comments (0)