How `git diff` Actually Works: The Myers Algorithm in Plain English

#git #algorithms #computerscience #beginners

You run git diff dozens of times a day. You read the red and green lines, you stage the hunks, you move on. But there's a small algorithmic miracle happening every time: out of the astronomically many ways to describe "how file A became file B," the tool quietly finds one of the shortest ones.

That's not a formatting trick. It's a shortest-path search. Once you see it, code review, merge conflicts, and "why did git think I moved this whole block?" all start to make sense.

A diff is the shortest edit script

Forget the colored output. A diff between two sequences of lines — A (length N) and B (length M) — is an edit script: a list of insertions and deletions that turns A into B. There are always many valid scripts. The dumbest one is "delete all of A, insert all of B" — technically correct, completely useless.

What you actually want is the script with the fewest edits, because the lines you didn't touch are the ones that stayed the same. So the real goal is the mirror image:

Find the longest common subsequence (LCS) of A and B. Everything in it is "unchanged"; everything else is an insert or a delete.

The two framings are the same problem. If the LCS has length L, the minimum number of edits is exactly N + M - 2L. Maximize the matches, minimize the edits. That number — the edit distance — is what a diff is really computing.

Why the obvious approaches fail

The first instinct is line-by-line: walk both files in lockstep, mark lines that differ. That breaks the instant you insert a single line near the top — everything below shifts by one and the naive walk reports the entire rest of the file as changed. Useless for code review.

The textbook fix is the LCS dynamic-programming table: an (N+1) × (M+1) grid, fill it in, backtrack. It's correct, but it's O(N × M) time and memory. For two 10,000-line files that's 100 million cells. Git can't afford that on every status.

The insight that makes real diffs practical: edited files are usually similar. Most of the lines match. So instead of paying for the whole grid, pay only for the changes.

Myers: a diff is a shortest path through an edit graph

The algorithm Git uses by default is Eugene Myers' 1986 diff — and the whole thing rests on one reframing. Lay file A along the top of a grid and file B down the side. Now you're navigating from the top-left corner (0,0) to the bottom-right (N,M) with three moves:

→  move right   = delete a line from A
↓  move down    = insert a line from B
↘  move diagonal = the two lines match (FREE)

A diagonal move is only allowed when A[x] == B[y], and — this is the key — it's free. Right and down moves each cost 1. So:

The minimum diff is the cheapest path from corner to corner, and "cheapest" means "takes the most free diagonals."

A diff is a shortest-path problem. The long diagonal runs — Myers calls them snakes — are your unchanged blocks. The horizontal and vertical steps between them are the hunks you see in red and green.

The trick: search by number of edits, not by grid size

Myers doesn't fill the grid. He does a breadth-first search ordered by D = the number of edits so far (D = 0, 1, 2, …), and at each level tracks only the furthest-reaching path on each diagonal k = x - y. Greedily slide down every free diagonal you can before spending another edit. The first time a path reaches (N,M), the D that got it there is the edit distance, and the snakes along the way are the diff.

D = 0:  try to snake straight down the main diagonal from (0,0)
D = 1:  spend one edit (one right OR one down), then snake as far as possible
D = 2:  spend two edits, snake again
...
stop the moment a path touches (N, M)

Because it stops at the true edit distance D, the cost is O(N × D) — proportional to the file size times the number of changes, not the product of the two file sizes. When D is small (the normal case: a few edited lines in a big file), it's effectively linear. When you paste a wildly different file, D blows up and it degrades gracefully toward the DP cost. That "fast when similar" property is exactly what you want from a tool that runs on every keystroke in your editor's gutter.

Where the same idea shows up

Once a diff is "shortest path that maximizes free diagonals," you spot the pattern everywhere:

git diff, git blame, merge: all built on this edit-script core. The three-way merge that resolves your conflicts is two diffs against a common ancestor, reconciled.
Code review tools (GitHub, GitLab): the side-by-side view is the edit graph rendered as two columns; the connecting highlights are the snakes.
Text/JSON/config comparison: comparing two API responses or two YAML files is the same LCS problem on lines or tokens. (Pro tip: pretty-print or sort keys on both sides first — a noisy diff is usually just two differently-formatted versions of identical data. Running each side through a JSON formatter before comparing collapses the diff down to the changes that actually matter.)
rsync / binary deltas: same shortest-edit goal, different granularity (blocks instead of lines).
Bioinformatics: sequence alignment of DNA is LCS with weighted moves. Same grid, fancier costs.

The catch: a minimal diff isn't always a readable one

Here's the part that bites people. Myers finds a shortest edit script — but "fewest edits" and "what a human meant" aren't always the same. Move a function and the minimal diff might pair the closing } of your function with the } of the next one, producing that maddening hunk where braces and signatures are interleaved nonsense. The algorithm isn't wrong; minimizing edit count just doesn't know your code has structure. (This is why Git ships a --diff-algorithm=patience and histogram — different heuristics for choosing which minimal-ish diff reads best.)

So when a diff looks insane, it's not a bug. It's the shortest path honestly disagreeing with your sense of meaning. Switching algorithms, or eyeballing it in a clean side-by-side view that lets you flip between unified and split layouts, usually makes the human-intended change pop back out.

The takeaway

Next time you read a diff, don't think "it compared the lines." Think "it found the cheapest path through an edit graph, taking every free match it could." The green and red are the edits; the silent unchanged lines are the snakes the algorithm worked hard to keep. A diff isn't a comparison — it's a shortest-path proof that these few changes are all it took to get from A to B.

That's the whole idea behind the tool you trust with every commit. Not magic — just a very clever way of counting halts on a grid.

If you want to watch the snakes and edits line up on real input, I keep a free, privacy-first diff checker that runs entirely in the browser (no upload) with unified and side-by-side views — handy for eyeballing exactly which lines the algorithm called "changed."