The Algorithm Hiding Behind git diff
You've run git diff hundreds of times.
Red lines. Green lines. Done.
But have you ever stopped and asked — what algorithm is actually doing that?
It turns out to be one of the classic problems in computer science: Longest Common Subsequence (LCS). And it's been hiding inside your terminal every single day.
In this article, we'll explore how Git-style diffing works, why LCS is the right mental model, how the actual algorithm Git uses (Myers diff) connects to it, and what tradeoffs real tools make when choosing a diff algorithm.
This is the first article in my series "DSA Application in Real Life" — where I explore how common data structures and algorithms power the tools developers use every day.
The Problem Git Is Solving
Imagine we have an old version of a file:
function add(a, b) {
return a + b;
}
Then we update it:
function addNumbers(a, b) {
return a + b;
}
When we run git diff, Git shows:
-function add(a, b) {
+function addNumbers(a, b) {
return a + b;
}
This looks obvious to us as humans. Only the function name changed.
But Git does not "understand" JavaScript the way we do. At the diffing level, Git treats the file as a sequence of lines. Its job is to compare two sequences and decide:
- Which lines stayed the same?
- Which lines were deleted?
- Which lines were added?
This is a sequence comparison problem — and that's exactly where LCS comes in.
Why Simple Line-by-Line Comparison Is Not Enough
A beginner might think Git just compares files line by line:
Old line 1 vs New line 1
Old line 2 vs New line 2
Old line 3 vs New line 3
This works only when changes happen at the same position. But real code changes are rarely that simple.
Consider this old file:
login()
validate()
save()
logout()
Now we insert one new line:
login()
checkPermission()
validate()
save()
logout()
A naive line-by-line comparison would produce:
Old: login()      New: login()            same
Old: validate()   New: checkPermission()  different
Old: save()       New: validate()         different
Old: logout()     New: save()             different
Old: (nothing)    New: logout()           added
That makes it look like almost the entire file changed — which is completely wrong. Only one line was added.
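To make the failure concrete, here is a minimal sketch of that positional comparison. The function name naive_compare is made up for illustration; it is the flawed approach, not anything Git does.

```python
from itertools import zip_longest


def naive_compare(old, new):
    """Compare two lists of lines strictly by position (the flawed approach)."""
    report = []
    # zip_longest pads the shorter file with None so no line is dropped.
    for old_line, new_line in zip_longest(old, new):
        if old_line == new_line:
            status = "same"
        elif old_line is None:
            status = "added"
        elif new_line is None:
            status = "deleted"
        else:
            status = "different"
        report.append((status, old_line, new_line))
    return report


old = ["login()", "validate()", "save()", "logout()"]
new = ["login()", "checkPermission()", "validate()", "save()", "logout()"]

for status, old_line, new_line in naive_compare(old, new):
    print(f"{status:10} old={old_line!r} new={new_line!r}")
```

A single inserted line shifts every later position, so three untouched lines are reported as different.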
A smarter approach doesn't compare by position. It finds what's common between the two files first. That's the LCS idea.
LCS: The Mental Model Behind Diffing
LCS stands for Longest Common Subsequence.
A subsequence means you can pick elements from a sequence while keeping their relative order — but they don't need to be adjacent.
Example:
Old = [A, B, C, D]
New = [A, C, E, D]
The longest common subsequence is [A, C, D] — because A, C, and D appear in both sequences in the same order.
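The "same order, not necessarily adjacent" rule is easy to check in code. The helper below is a small sketch (is_subsequence is an illustrative name) that confirms [A, C, D] is a subsequence of both lists.

```python
def is_subsequence(candidate, sequence):
    """Return True if `candidate` can be read off `sequence` left to right."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is enforced:
    # each item must be found after the previous match.
    return all(item in remaining for item in candidate)


old = ["A", "B", "C", "D"]
new = ["A", "C", "E", "D"]

print(is_subsequence(["A", "C", "D"], old))
print(is_subsequence(["A", "C", "D"], new))
print(is_subsequence(["C", "A"], old))  # right elements, wrong order
```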
Applied to file diffing, the lines of each file become the sequences:
Old = [login(), validate(), save(), logout()]
New = [login(), checkPermission(), validate(), save(), logout()]
LCS = [login(), validate(), save(), logout()]
Now Git can reason:
- These lines are common → unchanged
- checkPermission() is only in the new file → added
Result:
login()
+checkPermission()
validate()
save()
logout()
That's the core idea.
The Actual LCS Algorithm (with Code)
Here's the classic dynamic programming solution you've likely seen in competitive programming:
def lcs_length(A, B):
    m, n = len(A), len(B)
    # dp[i][j] = LCS length of A[:i] and B[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i-1] == B[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]
For Old = [A, B, C, D] (passed as A, the rows) and New = [A, C, E, D] (passed as B, the columns), the DP table looks like:
      ""  A   C   E   D
""  [ 0   0   0   0   0 ]
A   [ 0   1   1   1   1 ]
B   [ 0   1   1   1   1 ]
C   [ 0   1   2   2   2 ]
D   [ 0   1   2   2   3 ]
The answer is dp[4][4] = 3, so the LCS has length 3: [A, C, D].
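lcs_length returns only the number 3. To get the subsequence itself, one common approach is to fill the same table and then walk it backwards from the bottom-right corner. The sketch below does that; the name lcs and the tie-breaking rule are choices made for this example.

```python
def lcs(a, b):
    """Return one longest common subsequence of lists a and b."""
    m, n = len(a), len(b)
    # Same table as lcs_length: dp[i][j] = LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner, collecting matched elements.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1  # the value came from the row above
        else:
            j -= 1  # the value came from the column to the left
    result.reverse()
    return result


print(lcs(["A", "B", "C", "D"], ["A", "C", "E", "D"]))
```

When several subsequences share the maximum length, the tie-breaking rule decides which one is returned; any of them is a valid LCS.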
Time complexity: O(m × n)
Space complexity: O(m × n)
For large files, this gets expensive — which is why Git doesn't use this directly.
How LCS Builds the Diff
Once you know the LCS, building the diff is straightforward:
- Lines in the LCS → unchanged
- Lines in old but not in LCS → deleted (prefix with -)
- Lines in new but not in LCS → added (prefix with +)
Example:
Old = [A, B, C, D]
New = [A, C, E, D]
LCS = [A, C, D]
B is only in Old → deleted
E is only in New → added
Diff output:
A
-B
C
+E
D
This is the basic shape of what Git, GitHub pull requests, VS Code comparison, and merge tools show: unchanged lines, deleted lines, and added lines.
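The three rules above can be sketched in a few lines on top of the DP table. This is an illustration of the LCS-based idea, not Git's implementation; lcs_diff is a made-up name, and the output uses a leading space for unchanged lines, as unified diffs do.

```python
def lcs_diff(old, new):
    """Build diff lines (' ' unchanged, '-' deleted, '+' added) from the LCS table."""
    m, n = len(old), len(new)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old[i - 1] == new[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the end; every step emits exactly one diff line.
    lines = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            lines.append(" " + old[i - 1])  # part of the LCS: unchanged
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            lines.append("+" + new[j - 1])  # only in new: added
            j -= 1
        else:
            lines.append("-" + old[i - 1])  # only in old: deleted
            i -= 1
    lines.reverse()
    return lines


for line in lcs_diff(["A", "B", "C", "D"], ["A", "C", "E", "D"]):
    print(line)
```

Because the walk runs backwards and prefers additions on ties, the reversed output lists deletions before additions within a changed block, which is the ordering diff tools conventionally use.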
Does Git Actually Use Textbook LCS?
Not directly. Git's default algorithm is Myers diff — and it solves a slightly different (but deeply related) problem called the Shortest Edit Script.
What is the minimum number of insertions and deletions needed to transform the old file into the new file?
The connection to LCS is direct:
LCS finds what is common.
Shortest Edit Script finds what changed.
If LCS is long → fewer edits needed.
If LCS is short → more edits needed.
They are two sides of the same coin.
So when we say "Git uses LCS-based diffing," the accurate meaning is:
Git's diffing is based on sequence comparison ideas rooted in LCS, but its default implementation uses Myers' shortest edit script algorithm — which is faster in practice.
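The two-sides-of-the-same-coin relationship can be written down exactly. When only insertions and deletions are allowed, every line outside the LCS costs one edit, so the shortest edit script has m + n − 2 × LCS operations. The sketch below checks that on the article's examples; it also keeps only two rows of the table, which is enough when just the length is needed. Function names are illustrative.

```python
def lcs_length(a, b):
    """LCS length using two rows of the DP table instead of the full grid."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def shortest_edit_script_length(old, new):
    """Insertions plus deletions needed to turn old into new."""
    # Every old line outside the LCS is deleted, every new line outside it is added.
    return len(old) + len(new) - 2 * lcs_length(old, new)


print(shortest_edit_script_length(["A", "B", "C", "D"], ["A", "C", "E", "D"]))
```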
How Myers Diff Actually Works (Simplified)
Myers models the diff problem as a graph search.
Imagine a grid where:
- The X-axis represents lines of the old file
- The Y-axis represents lines of the new file
- Moving right = delete a line from the old file
- Moving down = insert a line from the new file
- Moving diagonally = lines match (no edit needed)
For Old = [A, B, C, D] and New = [A, C, E, D]:
     A    B    C    D
  ┌────┬────┬────┬────┐
A │ ╲  │    │    │    │
  ├────┼────┼────┼────┤
C │    │    │ ╲  │    │
  ├────┼────┼────┼────┤
E │    │    │    │    │
  ├────┼────┼────┼────┤
D │    │    │    │ ╲  │
  └────┴────┴────┴────┘
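Myers finds the path from the top-left corner (0, 0) to the bottom-right corner (m, n), where m and n are the line counts of the old and new files, that uses the fewest right and down moves. Diagonals are free (matched lines), so this is also the path with the most diagonal moves, and its right and down moves form the shortest edit script.
Time complexity: O((m + n) × d), where d is the size of the shortest edit script (deleted plus inserted lines)
Space complexity: O(m + n), using the linear-space refinement described in Myers' paper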
This is much faster than O(m × n) LCS for files that are mostly similar — which is the common case in real codebases.
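Here is a simplified sketch of the greedy forward search at the heart of Myers' algorithm. It returns only d, the number of edits, and does not reconstruct the path; Git's real implementation is considerably more elaborate. The function name is made up for this example, and x counts lines consumed from the old file, matching the grid above.

```python
def myers_edit_distance(old, new):
    """Size of the shortest insert/delete script, via Myers' greedy search."""
    m, n = len(old), len(new)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(m + n + 1):  # d = number of non-diagonal moves used
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]      # step down: insert a line from new
            else:
                x = furthest[k - 1] + 1  # step right: delete a line from old
            y = x - k
            # Slide along the diagonal for free while lines match.
            while x < m and y < n and old[x] == new[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= m and y >= n:
                return d
    return m + n  # not reached: d = m + n always suffices


print(myers_edit_distance(["A", "B", "C", "D"], ["A", "C", "E", "D"]))
```

The outer loop stops as soon as some path reaches the bottom-right corner, which is why the cost depends on d instead of on the full grid size.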
Diff as an Edit Script
Let's walk through a concrete edit script:
Old file → [A, B, C, D]
New file → [A, C, E, D]
Step 1: Delete B
A, C, D
Step 2: Insert E after C
A, C, E, D
Edit script: Delete B, Insert E — that's just 2 operations.
Git's diff output:
A
-B
C
+E
D
Clean, minimal, exactly right.
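An edit script is only useful if replaying it against the old file yields the new one. The sketch below shows that replay step under an assumed representation: a list of (operation, line) pairs with the operations "keep", "delete", and "insert". That format is invented for this example and is not Git's patch format.

```python
def apply_edit_script(old, script):
    """Replay (op, line) steps against `old` and return the resulting lines."""
    result = []
    i = 0  # position of the next unread line in `old`
    for op, line in script:
        if op == "insert":
            result.append(line)  # new line; nothing is consumed from `old`
        elif op in ("keep", "delete"):
            if i >= len(old) or old[i] != line:
                raise ValueError(f"script does not match old file at line {i}")
            if op == "keep":
                result.append(line)
            i += 1  # both keep and delete consume one old line
        else:
            raise ValueError(f"unknown operation: {op!r}")
    if i != len(old):
        raise ValueError("script ended before the old file was fully consumed")
    return result


script = [("keep", "A"), ("delete", "B"), ("keep", "C"), ("insert", "E"), ("keep", "D")]
print(apply_edit_script(["A", "B", "C", "D"], script))
```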
Why This Matters in Real Development
When we review code, we're not just looking at text changes — we're trying to understand intent.
A good diff makes that easy:
function calculateTotal(items) {
- return items.length;
+ return items.reduce((sum, item) => sum + item.price, 0);
}
Any reviewer immediately understands: the old code counted items, the new code sums their prices.
A bad diff creates noise and confusion. That's why diff algorithms matter — they're not just about correctness, they're about readability.
The Tradeoff: Shortest Diff vs Most Readable Diff
The "smallest" diff is not always the most readable one — especially in code with repeated patterns:
if (user) {
return true;
}
if (admin) {
return true;
}
if (owner) {
return true;
}
When many lines look similar, a diff algorithm can match the wrong lines. The result is technically correct but hard to read. That's why Git ships multiple algorithms.
Git's Four Diff Algorithms — With a Real Example
git diff --diff-algorithm=myers # default
git diff --diff-algorithm=minimal
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
Here's what each one does and when to use it:
Myers (default)
Fast, generally good results. This is what runs when you just type git diff. Best for everyday use.
Minimal
Tries harder to find the absolute smallest diff. Slower, but useful when patch size matters (e.g., generating patches to send via email).
Patience
Prioritizes human readability. It first matches lines that appear exactly once in both files and uses them as anchors, which avoids false alignments on repeated code. Best for reviewing refactors or moved code blocks.
Histogram
An evolution of Patience that also handles low-frequency lines well. Often produces the most readable output for real codebases. Some developers set this as their global default.
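The anchoring step that Patience relies on can be sketched as follows: collect the lines that occur exactly once in each file, then keep the longest run of them that appears in the same order in both. This covers only that first step. A full patience diff then recurses into the gaps between anchors, and Git's implementation differs in detail. The function name is illustrative.

```python
from bisect import bisect_left
from collections import Counter


def unique_anchors(old, new):
    """Return (old_index, new_index) pairs for in-order lines unique to both files."""
    old_counts, new_counts = Counter(old), Counter(new)
    new_index = {line: j for j, line in enumerate(new) if new_counts[line] == 1}
    # Candidate pairs, listed in old-file order.
    pairs = [(i, new_index[line]) for i, line in enumerate(old)
             if old_counts[line] == 1 and line in new_index]

    # Longest increasing subsequence of new-file positions (patience sorting).
    tail_values = []  # smallest new-index ending a run of each length
    tail_pairs = []   # index into `pairs` of that run's last element
    previous = [None] * len(pairs)
    for p, (_, j) in enumerate(pairs):
        k = bisect_left(tail_values, j)
        if k == len(tail_values):
            tail_values.append(j)
            tail_pairs.append(p)
        else:
            tail_values[k] = j
            tail_pairs[k] = p
        previous[p] = tail_pairs[k - 1] if k > 0 else None

    # Follow the back-pointers from the end of the longest run.
    anchors = []
    p = tail_pairs[-1] if tail_pairs else None
    while p is not None:
        anchors.append(pairs[p])
        p = previous[p]
    anchors.reverse()
    return anchors


old = ["if (user) {", "return true;", "}", "if (admin) {", "return true;", "}"]
new = ["if (user) {", "return true;", "}", "if (owner) {", "return true;", "}",
       "if (admin) {", "return true;", "}"]
print(unique_anchors(old, new))
```

The repeated lines (return true; and the closing brace) never become anchors, so they cannot pull the alignment toward the wrong block.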
Side-by-side example — Myers vs Patience:
Given this change (a function was refactored):
# Old
def process(data):
    result = []
    for item in data:
        result.append(item * 2)
    return result

# New
def process(data):
    return [item * 2 for item in data]
Myers output might show:
 def process(data):
-    result = []
-    for item in data:
-        result.append(item * 2)
-    return result
+    return [item * 2 for item in data]
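On a change this small, Patience produces the same hunk, because def process(data): is the only line the two versions share. The algorithms diverge on larger changes with repeated lines (blank lines, closing braces, return true;). There, Patience anchors on lines that are unique in both files, so hunks tend to follow function and block boundaries: less noise, same information.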
To set histogram globally (recommended for most developers):
git config --global diff.algorithm histogram
Algorithm Complexity Summary
| Algorithm | Rough Idea | Best For |
|---|---|---|
| Textbook LCS DP | O(m × n) time and space | Learning the concept |
| Myers diff | Efficient when files are mostly similar | Default everyday diffs |
| Minimal | Spends extra work to reduce diff size | Smaller patches |
| Patience | Uses unique lines as anchors | Refactors / moved blocks |
| Histogram | Extends patience using low-frequency lines | Often readable code diffs |
Where m = number of lines in the old file and n = number of lines in the new file.
Where the DSA Is Hiding
In competitive programming, LCS is a textbook DP problem. In the real world, the same idea powers:
Git diff
GitHub pull request review
VS Code file comparison
Merge conflict resolution
Google Docs version history
Code review platforms (Gerrit, Phabricator)
Patch generation
The input changes — lines of code, words in a document, DOM nodes in a UI, events in a timeline — but the core question is always the same:
What stayed the same, and what changed?
A Real-World Developer Example
Old code:
function createUser(name, email) {
const user = { name, email };
saveUser(user);
return user;
}
New code:
function createUser(name, email, role) {
const user = { name, email, role };
validateUser(user);
saveUser(user);
return user;
}
A well-tuned diff shows:
-function createUser(name, email) {
+function createUser(name, email, role) {
- const user = { name, email };
+ const user = { name, email, role };
+ validateUser(user);
saveUser(user);
return user;
}
Any reviewer immediately understands: a role parameter was added, it's stored on the user object, and validation was introduced before saving. Three changes, instantly clear.
That's the value of a good diff algorithm — it's not just computing differences. It's helping humans understand change.
Why Git Works at the Line Level (Not Character Level)
Git diffs at the line level because source code is naturally line-based. A character-level diff would be more precise. Compare the line-level diff:
-const total = price * quantity;
+const total = price * quantity * tax;
with the character-level view, which reports only that " * tax" was inserted before the semicolon.
But character-level diffs get noisy fast for code review. Line-level is the right abstraction for most developer workflows. (You can get word-level diffs with git diff --word-diff when you need them.)
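To see the character-level view on that example, the same sequence-comparison idea can be run over characters instead of lines. The sketch below uses SequenceMatcher from Python's standard difflib module. Note that difflib uses its own matching heuristic, not Myers and not Git's code, so treat it as an illustration of granularity only; char_changes is a made-up helper name.

```python
from difflib import SequenceMatcher


def char_changes(old_line, new_line):
    """List the (tag, old_text, new_text) spans where two strings differ."""
    matcher = SequenceMatcher(None, old_line, new_line, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # skip the spans both strings share
            changes.append((tag, old_line[i1:i2], new_line[j1:j2]))
    return changes


changes = char_changes(
    "const total = price * quantity;",
    "const total = price * quantity * tax;",
)
print(changes)
```

One short inserted span describes the whole change, which is precise, but for multi-line edits the same view quickly turns into scattered fragments.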
The best algorithm isn't always the most precise one. It's the one that gives the most useful output for the context.
LCS vs Myers: The Mental Model
LCS: Find the longest part that stayed the same.
Myers: Find the shortest set of changes to get from old to new.
LCS gives you the intuition.
Myers gives Git an efficient practical algorithm.
If the LCS is long → few changes needed.
If the LCS is short → many changes needed.
They measure the same thing from different directions.
Why This Is a Great Example of DSA in Real Life
Many beginners ask: "Where do we actually use DSA in real projects?"
git diff is one of the best answers — because every developer runs it daily without thinking about it.
When you run git diff, you're using an algorithm.
When you review a pull request on GitHub, you're using an algorithm.
When you resolve merge conflicts, you're relying on algorithms that compare versions of files.
The algorithm is invisible behind a clean developer experience. That's what good engineering looks like — the user sees red and green lines, and behind it is a carefully designed algorithmic solution built on decades of computer science research.
That's the beauty of DSA. Not just for interviews. Inside the tools you use every day.
Practical Commands to Try
# Try different algorithms on any repo
git diff --diff-algorithm=myers
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
git diff --diff-algorithm=minimal
# Word-level diff (great for prose or config files)
git diff --word-diff
# Set histogram as your permanent default
git config --global diff.algorithm histogram
Final Thoughts
LCS may look like just another DP problem when you first learn it. But the idea is powerful:
Find what stayed the same so we can understand what changed.
Git uses it to show file changes. Code review tools use it to help developers understand pull requests. Merge tools use it to combine work from different branches. Document editors use it to show version history.
So the next time you run git diff, remember: you're not just seeing red and green lines. You're seeing dynamic programming, graph search, and decades of algorithmic research — all compressed into a two-word command.




