Manuel Odendahl

⛔ Squash commits considered harmful ⛔

A recurring conversation in developer circles is whether you should use git merge --squash when merging or create explicit merge commits. The short answer: you shouldn't squash.

People have strong opinions about this. The thing is that my opinion is the correct one. Squashing commits has no purpose other than losing information. It doesn't make for a cleaner history. At most it helps subpar git clients show a cleaner commit graph, and saves a bit of space by not storing intermediate file states.

Let me show you why.

Git tracks contents, not diffs

In many ways you can just see git as a filesystem.
– Linus (in 'Re: more git updates..' - MARC)

Git is in many ways a very dumb graph database. When you check in code, it actually stores the content of all the tracked files in your repository.

The content of each file is stored as a "blob" node in the database. The filenames are stored separately in a "tree" node: If you rename a file, no new content node will be created. Only a new tree node will be created.
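
The rename claim is easy to check in a throwaway repository. The following is a minimal sketch, assuming a Git recent enough to support git init -b; the file names and the demo identity are made up for illustration.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo hello > foo.txt
git add foo.txt && git commit -q -m "Add foo.txt"
blob_before=$(git rev-parse HEAD:foo.txt)      # blob ID of the file content
tree_before=$(git rev-parse 'HEAD^{tree}')     # tree ID holding the file name

git mv foo.txt bar.txt
git commit -q -m "Rename foo.txt to bar.txt"
blob_after=$(git rev-parse HEAD:bar.txt)
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "blob before: $blob_before"
echo "blob after:  $blob_after"
echo "tree before: $tree_before"
echo "tree after:  $tree_after"
```

The two blob IDs should match, because the content did not change, while the tree IDs differ, because the tree is where the name lives.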

Commits are stored as "commit" nodes. A commit object points to a tree, and adds metadata: author, committer, message and parent commits. A merge commit has multiple parents.

Here is a visualization from Scott Chacon's Git Internals:

[Image: git object model diagram from Git Internals]

Looking at a real git repository

Enough theory, we have work to get done. Let's create a simple git repository:



> mkdir squash-merges-considered-harmful
> cd squash-merges-considered-harmful 
> git init
> echo hello > foo.txt
> git add foo.txt
> git commit -m "Initial commit"
[main (root-commit) 02a154b] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt
> echo more >> foo.txt
> git add foo.txt
> git commit -m "Add more" 
[main 16660f8] Add more
 1 file changed, 1 insertion(+)



We can now look at the contents of the objects we created:



# initial commit
❯ git cat-file -p 02a154b
tree f269b7cd59094d5365ef6b5618098cbcbeee0c43
author Manuel Odendahl <wesen@ruinwesen.com> 1653303427 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653303427 -0400

Initial commit
# initial tree
❯ git cat-file -p f269b7cd59094d5365ef6b5618098cbcbeee0c43
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    foo.txt
# initial foo.txt
❯ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

# second commit
❯ git cat-file -p 16660f8
tree 5a0c4a660a13c0ada7611651399abb362756f83e
parent 02a154bc4f0fa9bca567676d45d136619c076a95
author Manuel Odendahl <wesen@ruinwesen.com> 1653303485 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653303485 -0400

Add more
# second tree
❯ git cat-file -p 5a0c4a660a13c0ada7611651399abb362756f83e
100644 blob 2227cddb7f6318ea735a1c4adb52f5cd36c5783c    foo.txt
❯ git cat-file -p 2227cddb7f6318ea735a1c4adb52f5cd36c5783c
hello
more




Branches, tags (and branches, tags on remote repositories) are just pointers to commit nodes.

cat .git/refs/heads/main         
16660f8b1d1538ed1b55d8533b3ee7feb68e474c
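
To see that a branch is nothing more than a named pointer, here is a small sketch that creates one branch the usual way and a second one by writing the ref directly with git update-ref. The branch names pointer-demo and hand-made are invented for this example.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

echo hello > foo.txt
git add foo.txt && git commit -q -m "Initial commit"
head_commit=$(git rev-parse HEAD)

# One branch created normally, one created by writing the pointer ourselves.
git branch pointer-demo
git update-ref refs/heads/hand-made "$head_commit"

branch_commit=$(git rev-parse pointer-demo)
hand_made_commit=$(git rev-parse hand-made)

# List every local branch with the commit ID it points to.
git for-each-ref --format='%(objectname) %(refname)' refs/heads
```

All three branches should resolve to the same commit ID, since no new commit was created.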



But we still use diffs and merges

But Manuel, you ask, how does git diff and git merge and all that funky stuff work?

When you run git diff, git runs a diff algorithm (it ships with several) to compare the state of two trees, every time.

When you do a rebase, git computes the diff for each commit of the branch before rebase, and then applies those diffs to the destination, thus "moving" the branch over to the destination, with fresh tree and commit nodes.
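
One way to observe this is with git patch-id, which hashes the diff a commit introduces. The sketch below assumes a rebase that applies cleanly; the branch and file names are made up. The commit ID should change after the rebase while the patch ID stays the same.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo base > base.txt;    commit "Base"

git checkout -q -b feature
echo feature > a.txt;    commit "Add a.txt"
old_commit=$(git rev-parse HEAD)
old_patch=$(git show HEAD | git patch-id --stable | cut -d' ' -f1)

git checkout -q main
echo main > b.txt;       commit "Add b.txt"

# Replay the feature commit on top of the new main.
git checkout -q feature
git rebase -q main
new_commit=$(git rev-parse HEAD)
new_patch=$(git show HEAD | git patch-id --stable | cut -d' ' -f1)

echo "commit: $old_commit -> $new_commit"
echo "patch:  $old_patch -> $new_patch"
```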

When you do a merge, git first searches for the common ancestor of both branches to be merged, the merge base (this can be a bit more involved depending on your graph). It computes the diff of each branch against that ancestor, and then combines both diffs in what is called a three-way merge.
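
You can ask git for that common commit yourself with git merge-base, and then look at the two diffs that feed the three-way merge. A minimal sketch with made-up branch and file names:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo base > foo.txt;            commit "Base"
fork_point=$(git rev-parse HEAD)

git checkout -q -b feature
echo feature > feature.txt;     commit "Feature work"

git checkout -q main
echo main > main.txt;           commit "Main work"

# The commit both branches descend from.
base=$(git merge-base main feature)
echo "merge base: $base"

# The two sides of the three-way merge, each relative to the base.
git diff --stat "$base" main
git diff --stat "$base" feature
```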

The resulting commit has multiple parent fields. The parent fields don't really mean anything except for informational purposes; the tree the merge commit points to is what actually counts. Once a three-way merge has been computed and applied, git doesn't really care how the resulting tree was computed.
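
To make that concrete, the sketch below takes the tree of a real merge commit and uses the plumbing command git commit-tree to build a second commit that points to the same tree but lists only one parent. The names are invented for the example; this illustrates the object model and is not a suggested workflow.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo base > foo.txt;         commit "Base"
git checkout -q -b feature
echo feature > feature.txt;  commit "Feature work"
git checkout -q main
echo main > main.txt;        commit "Main work"

git merge -q --no-ff --no-edit feature
merge_commit=$(git rev-parse HEAD)
merge_tree=$(git rev-parse 'HEAD^{tree}')

# Same tree, but only the first parent is recorded.
single_parent=$(git commit-tree "$merge_tree" -p "$merge_commit^1" -m "Same tree, one parent")

# The two commits differ only in metadata, so the content diff is empty.
git diff --quiet "$merge_commit" "$single_parent" && echo "identical content"
```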

This is literally all there is to git, and the mental model that I use every day, even as I'm doing the most advanced git surgery.

What is a squash merge?

So what is a squash merge? A squash merge is the same as a normal merge, except that it records only one parent commit: the branch being merged is left out. It basically slices off a whole part of the git graph, which will later be garbage collected if it is not referenced anymore. You're basically losing information for no reason.

Let's look at this in practice. Let's create a few commits on top of the ones we have, and then do both a squash merge and a non-squash merge, and look at the results.



> git checkout -B work-branch
Switched to a new branch 'work-branch'
❯ echo "Add more" >> foo.txt
❯ git add foo.txt && git commit -m "Add more"
[main 4b84cfe] Add more
 1 file changed, 1 insertion(+)
❯ echo "Add more" >> foo.txt
❯ git add foo.txt && git commit -m "And more"
[main 1836f1c] And more
 1 file changed, 1 insertion(+)
❯ git checkout -B no-squash-merge main
Switched to a new branch 'no-squash-merge'
❯ git merge --no-squash --no-ff work-branch
Merge made by the 'ort' strategy.
 foo.txt | 2 ++
 1 file changed, 2 insertions(+)
❯ git checkout -B squash-merge main
Switched to a new branch 'squash-merge'
❯ git merge --squash --ff work-branch
Updating 16660f8..1836f1c
Fast-forward
Squash commit -- not updating HEAD
 foo.txt | 2 ++
 1 file changed, 2 insertions(+)
❯ git commit
[squash-merge 150c57d] Squashed commit of the following:
 1 file changed, 2 insertions(+) 



Let's look at the resulting graph and commits.



❯ git log --graph --pretty=oneline --abbrev-commit --all
* 150c57d (HEAD -> squash-merge) Squashed commit of the following:
| * 535b740 (no-squash-merge) Merge branch 'work-branch' into no-squash-merge
|/| 
| * 1836f1c (work-branch) And more
| * 4b84cfe Add more
|/  
* 16660f8 (main) Add more
* 02a154b Initial commit
❯ git cat-file -p no-squash-merge
tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
parent 1836f1c53221ae701a038bf5ae380770ea911665
author Manuel Odendahl <wesen@ruinwesen.com> 1653304391 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653304391 -0400

Merge branch 'work-branch' into no-squash-merge

* work-branch:
  And more
  Add more

❯ git cat-file -p squash-merge   
tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
author Manuel Odendahl <wesen@ruinwesen.com> 1653304543 -0400
committer Manuel Odendahl <wesen@ruinwesen.com> 1653304543 -0400

Squashed commit of the following:

commit 1836f1c53221ae701a038bf5ae380770ea911665
Author: Manuel Odendahl <wesen@ruinwesen.com>
Date:   Mon May 23 07:11:08 2022 -0400

    And more

commit 4b84cfe11aa51da994448e602e1bc4cc6083d691
Author: Manuel Odendahl <wesen@ruinwesen.com>
Date:   Mon May 23 07:11:03 2022 -0400

    Add more




You can see that both squash-merge and no-squash-merge point to the exact same tree. The only differences are the commit message and the missing parent in the squash merge.
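
If you would rather not compare hashes by eye, the same experiment can be scripted. This is a sketch that rebuilds the repository from scratch, so the commit IDs will differ from the ones shown in this post; the squash commit message is made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo hello > foo.txt;        commit "Initial commit"
echo more >> foo.txt;        commit "Add more"

git checkout -q -B work-branch
echo "Add more" >> foo.txt;  commit "Add more"
echo "Add more" >> foo.txt;  commit "And more"

git checkout -q -B no-squash-merge main
git merge -q --no-ff --no-edit work-branch

git checkout -q -B squash-merge main
git merge -q --squash work-branch
git commit -q -m "Squashed work-branch"

merge_tree=$(git rev-parse 'no-squash-merge^{tree}')
squash_tree=$(git rev-parse 'squash-merge^{tree}')
merge_parents=$(git cat-file -p no-squash-merge | grep -c '^parent ')
squash_parents=$(git cat-file -p squash-merge | grep -c '^parent ')

echo "trees:   $merge_tree / $squash_tree"
echo "parents: $merge_parents / $squash_parents"
```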

To read more about the underpinnings of git, I can recommend just experimenting with the git command line, and the following resources:

But the history!

But Manuel, you say, the history is so much cleaner!

To which I counter that it actually is not. If you want to hide the link to the right parent of the non-squash merge (as it is called, the left parent being main), all you need to do is hide it. If you use the command line or a proper tool, use the option to show only first parents. If you only look at the first parent, and configure your git tool to fill in the full log history of the branch into the merge commit message (I personally use the GitHub CLI gh or some git commit hooks to do it), the squash merge commit and the non-squash merge commit show the same information.
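
As a rough illustration, plain git can do both halves on its own. The sketch below uses git merge --log, which is one built-in way to copy the branch's one-line summaries into the merge message (a swapped-in technique, not the gh or hook setup mentioned above), and then counts commits with and without --first-parent. Branch names and messages are invented.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo hello > foo.txt;   commit "Initial commit"

git checkout -q -b feature
echo one >> foo.txt;    commit "Add feature part 1"
echo two >> foo.txt;    commit "Add feature part 2"

git checkout -q main
# --log appends one-line summaries of the merged commits to the message.
git merge -q --no-ff --log --no-edit feature

merge_message=$(git log -1 --format=%B)
first_parent_count=$(git rev-list --count --first-parent HEAD)
total_count=$(git rev-list --count HEAD)

echo "$merge_message"
echo "commits on the first-parent line: $first_parent_count of $total_count"
git log --first-parent --oneline
```

The first-parent view shows only the initial commit and the merge, yet the two feature commits remain in the graph for whenever you need them.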

A favorite git log command of mine to quickly look at the history of the main branch, and create a changelog:



> git log --pretty=format:'# %ad %H %s' --date=short --first-parent --reverse
# 2022-05-23 02a154bc4f0fa9bca567676d45d136619c076a95 Initial commit
# 2022-05-23 16660f8b1d1538ed1b55d8533b3ee7feb68e474c Add more
# 2022-05-23 535b740f42e331175f3766c1374116e329a78f7e Merge branch 'work-branch' into no-squash-merge



When using GitHub and pull requests, this shows the author, the branch name (which in my case contains the ticket name and a short description) and the date on a single line. Here's a slightly more complex real-world example (anonymized):


2021-12-15 123 Merge pull request #5937 from garbo/TK-234/feature-1

2021-12-16 234 Merge pull request #5938 from bongo/TK-235/feature-2

2021-12-16 456 Merge pull request #5939 from gingo/TK-236/feature-3





But why?

But Manuel, why keep all those commits lying around when we have all we need in the commit message?

The first reason comes down to just preference. I like to see the actual log of what a person did on their branch. Did they do many small commits? On which days (this might make looking up documents or slack conversations related to the work easier)? Did they merge other branches into their work (useful when resolving merge conflicts and other boo boos)?

I have done a lot of git cleanup work, and while they are not supposed to exist, big merges with thousands of lines happen, and having a single monolithic commit that contains 80 different changes is a nightmare.

The other reason is what makes the side history extremely useful. When hunting down a bug, I often use git bisect. I first use git bisect start --first-parent to jump from main commit to main commit. But once I have found which pull request led to the bug, I bisect on the original branch. Instead of having to figure out which line in the pull-request merge might cause the bug, I have a much more granular path. Often, it surfaces a single line commit, and leads to a painless and immediate bugfix.

As you can drive your bisect with your unit tests, you often have no work to do at all, given sufficiently atomic and small commits on side branches. Losing that capability would seriously impact my sanity when I have to fix bugs.

Conclusion

And that is why squashing history is harmful. It's literally just deleting information from the git graph by dropping a single parent entry from the merge commit.

Latest comments (80)

Rasmus Schultz

The subtlety that's always missing in these discussions is that this isn't really (or at least doesn't need to be) an across-the-board decision.

In most work settings, there are two scenarios - features and minor changes:

  1. Guy works on a new feature for a week and makes tons of commits, essentially just using git like a "disk drive" - this sounds bad, but if that's what they set out to do for this particular task, squashing before submitting the PR is probably a must.

  2. Guy makes 4-5 changes to package.json, carefully documenting each change with commit messages explaining why he made that particular change. Squashing in this case would be absolutely terrible.

If you squash in case (2) you're making everyone's job harder. If I'm trying to solve a problem, and I place the cursor over a version constraint in package.json, and I see Guy's most recent commit to this line, I need to be able to see why he made that particular change - if it's been squashed with 4-5 other changes, I can't tell which reasons were the ones pertaining to that particular line change. I can't make sense of the changes, and I can't revert the change.

On the other hand, if you didn't squash in case (1) you're just leaving a very long and noisy commit log where every 10th or 50th commit explains anything actually useful.

Both of these situations are bad.

But if you've enabled squash commits as a default, you're going to lose a lot of useful information - that's why this is not a decision you can make across-the-board. It needs to be every contributor's decision if they squash - they should do that locally and not bother reviewers with even seeing that noisy history.

(and if your team's entire log history is useless noise that never helpfully explains why any change is made, well, then squash away for all I care - you've got a much bigger cultural problem.)

Adam Misrahi • Edited

Absolutely true that merge is the heart of git and squashing is a kind of perversion. It just so happens that it's exactly the right kind of perverted for some team workflows.

In my company we have a pretty strict rule about squashing your commits on your private branch into a neat, minimal history before you merge your PR. This makes sense for keeping shared history manageable but causes many problems.

  1. Each time the author thinks they're about to get approved, they squash and then get more requests for revisions. I don't need to tell you what a mess force pushing on an active PR can get you into, particularly if another contributor suddenly adds a commit.
  2. Right now I have a branch where I made a commit, then merged trunk, then added a tiny commit for a quick fix. Squashing that commit is kind of tricky because if I type git rebase -i HEAD~2 I don't get the last 2 commits, I get all the commits that were merged in. That's one of the reasons our company's policy is to rebase on trunk, not merge it in. You can see how far we're getting from the promised land of merge-only purity here, and toward the greater pain of fixing merge conflicts with rebase.
  3. Rewriting history is tricky at the best of times. It's much easier to make a mistake than handling conflicts after a merge. It's stressful and bugs frequently result.

And there's also a much more important problem. With every PR, the last commit has passed through an extensive CI pipeline of checks. Ensuring every commit in the PR passes all checks is infeasible. As a result, all those other commits must be assumed to be broken (especially when they're artificial Frankenstein commits created on the fly in an interactive rebase). These unsafe commits lie around like hand grenades with loose pins. Release managers must be extremely careful to avoid them when trying to assemble a good release. As a rule of thumb that means ignoring everything except merge commits, but what if there's a fast-forward merge? We could find a workaround for that, but why should a large team of variously skilled developers be exposed to a history that is mostly made of unsafe commits?

A solution

  1. Ban force pushing everywhere
  2. Make squash-merge the only allowable place to rewrite history and encourage its use.

Outcome

  • Destructive and hard to manage bullshit during PR authorship ends
  • Trunk becomes a safe place with well-rounded commits that you can cherry-pick or revert without racking your brains
  • The PR is bolstered in a role as the smallest atom of meaningful change, which is the way most organisations are aiming to use it. This helps encourage the practice of keeping them small.

Drawbacks

  • git blame is less helpful for getting granular detail. If we were using git the way its free software advocates intended, as a complete and self-contained source of truth for codebase history, then this would be a dealbreaker. But that's not how we're using it. We have a self-hosted Gitlab and the PRs with all their attached comments and CI runs are a far richer source of historical information than the git repo alone. When I need to understand why something was done that's always where I first look. The purism of avoiding vendor lock-in is nice, but I've never seen it amount to something for repo hosting.
  • Banning all rebasing and squashing everywhere would go with the grain even better and allow lots of graph traversing techniques you mention. But I haven't yet worked somewhere that accepted the noisy work-in-progress git history that would usually entail.
  • You have to be strict about your branching workflow. Feature branch commits and squashed commits don't mix. If you want to base a PR not on trunk but on some other PR, then be prepared for that second PR to list a lot of nonsense commits that will confuse a reviewer, and that you'll have to remove from the resulting squash message before you merge.

Conclusion

I agree with these people that the benefit far outweighs the gotchas in practice.

Adrian H

For any developers who actually code professionally in a big team, this is horrible advice, nobody has time to care about your commit history it's just noise. People don't spend time investigating where this original issue happened in which sub-commit of which branch, blah blah, you are busy finding solutions and moving on to the next task.

Pedantic un-pragmatic advice, ignore this article and start coding, you are not an academic or a historian.

Békési István

“Squashing commits has no purpose other than losing information.”
Cleaning your house has no purpose other than losing dirt. Pretty neat tho…

Frank Weindel

What no one is mentioning is that the squash feature on GitHub PRs preserves the original commits that were squashed. If you REALLY need to go back and examine the granular history of a PR then you can still do so. On teams I've been on, we strive for PRs that are not over scoped and where that squashed message tells you exactly what feature / fix was added by those line changes. We also often use Conventional Commits, which help in a big way with release note automation. When I look at a main branch history like this:

fix: No long errors when pressing enter (#29) [tag: v1.1.0]
feat(ModuleA): Add metrics logging (#27)
feat(ModuleB): Support extra query string params in routes (#20)
chore: Update node to 16.13.1 (#24)
test: Require at least one assertion with every test (#21) [tag: v1.0.0]

I see VERY clearly what features and fixes have gone in between versions 1.0.0 and 1.1.0. I also have easy links to the PRs that were squashed to produce those commits if I need to drill down any further. If a feature needs to be reverted it's a very easy reversion (no extra parameters).

If you end up with a PR that has a very large scope, there are two things one can/should do:

  • Split the PR in to multiple easier to review dependent PRs (preferred)
  • Carefully massage and curate the commits in a PR and use the Rebase Merge option in the PR
    • Making sure the PR number gets appended to each commit's title for backtracing.

I'll agree that just having a "linear commit history" shouldn't be the only reason for doing squash commits. But if it simplifies your team's workflow, reduces cognitive load, and makes understanding exactly what is included in a release easier to find then I say it's worth doing.

That said, I feel there is a "best of both worlds" place we could get to if we strived for it.

First, a merge commit is really a squash but with an extra parent link to the branch where the squash originated. I wish this was driven home more. However, the standard commit title for these PR merge commits is always something like these:

Merge pull request #43 from foo/fix-bar
Merge pull request #42 from foo/add-bar

Those titles don't help me. It has the PR number, which I have to individually click on and look up, and a branch name, which could easily be too brief, poorly written or even irrelevant.

Nothing in Git, from what I understand, prevents those titles being similar to the squash titles I shared above. So if GitHub produced them you'd suddenly have the clarity you have with squash merges.

Second, the Git CLI commands and various Git UIs default to showing/working with the full branch out history of everything. You need to know special parameters or set certain settings in order to see and work with a simplified linear view. If these tools defaulted to a linear history and required special parameters in order to drill down into merged branches I feel that would improve the developer experience a bunch. You get an easy to understand summarized linear history and the ability to go deeper when you need to.

Of course, outside of the GUIs maybe, any changes in how people work with Git are very hard pushes from what I understand.

Third, GitHub could allow the PR author to set the merging strategy to be used in advance. Since each developer may have their own style, some with very intentful and effortful PR commits like your own @wesen, others who commit WIP things quickly and often, and some with a mix. This gives the author the ability to decide how they will ultimately formulate their PR. Obviously certain projects can still limit what PR merge strategies are available, and admins could still override the author's preset wishes. But the author at least has a chance to influence how the commits in the PR will be laid into the base branch.

But just to wrap up my argument, I think like any other tool squashing vs merging vs rebasing are options that teams can consider and make a decision on using given whatever their needs and circumstances are. There is no one size fits all approach to it.

Rasmus Schultz

GitHub could allow the PR author to set the merging strategy to be used in advance. Since each developer may have their own style, some with very intentful and effortful PR commits like your own @wesen, others who commit WIP things quickly and often, and some with a mix. This gives the author the ability to decide how they will ultimately formulate their PR.

Alternatively, if your git history after working on a new feature is just "wip brrr whoops", don't even bother your team with having to see that in your PR: squash them on your own local system before opening the PR. 🙌

Antoine LUCAS • Edited

It feels so good to read a tech blog not written by a junior enthusiast with a degree trying to skip that 5 years learning period.
Thanks mate.
I despise git for always going against my way, and I am looking at alternative CVS workflows. The whole fact that there is room for arguing is so toxic.

Alex Pushkarev

I really liked the way you explained the way squash is working. But it seems that the main argument against squashing is that it just drops the history?

That's fair, but it doesn't mean squash is bad, it means one just needs to know the cost, correct?

Manuel Odendahl

Yes. But most people argue that it "cleans up" history, because they are unaware that you can easily hide the right parent when printing out logs, for example. I find losing history a very high cost to avoid using a pretty-print flag.

Alex Pushkarev

That's a very good argument. The other perspective to consider is that more and more people don't use git from the command line, so they see whatever their git tool shows, which may be unable to do pretty-print in the first place.

Mark Lopez

One reason I'm for the squash merge camp hasn't been mentioned, so I think I should mention it here so @wesen can correct me.

We use squash merge to make it easy to revert a whole PR since it's just a commit.
It can be reverted after some time has passed easily by just reverting that commit.

What do you recommend for this, so I can leave the wrong squash merge camp and follow the righteous path, oh great @wesen?

Manuel Odendahl

You can do exactly the same for a merge commit by using git revert -m1. The squash merge commit and the merge commit both point to the same tree hash; they only differ in their parent commits. With a squash merge, you only have 1, so git revert knows "ok well you just want to revert to the parent". With the merge commit, you have 2, so you have to tell it "please use the left parent (aka, the parent on the main branch) to revert to". Easy peasy!
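
A quick sketch of that revert on a toy repository, with made-up branch and file names. After git revert -m 1 on the merge commit, the tree should be back to what main had before the merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo hello > foo.txt;   commit "Initial commit"
tree_before=$(git rev-parse 'HEAD^{tree}')

git checkout -q -b feature
echo one >> foo.txt;    commit "Feature part 1"
echo two >> foo.txt;    commit "Feature part 2"

git checkout -q main
git merge -q --no-ff --no-edit feature

# -m 1 selects the first (main-side) parent as the state to go back to.
git revert --no-edit -m 1 HEAD >/dev/null
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "before merge: $tree_before"
echo "after revert: $tree_after"
```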

Matthew O. Persico

A number of commenters have made the distinction between squashing before the PR is submitted and after it is submitted. I contend it's an irrelevant distinction.

If you are in the squash-before-but-not-after crowd, I counter that once a PR starts being reviewed, and updated and re-reviewed, you're going to get a whole list of commits that you'll end up wanting to squash anyway.

My criterion for squashing is this: for each particular commit, if you cannot roll back that commit and have a working functional system, then there is no point in having that commit in your history; squash it out.

Now, if you think there is value in the various conversations surrounding those commits, then keep them around, off the main branch like this:

  • make a copy of the branch the PR is sitting on (please tell me you're not modifying your main branch directly...), naming the copy archive/branchname
  • squash branchname
  • merge the PR, putting a reference to archive/branchname in the PR's comments.
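
The archive-then-squash steps above could look roughly like this. The comment does not say how to squash, so the soft reset to the merge base is an assumption (an interactive rebase would do the same job), and feature stands in for branchname.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q -b main
git config user.name "Demo" && git config user.email "demo@example.com"

# Hypothetical helper: stage everything and commit with the given message.
commit() { git add -A && git commit -q -m "$1"; }

echo hello > foo.txt;   commit "Initial commit"

git checkout -q -b feature
echo one >> foo.txt;    commit "wip"
echo two >> foo.txt;    commit "fix typo"
echo three >> foo.txt;  commit "address review"

# 1. Keep the full history reachable under an archive name.
git branch archive/feature feature

# 2. Squash the working branch: move the branch pointer back to the
#    merge base but keep the final content staged, then commit once.
git reset -q --soft "$(git merge-base main feature)"
git commit -q -m "Add feature (full history: archive/feature)"

squashed_count=$(git rev-list --count main..feature)
archived_count=$(git rev-list --count main..archive/feature)
echo "feature: $squashed_count commit(s), archive/feature: $archived_count commit(s)"
```

Both branches end up with the same tree; only the archive keeps the individual commits.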
 
Andres Becker

Great post!