DEV Community

root

Git does not store Diffs

A common misconception about Git is that it stores diffs. For a Git beginner this is a logical assumption, because Git presents commits as diffs to the end user, which makes changes easier to identify, as shown in the image below.

[Screenshot: a commit displayed as a diff]

But Git does not internally store diffs for each commit; it creates a snapshot of the files at each commit.
This means that if you create a file in one commit and modify the same file in a second commit, Git will have 2 snapshots of that file. Files that did not change are not copied again; the new commit simply points to the object that already exists.

To show how Git stores these files internally, we are going to dig into the .git/ folder. Understanding how Git handles files will give a better picture; check out my blog on Git Objects for that.
TL;DR: All the files we add to a Git repository are stored as objects in the .git/objects/ directory.

(Feel free to skip ahead to the Why ? section at the end for the reasoning behind using snapshots instead of diffs.)

Let's see this in action

$ mkdir gitinternals && cd gitinternals # create and cd into a dir
$ git init # initialize git
Initialized empty Git repository in /Users/home/gitinternals/.git/

Now that we have a fresh Git repository, let's create a file and commit it.

$ echo "# Hello world" > README.md
$ git add README.md
$ git commit -m "Add readme file"
[master (root-commit) 0948529] Add readme file
 1 file changed, 1 insertion(+)
 create mode 100644 README.md

Let's take a look at the Git objects directory; it should now contain the README file.

$ tree -a .git/objects/
.git/objects/
├── 09
│   └── 4852928af802dfe0f463359c7ade3f7a21fffa
├── 71
│   └── 6ed1421c738a75abe6e0c4812ad4aacee0e11a
├── a5
│   └── ef91ee14be786131cbecfd2eb8c7fef8a2510d
├── info
└── pack
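
The layout in this listing follows a simple rule for loose objects: the first two hex characters of an object ID name the subdirectory, and the remaining 38 name the file inside it. A minimal sketch of that mapping, using one of the IDs from the listing (object_path is a hypothetical helper name, not a Git command):

```shell
# Hypothetical helper: map a full 40-character object ID to its loose-object path.
object_path() {
  id=$1
  rest=${id#??}          # everything after the first two characters
  dir=${id%"$rest"}      # only the first two characters
  printf '.git/objects/%s/%s\n' "$dir" "$rest"
}

object_path 716ed1421c738a75abe6e0c4812ad4aacee0e11a
# prints .git/objects/71/6ed1421c738a75abe6e0c4812ad4aacee0e11a
```

This only describes loose objects; objects that Git has packed live under .git/objects/pack instead.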

The objects directory has three objects. We can check the type of each object with the git cat-file -t plumbing command.
With that we can see that 716e is the blob, which should be our README.md file, as we only have one blob in our repository. Using git cat-file -p we can see the contents of the object.
Feel free to refer back to Git Objects if you want to know more about Git objects.

$ git cat-file -t 716e
blob

$ git cat-file -p 716e
# Hello world
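
The object ID is derived from the content itself, which is why Git can store whole snapshots without duplicating identical files. As a sketch, assuming a SHA-1 repository (the default) and that sha1sum is installed (on macOS, shasum -a 1 is the usual substitute): a blob's ID is the SHA-1 of a short header, "blob", a space, the size in bytes and a NUL byte, followed by the file's bytes. The blob_id helper and the sample content below are illustrative and not part of Git:

```shell
# Hypothetical helper: compute a blob ID by hand, as Git does in SHA-1 repositories.
blob_id() {
  size=$(wc -c < "$1" | tr -d ' ')     # content length in bytes
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmpfile=$(mktemp)
printf 'test content\n' > "$tmpfile"

manual=$(blob_id "$tmpfile")
from_git=$(git hash-object --stdin < "$tmpfile")   # Git's own answer

echo "$manual"
echo "$from_git"
# both print d670460b4b4aece5915caf5c68d12f560a9fe3e4
rm -f "$tmpfile"
```

Because the ID depends only on the bytes, identical content always maps to the same blob, while any change to the file produces a new blob with a different ID.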

Let's modify the README by appending a line containing . to the end of the file, and commit again.

$ echo "." >> README.md
$ git add README.md
$ git commit -m "Update readme"
[master ccab425] Update readme
 1 file changed, 1 insertion(+)

Checking the objects directory

$ tree -a .git/objects/
.git/objects/
├── 09
│   └── 4852928af802dfe0f463359c7ade3f7a21fffa
├── 28
│   └── af00ee0e3e44d7806dc1c2d7f1a9c9d75cfd8e
├── 5f
│   └── a99c8ea90f41ae4601f92ea7475832e6fb773d
├── 71
│   └── 6ed1421c738a75abe6e0c4812ad4aacee0e11a
├── a5
│   └── ef91ee14be786131cbecfd2eb8c7fef8a2510d
├── cc
│   └── ab425bf34937b0e02ed807724af39812e8988b
├── info
└── pack

There are a few more objects in the directory. To find the new blob, we could check the type of each object and figure out which one is the new version of README.md, but there is a more methodical approach.

We can take the current commit hash, check which tree that commit points to, and from the tree get the blob hash. Please refer to Git Objects if you are unfamiliar with commit and tree objects.

# get hash of current commit
$ git rev-parse --short HEAD
ccab425
# check content of the commit to get tree hash
$ git cat-file -p ccab425
tree 28af00ee0e3e44d7806dc1c2d7f1a9c9d75cfd8e
parent 094852928af802dfe0f463359c7ade3f7a21fffa
author root <root@email.com> 1601836557 +0530
committer root <root@email.com> 1601836557 +0530

Update readme
# check content of tree `28af00`
$ git cat-file -p 28af00
100644 blob 5fa99c8ea90f41ae4601f92ea7475832e6fb773d    README.md

The hash of the new README.md snapshot is 5fa99c. Let's inspect the contents of the old and new snapshots.

# New snapshot hash
$ git cat-file -p 5fa99c
# Hello world
.

# Old snapshot hash
$ git cat-file -p 716ed1
# Hello world
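
As a shortcut for the commit, tree, blob walk above, git rev-parse accepts a revision:path expression that resolves straight to the blob ID. The sketch below rebuilds the same two commits in a throwaway repository (the temporary directory and the identity passed with -c are placeholders) and checks that each commit has its own complete copy of README.md:

```shell
# Recreate the two commits in a scratch repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "# Hello world" > README.md
git add README.md && commit "Add readme file"
echo "." >> README.md
git add README.md && commit "Update readme"

# <revision>:<path> resolves directly to the blob for that path in that commit.
new_blob=$(git rev-parse HEAD:README.md)
old_blob=$(git rev-parse HEAD~1:README.md)

# Each blob holds the whole file: 14 bytes before the change, 16 bytes after.
old_size=$(git cat-file -s "$old_blob")
new_size=$(git cat-file -s "$new_blob")
echo "old: $old_blob ($old_size bytes)"
echo "new: $new_blob ($new_size bytes)"
```

Since blob IDs depend only on content, the IDs printed here should match the 716ed1 and 5fa99c hashes from the walkthrough, provided the file bytes are identical.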

Why ?

Storing each version of a file as a full snapshot might seem inefficient, since storing just the diffs would use less storage. So it is only natural to ask why.

Storing diffs and reconstructing a file by applying them on top of a base version can become computationally expensive on large projects with thousands or even millions of files.
SVN uses the diff-based approach, and large projects commonly see checkouts that take hours, because CPU and I/O become the bottleneck when applying those diffs.

SVN was popular when storage was costlier than compute, but the tables have turned. Storage is much cheaper today, which makes Git's snapshot approach a practical choice for version control.

Git has some built-in optimizations for compressing objects so that they take less storage on disk, which we will cover in an upcoming blog :).
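
As a brief preview of that optimization, and only as a sketch: loose objects are zlib-compressed individually, and git gc can repack them into a packfile, where Git may store similar objects as deltas against one another. The snapshot model stays the same; delta encoding is a storage detail of the pack. The scratch repository and the identity passed with -c below are placeholders:

```shell
# Build a one-commit scratch repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
echo "# Hello world" > README.md
git add README.md
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add readme file"

git count-objects -v     # "count" is the number of loose objects
git gc -q                # repack loose objects into a packfile
git count-objects -v     # "in-pack" now reports the packed objects

# Read the packed-object count for inspection.
in_pack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
echo "objects in pack: $in_pack"
```

After packing, git cat-file -p still returns the full content of every blob, so the commands used throughout this post keep working unchanged.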
