Rik Tonnard for Kabisa Software Artisans

Posted on Jul 5, 2019 • Originally published at theguild.nl on May 16, 2019

Git: The Object Database

#git #versioncontrol #objectdatabase

Nowadays, Git is probably the most used version control system. Committing, pushing, merging: most developers know how to do this, but things like resetting, cherry-picking and rebasing can be difficult to grasp completely. Git has the power to change and delete files in your working directory, so it can be scary to try something without being absolutely sure what it does¹.

Part of this is because the Git CLI can be confusing at times. But it is important to know that Git only provides a thin layer of abstraction on top of its internal data structures. Hence learning the data structures of Git can help you with using Git effectively.

That's what this series of blog posts is all about: learning about the internals of Git. This will help you understand Git better and it makes using Git a lot less daunting.

.git

The most important folder we will be looking at is the .git folder. This folder is where Git stores all of its project-specific data. I believe a good way to learn what Git does is to just inspect it. So, follow along and set up a new repository.

$ mkdir my-new-repository
$ cd my-new-repository
$ git init

Git has now created this .git folder, so let's take a look at it:

$ cd .git
$ ls -1F
HEAD
config
description
hooks/
info/
objects/
refs/

Three files and four folders have been created. We will only look at the objects folder for now.

Building a file system using blobs and trees 🌳

The objects folder only contains two empty folders right now. Let's ignore those for now and discuss what Git considers objects. Objects are Git's most basic data structure and there are several types of them. Every object can be referenced using its hash, which is often represented as a 40-character hexadecimal number. We will discuss this hash in more detail later on, but for now it is enough to know that every object has its own unique hash.

There are four types of objects: blobs, trees, commits and (annotated) tags.

Blobs are the simplest type of object. The contents of this object are just binary data. As far as Git is concerned, these objects are just arbitrary 1s and 0s with no special meaning. Most of the blobs that are stored will contain the contents of a file.

A perfectly valid example of what a blob might contain is this:

Hello, world!

The next type of object is a tree. A tree is a list of named references to other objects. These references can point to blobs, other trees, or commits. A reference to a commit is only used in the case of Git submodules², so we'll ignore those. Trees are very useful for representing folder contents, and that is exactly what Git uses them for.

An example of what a tree might look like is this:

100644 blob 2769292d44c669aebc3959fe4852d7b661302fa4    LICENSE
100644 blob a5c19667710254f835085b99726e523457150e03    README
040000 tree 610b81880e04b3fa39470635e0a6204474373c3d    spec
040000 tree 5141f7c9c700f90680739107c9db41448643ff2b    src

The first column is the mode of that specific entry, which is just some metadata. You might recognize the file permissions as the last three digits in the first two entries, for instance. The second column contains the type of object and the third column contains the actual reference, the hash, to the object. The last column is the name of the object in this specific tree, so that would be the name of a file or subfolder in a folder.
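To make the four columns concrete, here is a small Python sketch that splits such a listing into its fields. It only handles the human-readable listing shown above, not the binary form in which Git stores trees, and `TreeEntry` and `parse_tree_listing` are names made up for this illustration.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str      # first column: metadata such as file permissions
    obj_type: str  # second column: "blob" or "tree" (or "commit" for submodules)
    obj_hash: str  # third column: the hash of the referenced object
    name: str      # fourth column: the name inside this tree


def parse_tree_listing(listing):
    """Parse the four columns of a pretty-printed tree."""
    entries = []
    for line in listing.splitlines():
        # Split on whitespace at most three times so names with spaces survive.
        mode, obj_type, obj_hash, name = line.split(None, 3)
        entries.append(TreeEntry(mode, obj_type, obj_hash, name))
    return entries


listing = """\
100644 blob 2769292d44c669aebc3959fe4852d7b661302fa4    LICENSE
040000 tree 610b81880e04b3fa39470635e0a6204474373c3d    spec
"""

for entry in parse_tree_listing(listing):
    kind = "folder" if entry.obj_type == "tree" else "file"
    print(entry.name, kind, entry.mode, entry.obj_hash[:7])
```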

Notice how we have these flat objects, but we can represent a nested directory structure with them. Trees can contain references to other trees, just like directories contain other directories. Using just these two types of objects, Git can create an entire folder structure. How convenient! We can represent the full state of the repository at a single point in time with these two object types.
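The idea of building a nested folder structure out of flat objects can be sketched with a toy object store in Python. This is an illustration only, not Git's implementation: `ObjectStore`, `put_blob`, `put_tree` and `walk` are hypothetical names, and the hash is computed over a simplified rendering rather than Git's real storage format.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: hash -> (type, payload)."""

    def __init__(self):
        self.objects = {}

    def _put(self, obj_type, payload, raw):
        # Simplified hashing for illustration; Git's real format differs.
        key = hashlib.sha1(obj_type.encode() + b"\0" + raw).hexdigest()
        self.objects[key] = (obj_type, payload)
        return key

    def put_blob(self, data):
        return self._put("blob", data, data)

    def put_tree(self, entries):
        # entries is a list of (name, hash) pairs, kept sorted by name.
        entries = sorted(entries)
        raw = "\n".join(f"{name} {ref}" for name, ref in entries).encode()
        return self._put("tree", entries, raw)

    def walk(self, tree_hash, prefix=""):
        """Yield (path, contents) for every blob reachable from a tree."""
        _, entries = self.objects[tree_hash]
        for name, ref in entries:
            obj_type, payload = self.objects[ref]
            if obj_type == "tree":
                # A tree inside a tree is a subfolder: recurse into it.
                yield from self.walk(ref, prefix + name + "/")
            else:
                yield prefix + name, payload


store = ObjectStore()
readme = store.put_blob(b"Hello, world\n")
main_py = store.put_blob(b"print('hi')\n")
src = store.put_tree([("main.py", main_py)])
root = store.put_tree([("README", readme), ("src", src)])

for path, data in store.walk(root):
    print(path, len(data))
```

The store itself is a flat dictionary. The nesting only appears when `walk` follows the references from one tree to the next, starting at the root tree.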

Adding a timeline using commits

So we can represent the state of the entire repository at a single point in time. But in order to have a version control system, we want to have representations of the repository at multiple points in time. That is where commits come in, the third type of object.

A commit can contain a lot of data, but in general it will contain at least the following information:

Zero or more references to parent commits
Author and committer information
A reference to a tree
A commit message

This is what a commit might look like:

tree eb09d159141f97da0bcd74093f821ab50e092be8
parent eb4b5d655a77bd84deb9062dd86e1a381082fd2e
author John Doe <john.doe@example.com> 1542739074 +0100
committer John Doe <john.doe@example.com> 1542739343 +0100

Start implementation of app

A more thorough description of all the changes that I made in this commit.

The first line references a tree. This tree represents the root of your repository at the time of the commit.

The second line is a reference to the parent commit. This is what this commit is based on, and this is what allows you to go back in time to the previous commit. Since the parent commit will contain a reference to its own parent commit, you can go back all the way to the very first commit in the repository³.

The third and fourth lines, the author and the committer, will be the same most of the time. If you have ever cherry-picked a commit, you have created a commit in which this was not the case. When cherry-picking, the author of the commit stays the same, but the committer will always be you.

The rest of the commit is the commit message.
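As a rough illustration of this layout, the following Python sketch splits the text of a commit into its header fields and its message. It is a simplification: `parse_commit` is a made-up helper, it does not depend on the order of the header lines, and it ignores multi-line headers that real commits may carry (signatures, for example).

```python
def parse_commit(text):
    """Split a commit's text into header fields and the message."""
    # The first blank line separates the headers from the commit message.
    header, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)  # zero, one or several parents
        else:
            fields[key] = value
    fields["message"] = message
    return fields


example = (
    "tree eb09d159141f97da0bcd74093f821ab50e092be8\n"
    "parent eb4b5d655a77bd84deb9062dd86e1a381082fd2e\n"
    "author John Doe <john.doe@example.com> 1542739074 +0100\n"
    "committer John Doe <john.doe@example.com> 1542739343 +0100\n"
    "\n"
    "Start implementation of app\n"
)

commit = parse_commit(example)
print(commit["tree"])
print(commit["parent"])
print(commit["message"].splitlines()[0])
```

Keeping the parents in a list reflects that a commit has zero parents (the very first commit), one parent (an ordinary commit) or several (a merge).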

Commits are full snapshots

Note that since a commit contains a reference to a tree, which in turn contains references to subtrees and blobs, and those subtrees have their subtrees, etc., the commit is a full snapshot of the repository at that point in time.

In general, when tracking changes, you have two ways to represent them: either you save these snapshots, like Git does, or you save the changes between two commits.

In the former case, the difference between two commits has to be calculated (and this is what Git does when you use git diff). In the latter case, you already have the differences available, but getting the state of the repository for a specific point in time requires calculation: all changes in all previous commits have to be applied on top of each other to get to the final state. So that's a trade-off, and Git uses the former. For file size concerns and performance when diffing this might seem like a bad choice, but in the next blog post we will learn about how Git mitigates these issues.
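The trade-off can be sketched in a few lines of Python. A snapshot is modelled here as a plain mapping from path to blob hash, which is a simplification of the tree structure described earlier. The function names and the abbreviated hashes are invented for the example.

```python
def diff_snapshots(old, new):
    """Snapshot storage: the difference has to be calculated on demand."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    # A file changed if its path exists in both snapshots with another hash.
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return added, removed, modified


def replay(deltas):
    """Delta storage: the state has to be rebuilt by applying every change."""
    state = {}
    for delta in deltas:
        for path in delta.get("delete", []):
            state.pop(path, None)
        state.update(delta.get("set", {}))
    return state


first = {"README": "a5c1966", "src/app.py": "1f7a7a4"}
second = {"README": "a5c1966", "src/app.py": "83baae6", "LICENSE": "2769292"}

added, removed, modified = diff_snapshots(first, second)
print("added:", added)
print("removed:", removed)
print("modified:", modified)

# The same history stored as deltas: reading the latest state means replaying.
deltas = [{"set": first}, {"set": {"src/app.py": "83baae6", "LICENSE": "2769292"}}]
print(replay(deltas) == second)
```

Because identical content always gets the same hash, comparing two snapshots only requires comparing hashes, not file contents.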

Don't forget tags

The final type of object is the annotated tag. When you create an annotated tag, you write a message. This message will be stored along with a reference to the object you are tagging⁴ and some other data.

object 039960550b55fe07a41a9f1218b6624a4eed951f
type commit
tag 1.0.0
tagger John Doe <john.doe@example.com> 1543401469 +0100

1.0.0

There will be a lightweight tag that references this annotated tag. In the next part of this series we will go into more detail about what annotated tags and lightweight tags actually are, and how they differ.

Putting it all together

So using just blobs, trees and commits, we already have a versioned file system! Let's get back to the root of that repository we just created (run cd .. if you are still inside the .git folder) and see what happens when we create a commit. First, we'll create a new file and add it to the repository.

$ echo 'Hello, world' > README
$ git add README

When we inspect the objects folder, we can see that just adding the file has already created an object:

$ tree -fi --noreport .git/objects
.git/objects
.git/objects/a5
.git/objects/a5/c19667710254f835085b99726e523457150e03
.git/objects/info
.git/objects/pack

Our new object has the hash a5c19667710254f835085b99726e523457150e03. The a5 folder, named after the first two characters of the hash, was created because some file systems have issues with a lot of files in a single directory, so it is better to split objects over multiple folders. And believe me, most repositories will have a lot of objects.

We can use git cat-file to show the type and contents of the object.

$ git cat-file -t a5c19667710254f835085b99726e523457150e03
blob
$ git cat-file -p a5c19667710254f835085b99726e523457150e03
Hello, world
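For the curious, the file in the a5 folder can also be read without Git. The Python sketch below assumes the common loose-object layout of a SHA-1 repository: a zlib-compressed file containing the type, the size, a NUL byte and then the contents. `object_path`, `write_loose_object` and `read_loose_object` are names invented for this example, and the sketch writes into a temporary folder so that it does not touch a real repository.

```python
import hashlib
import os
import tempfile
import zlib


def object_path(objects_dir, sha):
    # The first two hex digits name the folder, the remaining 38 the file.
    return os.path.join(objects_dir, sha[:2], sha[2:])


def write_loose_object(objects_dir, obj_type, content):
    """Store content in the assumed loose-object format and return its hash."""
    data = f"{obj_type} {len(content)}".encode() + b"\0" + content
    sha = hashlib.sha1(data).hexdigest()
    path = object_path(objects_dir, sha)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return sha


def read_loose_object(objects_dir, sha):
    """Rough equivalent of git cat-file -t and -p for a blob."""
    with open(object_path(objects_dir, sha), "rb") as f:
        data = zlib.decompress(f.read())
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match the contents")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    sha = write_loose_object(objects_dir, "blob", b"Hello, world\n")
    obj_type, content = read_loose_object(objects_dir, sha)
    print(sha, obj_type, content)
```

If the assumptions about the format hold, the printed hash should be the same one Git reported for README above. Pointing `read_loose_object` at the objects folder of a real repository should work for loose objects, but not for objects that Git has moved into the pack folder.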

So this object is a blob containing the contents of the file we added. Let's continue with our first commit:

$ git commit -m 'Initial commit'
[master (root-commit) f816d47] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README

And when we look at the objects, we can see that two more objects were created:

$ tree -fi --noreport .git/objects
.git/objects
.git/objects/60
.git/objects/60/85225d73e7636ca5ab1b271392ffb967839a3b
.git/objects/a5
.git/objects/a5/c19667710254f835085b99726e523457150e03
.git/objects/f8
.git/objects/f8/16d47858b48530e59b15db4eb8a340959d0af6
.git/objects/info
.git/objects/pack

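Note that hashes can be different on your system. The blob and the tree only depend on the contents, but the commit also contains your name, your email address and the time of the commit. Let's look at the first new object: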

$ git cat-file -t 6085225d73e7636ca5ab1b271392ffb967839a3b
tree
$ git cat-file -p 6085225d73e7636ca5ab1b271392ffb967839a3b
100644 blob a5c19667710254f835085b99726e523457150e03    README

That looks like it is a tree. As you can see, it contains the name of the file in this specific directory as well as a reference to the blob object that was already created when we added the file to the repository using git add.

That last object must be the commit, and if we use git cat-file we can see this is true:

$ git cat-file -t f816d47858b48530e59b15db4eb8a340959d0af6
commit
$ git cat-file -p f816d47858b48530e59b15db4eb8a340959d0af6
tree 6085225d73e7636ca5ab1b271392ffb967839a3b
author John Doe <john.doe@example.com> 1556703988 +0200
committer John Doe <john.doe@example.com> 1556703988 +0200

Initial commit

It also has a reference to the correct tree. This commit has no reference to a parent commit, since it is the very first commit.

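So we've got these three objects in our object database, which are connected to each other: the commit f816d47 references the tree 6085225, which in turn references the blob a5c1966 under the name README.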

Conclusion

This is what is stored in the object database that Git uses to store any kind of data. It is a lot simpler than you might have expected and as we will see in the next blog post, it is also very powerful and efficient.

In the next blog post, we will take a look at how Git comes up with these hashes for objects. We'll also take a look at what branches and tags are.

¹ As long as a version of a file is committed, you are very unlikely to lose it, so there is little to be afraid of. You might lose a reference to a commit, but you can often use git reflog or git fsck to recover it.
² To learn more about submodules, check out the chapter on submodules in the Git book.
³ There can be multiple commits in a repository without a parent commit. Check out the docs on git checkout --orphan if you want to do this.
⁴ Despite the fact that almost every tag points to a commit (or an annotated tag that points to a commit), this is not strictly necessary. A tag can also point to blobs and trees, so you might use tags to keep a reference to a blob containing whatever data you like.