Honeybadger Staff for Honeybadger

Posted on Apr 20, 2021 • Originally published at honeybadger.io

How Does Git Work?

This article was originally written by Julie Kent on the Honeybadger Developer Blog.

If you're like me and have less than fifteen years of software engineering experience, a world without Git is hard to imagine. When I started researching this post, I almost fell out of my chair when I read that Git was created in 2005. It doesn't seem that long ago ... either that, or I'm simply getting old. :) Regardless, I often find myself being scared of certain Git commands. Do I rebase, or do I merge? What is the use case for a force push? There have definitely been a few occasions when a wrong Git command turned into a big deal. So, I decided to bite the bullet and learn what is going on under that magical hood.

A brief history

Git is a distributed version control system, which means that every developer has a full local copy of the repository, including its history, rather than relying solely on a central server. Before distributed systems became common, Subversion (SVN) was a popular way to manage versions of code. Unlike Git, it is centralized rather than distributed. With SVN, the history is stored on a central server, and any time you check out code, you get a working copy of a single version of the repository.

While most of us remember Git as the first distributed version control system, before Git, there was BitKeeper, a proprietary source control management system. Created in 1998, BitKeeper was spun up to solve some of the growing pains of Linux. It offered a free license for open-source projects, with the stipulation that developers could not create a competing tool while using BitKeeper and for one year afterward. I'm sure you can guess what happened. In the early-to-mid 2000s, there was a plethora of license complaints, and in 2005, the free version of BitKeeper was removed. This prompted Linus Torvalds to create Git, which he named after a British slang word that means "unpleasant person." Linus Torvalds turned the project over to Junio Hamano (a major contributor) after its original v0.99 release, and Junio remains the core maintainer of the project. Fun Fact: At the time of writing, the most recent version of Git was 2.28, released on July 27th, 2020.

If you want to read more about BitKeeper, which is no longer being developed, check out the Wikipedia page here.

What is Git, really?

While Git has morphed into a full-fledged version control management system, this wasn't the original intent. Linus Torvalds said the following on this topic:

In many ways, you can just see Git as a filesystem -- it's content-addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have zero interest in creating a traditional SCM (source control management) system.

Side note: In case you're wondering what "content-addressable" means, it is a way to store information so that it can be retrieved based on its content rather than its location. Most traditional local and networked storage devices are location addressed.
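To make the idea concrete, here is a minimal sketch of a content-addressable store in Python. It is not how Git itself is implemented; the ContentStore class and its put and get methods are hypothetical names made up for this illustration. The only point is that the key is computed from the value.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is the SHA-1 of the data, not a location chosen by the caller.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key = store.put(b"this is me\n")
print(store.get(key))

# Storing identical content again yields the same key, so nothing is duplicated.
print(store.put(b"this is me\n") == key)
```

Because the key depends only on the content, two people storing the same bytes end up with the same key without having to coordinate.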

Git has two data structures:

a mutable index (i.e., a connection point between the object database and the working tree) and
an immutable, append-only object database.

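There are four types of objects:

blob: this is the content of a file.
tree: this is the equivalent of a directory; it lists blobs and other trees by name.
commit: this points to a tree and to its parent commit(s), which is how snapshots are linked together to form a history.
tag: this is a container that holds a reference to another object, as well as other metadata.

Objects can also be bundled into a packfile, which is not an object type itself but a storage format that holds many objects in zlib-compressed (and often delta-compressed) form.

Each object has a unique name, which is the SHA-1 hash of its contents prefixed with a short header giving the object's type and size. Identical content therefore always gets the same name.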

To better understand how all of this fits together, let's create a dummy project directory and run git init.

Trying it out

Open your terminal, and create a new directory. Then, run git init. You should then see something similar to the following output:

➜  Documents mkdir understanding-git
➜  Documents cd understanding-git
➜  understanding-git git init
Initialized empty Git repository in /Users/juliekent/Documents/understanding-git/.git/
➜  understanding-git git:(master)

I am sure you have done this many times but may not have looked at what is actually in the newly created .git directory. Let's check it out. If you run ls -a in your terminal, you will see the .git directory. It is hidden by default, which is why you need the -a flag. Run cd .git to move into the directory, and then run ls. You should see something like this:

➜  .git git:(master) ls
HEAD        config      description hooks       info        objects     refs

We will be focusing on the HEAD file and the objects and refs directories. We will also run some commands that create an index file, but this will come later. The description file is only used by the GitWeb program. The config file is pretty straightforward, as it contains project configuration options. The info directory keeps a repository-wide exclude file for ignored patterns that you don't want to track in a .gitignore file; I'm sure most of you are familiar with .gitignore.

The objects directory

Let's start with the objects directory. From the project root (run cd .. if you are still inside .git), run find .git/objects to see what was created. You should see the following:

➜  understanding-git git:(master) find .git/objects
.git/objects
.git/objects/pack
.git/objects/info

Next, let's create a file:

echo 'this is me' > myfile.txt

This creates a file named myfile.txt containing this is me.

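Now, let's run the command git hash-object -w myfile.txt. The -w flag tells Git to store the object in its database rather than only print the hash.

Your output should be a 40-character string of numbers and letters; this is the SHA-1 hash of the file's contents (plus a short header that Git adds). If you're not familiar with SHA-1, you can read more here.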
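If you are curious where that hash comes from, the sketch below reproduces the calculation in Python. It assumes a repository using the default SHA-1 object format; git_blob_hash is a hypothetical helper name for this illustration, not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the name Git gives a blob holding `content` (SHA-1 repositories)."""
    # Git hashes a small header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a trailing newline, so myfile.txt holds b"this is me\n".
print(git_blob_hash(b"this is me\n"))
```

The printed value should match what git hash-object reported for your file, as long as the file contains exactly those bytes.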

Next, copy your SHA-1, and run the following command:

git cat-file -p (insert your SHA here)

You should see "this is me", the contents of your file that was created. Cool! This is how content-addressable Git objects work; you can think of it as a key-value store where the key is the SHA-1, and the value is the contents.

Let's write some new content to our original file:

echo 'this is not me' > myfile.txt

Then, run the hash-object command again:

git hash-object -w myfile.txt

You now have two objects, each with its own SHA-1, one for each version of this file. If you want further proof, run find .git/objects -type f, and you should see both in your terminal window.
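The files that find lists are "loose" objects. As a rough sketch of what hash-object -w and cat-file -p do for blobs, the Python below writes and reads objects in a temporary directory instead of a real repository. The function names write_blob and read_blob are hypothetical, and the sketch ignores packfiles and every object type other than blobs.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(objects_dir: Path, content: bytes) -> str:
    """Store content the way a loose blob is laid out and return its hex name."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    # The first two hex characters name a subdirectory; the rest name the file.
    path = objects_dir / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha


def read_blob(objects_dir: Path, sha: str) -> bytes:
    """Rough equivalent of `git cat-file -p <sha>` for a loose blob."""
    raw = zlib.decompress((objects_dir / sha[:2] / sha[2:]).read_bytes())
    _header, _, body = raw.partition(b"\0")
    return body


# A temporary stand-in for .git/objects, so no real repository is touched.
objects = Path(tempfile.mkdtemp()) / "objects"
first = write_blob(objects, b"this is me\n")
second = write_blob(objects, b"this is not me\n")

print(read_blob(objects, first))   # b'this is me\n'
print(first != second)             # True: each version gets its own object
```

Notice that writing the second version did not overwrite the first one; both remain retrievable by their hashes.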

If you'd like to learn more about how other objects in Git work, I recommend following this tutorial.

The refs directory

Let's move on to refs. When running find .git/refs, you should see the following output:

➜  understanding-git git:(master) ✗ find .git/refs
.git/refs
.git/refs/heads
.git/refs/tags

As we saw in the previous section, Git creates a unique SHA-1 hash for each object. Of course, we could run all of our Git commands using each object's hash (for example, git show 123abcd), but this is unreasonable and would require us to remember the hash of every object.

Refs to the rescue! A reference is simply a file stored in .git/refs containing the hash of a commit object. Let's commit our myfile.txt, so we can better understand how refs work. Run git add myfile.txt and git commit -m 'first commit'. You should see something like this:

➜  understanding-git git:(master) ✗ git add myfile.txt
➜  understanding-git git:(master) ✗ git commit -m 'first commit'
[master (root-commit) 40235ba] first commit
 1 file changed, 1 insertion(+)
 create mode 100644 myfile.txt

Now, let's navigate to the .git/refs/heads directory by running cd .git/refs/heads. From there, run cat master. You should see the SHA-1 of the commit you just created. Finally, run git log -1 master, which should output something similar to the following:

commit Unique SHA-1 (HEAD -> master)
Author: Julie <jkent2910@gmail.com>
Date:   Mon Aug 3 15:59:59 2020 -0500

   first commit

Cool! As you can see, branches are simply references. When we change the location of the master branch, all Git has to do is change the contents of the refs/heads/master file. Likewise, creating a new branch creates a new reference file with the commit hash.
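Here is a small Python sketch of that idea, using a temporary directory as a stand-in for .git. The helpers update_ref and read_ref are hypothetical names, and the hashes are obvious placeholders rather than real commits. Real Git can also store refs in a packed-refs file, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def update_ref(git_dir: Path, ref: str, sha: str) -> None:
    """Point a ref such as refs/heads/master at a commit by rewriting its file."""
    path = git_dir / ref
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(sha + "\n")


def read_ref(git_dir: Path, ref: str) -> str:
    return (git_dir / ref).read_text().strip()


git_dir = Path(tempfile.mkdtemp())   # stand-in for .git, not a real repository
first_commit = "a" * 40              # placeholder hash
second_commit = "b" * 40             # placeholder hash

update_ref(git_dir, "refs/heads/master", first_commit)

# "Creating a branch" is just writing a new file holding the current commit hash.
update_ref(git_dir, "refs/heads/feature", read_ref(git_dir, "refs/heads/master"))

# "Moving a branch" is just rewriting the contents of its file.
update_ref(git_dir, "refs/heads/master", second_commit)

print(read_ref(git_dir, "refs/heads/master") == second_commit)   # True
print(read_ref(git_dir, "refs/heads/feature") == first_commit)   # True
```

This is why creating a branch in Git is so cheap: no files are copied, and only a 41-byte reference file is written.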

Helpful hint: If you ever want to see all references, run git show-ref.

Sooooo, what is HEAD?!

HEAD is a symbolic reference. You might wonder, when running git branch <branch>, how Git knows the SHA-1 of the last commit. Well, the HEAD file is usually a symbolic reference to your current branch. You might be thinking to yourself, "You keep saying symbolic; what does that mean?" Great question! Symbolic means that it contains a pointer to another reference. If your head is spinning, I'm with you. It took me quite a bit of Googling and reading to finally understand what exactly HEAD is. Here is a great analogy, pulled from this website:

A good analogy would be a record player and the playback and record keys on it as the HEAD. As the audio starts recording, the tape moves ahead, moving past the head by recording onto it. The stop button stops the recording while still pointing to the point it last recorded, and the point that record head stopped is where it will continue to record again when Record is pressed again. If we move around, the head pointer moves to different places; however, when Record is pressed again, it starts recording from the point the head was pointing to when Record was pressed.

Go ahead and run: cat .git/HEAD, and you should see something like this:

➜  understanding-git git:(master) cat .git/HEAD
ref: refs/heads/master

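This makes sense because we are on the master branch. HEAD is, essentially, going to be the reference to the last commit in the currently checked-out branch. The exception is a "detached HEAD," where you check out a specific commit rather than a branch, and the HEAD file holds that commit's hash directly.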
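To see what "symbolic" means in practice, here is a minimal Python sketch that follows HEAD to a commit hash. It runs against a temporary directory standing in for .git; resolve_head is a hypothetical helper name, the hash is a placeholder, and the sketch ignores packed refs, which real Git also consults.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit hash that HEAD currently resolves to."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic: HEAD names another ref, and that ref's file holds the hash.
        ref_name = head[len("ref: "):]
        return (git_dir / ref_name).read_text().strip()
    # Otherwise HEAD holds a commit hash directly.
    return head


# Hypothetical stand-in for a .git directory, built in a temp folder.
fake_git = Path(tempfile.mkdtemp())
fake_sha = "b" * 40  # placeholder, not a real commit hash
(fake_git / "refs" / "heads").mkdir(parents=True)
(fake_git / "refs" / "heads" / "master").write_text(fake_sha + "\n")
(fake_git / "HEAD").write_text("ref: refs/heads/master\n")

print(resolve_head(fake_git) == fake_sha)  # True
```

Committing on a branch therefore never needs to touch the HEAD file itself; only the file that HEAD points at changes.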

Helpful Tip: You can run git diff HEAD to view the difference between HEAD and the working directory.

Wrapping up

We have covered a lot in this post! We've learned a bit of fun history regarding how Git came about and examined the main plumbing that makes all of the magic happen! If you want to continue to dive deeper into Git, as well as better understand how some of the common commands work, I highly recommend the book titled "Pro Git", which is available for free here.
