This article was originally written by Julie Kent on the Honeybadger Developer Blog.
If you're like me and have less than fifteen years of software engineering experience, the thought of a world without Git doesn't seem possible. When I started to research for this post, I almost fell out of my chair when I read that Git was created in 2005. It doesn't seem that long ago ... either that, or I'm simply getting old. :) Regardless, I often find myself being scared of certain Git commands. Do I rebase, or do I merge? What is the use case for a force push? There have definitely been a few occasions when a wrong Git command turned into a big deal. So, I decided to bite the bullet and learn what is going on under that magical hood.
A brief history
Git is a distributed version control system, which means that every developer works with a full local copy of the repository, usually alongside a shared central repo on a server. Before distributed systems, Subversion (SVN) was a popular way to manage code version control. Unlike Git, it is centralized rather than distributed. With SVN, your data is stored on a central server, and any time you check it out, you're checking out a single version of the repository.
While most of us remember Git as the first distributed version control system, before Git, there was BitKeeper, a proprietary source control management system. Created in 1998, BitKeeper was spun up to solve some of the growing pains of Linux. It offered a free license for open-source projects, with the stipulation that developers could not create a competing tool while using BitKeeper or for one additional year afterward. I'm sure you can guess what happened. In the early-to-mid 2000s, there was a plethora of license complaints, and in 2005, the free version of BitKeeper was removed. This prompted Linus Torvalds to create Git, which he named after a British slang word that means "unpleasant person." Linus Torvalds turned the project over to Junio Hamano (a major contributor) after its original v0.99 release, and Junio remains the core maintainer of the project. Fun Fact: At the time of writing, the most recent version of Git is 2.28, released on July 27th, 2020.
If you want to read more about BitKeeper, which is no longer being developed, check out its Wikipedia page.
What is Git, really?
While Git has morphed into a full-fledged version control management system, this wasn't the original intent. Linus Torvalds said the following on this topic:
In many ways, you can just see Git as a filesystem -- it's content-addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have zero interest in creating a traditional SCM (source control management) system.
Side note: In case you're wondering what "content-addressable" means, it is a way to store information, so it can be retrieved based on content rather than location. Most traditional local and networked storage devices are location addressed.
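To make the idea concrete, here is a toy sketch of a content-addressable store in shell. It is not Git's actual storage format, and it assumes the sha1sum utility is available (on macOS, shasum is the usual substitute). The store directory is just a temporary folder made up for the example.

```shell
set -eu

store=$(mktemp -d)   # hypothetical store location, not a Git directory
src=$(mktemp)
printf 'this is me\n' > "$src"

# The key is derived from the content; we never pick a name ourselves.
key=$(sha1sum "$src" | cut -d ' ' -f 1)
cp "$src" "$store/$key"

# Retrieval needs only the key.
cat "$store/$key"

# Identical content always produces the same key.
key_again=$(printf 'this is me\n' | sha1sum | cut -d ' ' -f 1)
echo "$key"
echo "$key_again"
```

The two printed keys match, which is why a store like this never needs to keep the same content twice.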
Git has two data structures:
- a mutable index (i.e., a connection point between the object database and the working tree) and
- an immutable, append-only object database.
Git defines four types of objects, plus a packfile format for storing them compactly:
- blob: this is the content of a file
- tree: this is the equivalent of a directory
- commit: this links tree objects together to form a history
- tag: this is a container that contains a ref to another object, as well as other metadata
- packfile: this is a zlib-compressed bundle of various other objects; it is a storage format rather than an object type of its own
Each object has a unique name, which is a SHA-1 hash of its contents.
To better understand how all of this fits together, let's create a dummy project directory and run git init.
Trying it out
Open your terminal, and create a new directory. Then, run git init. You should then see something similar to the following output:
➜ Documents mkdir understanding-git
➜ understanding-git git init
Initialized empty Git repository in /Users/juliekent/Documents/understanding-git/.git/
➜ understanding-git git:(master)
I am sure you have done this many times but may not have really cared to know what was actually in the newly created .git directory. Let's check it out. If you run ls -a via your terminal, you will see the .git directory. By default, it is a hidden directory, which is why you need the -a flag. Run cd .git to move into the directory, and then run ls. You should see something like this:
➜ .git git:(master) ls
HEAD config description hooks info objects refs
We will be focusing on the HEAD file and the objects and refs directories. We will also run some commands so that we have an index file, but this will come later. The description file is only used by the GitWeb program. The config file is pretty straightforward, as it contains project configuration options. The info directory keeps an exclude file for ignore patterns that you don't want to track in a .gitignore file; I'm sure most of you are familiar with .gitignore.
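If you're curious how the exclude file behaves, here is a small sketch you can run in a throwaway directory. The scratch/ folder and its pattern are made up for illustration; the point is that the pattern lives in .git/info/exclude rather than in a committed .gitignore.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository for the experiment
cd "$repo"
git init --quiet

# Add a hypothetical ignore pattern that stays local to this clone.
mkdir -p .git/info
echo 'scratch/' >> .git/info/exclude

mkdir scratch
echo 'temporary notes' > scratch/notes.txt

# Reports which file and pattern caused the path to be ignored.
git check-ignore -v scratch/notes.txt

# The status is empty because the only new file is ignored.
status=$(git status --porcelain)
echo "status: [$status]"
```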
The objects directory
Let's start with the objects directory. To see what is created, go back to the project root (cd ..) and run find .git/objects. You should see the following:
➜ understanding-git git:(master) find .git/objects
.git/objects
.git/objects/pack
.git/objects/info
Next, let's create a file:
echo 'this is me' > myfile.txt
This creates a file named myfile.txt containing "this is me".
Now, let's run the command git hash-object -w myfile.txt. The -w flag tells Git to write the object into its object database rather than only print the hash.
Your output should be a 40-character mix of numbers and letters that looks random -- this is a SHA-1 checksum hash. If you're not familiar with SHA-1, you can read more here.
Next, copy your SHA-1, and run the following command:
git cat-file -p (insert your SHA here)
You should see "this is me", the contents of your file that was created. Cool! This is how content-addressable Git objects work; you can think of it as a key-value store where the key is the SHA-1, and the value is the contents.
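Where does that key come from? As far as I understand it, for a blob Git hashes a small header (the word blob, the size in bytes, and a null byte) followed by the file contents. Here's a sketch that rebuilds the hash by hand and compares it with Git's answer. It assumes sha1sum is installed and that your Git uses the default SHA-1 object format.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt

# What Git reports for the file (no -w, so nothing is stored).
git_hash=$(git hash-object myfile.txt)

# Rebuild it by hand: "blob <size>" + null byte + content.
size=$(wc -c < myfile.txt | tr -d ' ')
manual_hash=$( { printf 'blob %s\0' "$size"; cat myfile.txt; } | sha1sum | cut -d ' ' -f 1 )

echo "git:    $git_hash"
echo "manual: $manual_hash"
```

If the two lines match, the key really is nothing more than a hash of the header plus the content.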
Let's write some new content to our original file:
echo 'this is not me' > myfile.txt
Then, run the hash-object command again:
git hash-object -w myfile.txt
You now have two unique SHA-1s, one for each version of this file. If you want further proof, run find .git/objects -type f, and you should see both via your terminal window.
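The paths that find prints follow a pattern: the first two characters of the hash become a directory name, and the rest becomes the file name. Here is a short sketch, run in a throwaway repository, that builds that path from the hash and then asks Git about the object. The files are zlib-compressed, so cat-file is the way to read them.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt
hash=$(git hash-object -w myfile.txt)

# Split the hash into the directory (2 characters) and file name (the rest).
dir=$(printf '%s' "$hash" | cut -c 1-2)
file=$(printf '%s' "$hash" | cut -c 3-)
object_path=".git/objects/$dir/$file"
ls "$object_path"

git cat-file -t "$hash"   # object type: blob
git cat-file -s "$hash"   # content size in bytes: 11
git cat-file -p "$hash"   # the content itself
```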
If you'd like to learn more about how other objects in Git work, I recommend following this tutorial.
The refs directory
Let's move on to refs. When running find .git/refs, you should see the following output:
➜ understanding-git git:(master) ✗ find .git/refs
.git/refs
.git/refs/heads
.git/refs/tags
As we saw in the previous section, Git creates a unique SHA-1 hash for each object. Of course, we could run all of our Git commands using each object's hash, for example, git show 123abcd, but this is unreasonable and would require us to remember the hash of every object.
Refs to the rescue! A reference is simply a file stored in .git/refs containing the hash of a commit object. Let's go ahead and commit our myfile.txt, so we can better understand how refs work. Go ahead and run git add myfile.txt and git commit -m 'first commit'. You should see something like this:
➜ understanding-git git:(master) ✗ git add myfile.txt
➜ understanding-git git:(master) ✗ git commit -m 'first commit'
[master (root-commit) 40235ba] first commit
1 file changed, 1 insertion(+)
create mode 100644 myfile.txt
Now, let's navigate to the .git/refs/heads directory by running cd .git/refs/heads. From there, run cat master. You should see the SHA-1. Finally, run git log -1 master, which should output something similar to the following:
commit Unique SHA-1 (HEAD -> master)
Author: Julie <jkent2910@gmail.com>
Date: Mon Aug 3 15:59:59 2020 -0500
first commit
Cool! As you can see, branches are simply references. When we change the location of the master branch, all Git has to do is change the contents of the refs/heads/master file. Likewise, creating a new branch creates a new reference file with the commit hash.
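You can watch this happen in a throwaway repository. In the sketch below, the branch name feature and the commit identity are placeholders I made up. The default branch name is read from Git rather than assumed, since it may be master or main depending on your configuration.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt
git add myfile.txt
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name='Example User' -c user.email='user@example.com' \
    -c commit.gpgsign=false commit --quiet -m 'first commit'

branch=$(git symbolic-ref --short HEAD)   # master or main
commit=$(git rev-parse HEAD)

# Creating a branch just writes one more small file under refs/heads.
git branch feature

cat ".git/refs/heads/$branch"
cat .git/refs/heads/feature
```

Both files contain the same commit hash, because the new branch starts at the commit we were already on.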
Helpful hint: If you ever want to see all references, run git show-ref.
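A ref only gets you as far as a commit. From there, the object types listed earlier link together: the commit points to a tree, and the tree points to blobs. Here's a sketch that follows that chain in a throwaway repository, using a placeholder commit identity.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt
git add myfile.txt
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name='Example User' -c user.email='user@example.com' \
    -c commit.gpgsign=false commit --quiet -m 'first commit'

git show-ref                 # every ref and the hash it holds

git cat-file -t HEAD         # commit
git cat-file -p HEAD         # shows the tree hash, author, and message

tree=$(git rev-parse 'HEAD^{tree}')
git cat-file -p "$tree"      # one entry: mode, type, blob hash, file name

blob=$(git rev-parse HEAD:myfile.txt)
git cat-file -p "$blob"      # the file content
```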
Sooooo, what is HEAD?!
HEAD is a symbolic reference. You might wonder, when running git branch <branch>, how Git knows the SHA-1 of the last commit. Well, the HEAD file is usually a symbolic reference to your current branch. You might be thinking to yourself, "You keep saying symbolic; what does that mean?" Great question! Symbolic means that it contains a pointer to another reference. If your head is spinning, I'm with you. It took me quite a bit of Googling and reading to finally understand what exactly HEAD is. Here is a great analogy, pulled from this website:
A good analogy would be a record player and the playback and record keys on it as the HEAD. As the audio starts recording, the tape moves ahead, moving past the head by recording onto it. The stop button stops the recording while still pointing to the point it last recorded, and the point that record head stopped is where it will continue to record again when Record is pressed again. If we move around, the head pointer moves to different places; however, when Record is pressed again, it starts recording from the point the head was pointing to when Record was pressed.
Go ahead and run cat .git/HEAD, and you should see something like this:
➜ understanding-git git:(master) cat .git/HEAD
ref: refs/heads/master
This makes sense because we are on the master branch. HEAD is, essentially, going to be the reference to the last commit in the currently checked-out branch; the exception is a detached HEAD, where the file holds a commit hash directly.
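To see HEAD move, you can try the following sketch in a throwaway repository. The branch name feature and the commit identity are placeholders. The last step detaches HEAD, so you can compare what the file holds when it points at a branch and when it points straight at a commit.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt
git add myfile.txt
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name='Example User' -c user.email='user@example.com' \
    -c commit.gpgsign=false commit --quiet -m 'first commit'

cat .git/HEAD                      # ref: refs/heads/<your default branch>

git checkout --quiet -b feature
on_feature=$(cat .git/HEAD)
echo "$on_feature"                 # ref: refs/heads/feature

git checkout --quiet --detach
detached=$(cat .git/HEAD)
echo "$detached"                   # a bare commit hash, no "ref:" prefix

# symbolic-ref fails when HEAD is detached, so handle that case explicitly.
git symbolic-ref -q HEAD || echo 'HEAD is detached'
```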
Helpful Tip: You can run git diff HEAD to view the difference between HEAD and the working directory.
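It took me a while to see how this differs from a plain git diff, so here is a sketch that puts the variants side by side. It uses a throwaway repository with a placeholder commit identity, then makes one staged change and one unstaged change.

```shell
set -eu

repo=$(mktemp -d)   # throwaway repository
cd "$repo"
git init --quiet
echo 'this is me' > myfile.txt
git add myfile.txt
# Placeholder identity so the commit works on a machine with no Git config.
git -c user.name='Example User' -c user.email='user@example.com' \
    -c commit.gpgsign=false commit --quiet -m 'first commit'

echo 'staged change' >> myfile.txt
git add myfile.txt
echo 'unstaged change' >> myfile.txt

echo '--- git diff --cached (index vs. HEAD)'
git diff --cached
echo '--- git diff (working directory vs. index)'
git diff
echo '--- git diff HEAD (working directory vs. HEAD)'
git diff HEAD
```

Only the last command shows both added lines, because it compares the working directory with the last commit and skips over the index.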
Wrapping up
We have covered a lot in this post! We've learned a bit of fun history regarding how Git came about and examined the main plumbing that makes all of the magic happen! If you want to continue to dive deeper into Git, as well as better understand how some of the common commands work, I highly recommend the book titled "Pro Git", which is available for free here.