In this article, we’ll dive into Git internals by going through a real example. If you don’t have your terminal open already, do so, fasten your seatbelts, and let’s go! 💨
Initializing a Git repository
You have probably already initialized an empty Git project using git init, but have you ever wondered what this command does?
Let’s create an empty folder and initialize an empty Git project. From the official Git documentation:
This command creates an empty Git repository - basically a .git directory with subdirectories for objects, refs/heads, refs/tags, and template files. An initial HEAD file that references the HEAD of the master branch is also created.
If we inspect the folder’s content, we’ll see the following structure:
$ tree -L 1 .git/
.git/
├── HEAD
├── config
├── description
├── hooks
├── info
├── objects
└── refs
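To poke at these pieces without touching a real project, a throwaway repository is handy. The following is a minimal sketch that assumes git is installed and on the PATH; the variable names repo and head_content are purely illustrative.

```shell
# Create a scratch repository in a temporary directory (assumes git is on PATH).
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# HEAD is a plain text file holding a symbolic reference to the current branch.
# The branch name depends on the Git version and the init.defaultBranch setting.
head_content=$(cat .git/HEAD)
echo "$head_content"

# List what the object database contains right after initialization.
ls .git/objects
```

Because the repository lives in a temporary directory, it can simply be deleted once you are done experimenting.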
Git is a key-value datastore
At its core, Git is a content-addressable filesystem. Huh? 🤔 OK, put simply, Git is a key-value database. You can insert any kind of content into a Git repository, and Git gives you back a unique identifier (a key) that you can later use to retrieve that content.
Git uses the hash-object command to store values into the database:
Computes the object ID value for an object with specified type with the contents of the named file (which can be outside of the work tree), and optionally writes the resulting object into the object database. Reports its object ID to its standard output. When <type> is not specified, it defaults to “blob”.
A “blob” is nothing but a sequence of bytes. A Git blob contains exactly the same data as a file, but it’s stored in the Git key-value data store, while the “actual” file is stored on the file system.
Let’s create a blob:
$ echo hello | git hash-object --stdin -w
ce013625030ba8dba906f756967f9e9ca394464a
We used the -w flag to actually write the object into the object database rather than only print its key, and the --stdin flag to read the content from standard input instead of from a file.
The string “hello” (plus the trailing newline that echo appends) is the “value” in the Git data store, and the hash returned by hash-object is our key. We can now do the opposite operation and read the value by its key, using the cat-file command with the -p flag:
$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello
We can check its type using the -t flag:
$ git cat-file -t ce013625030ba8dba906f756967f9e9ca394464a
blob
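The key is not arbitrary. With Git's default SHA-1 object format, it is the SHA-1 of a short header (the type, a space, the content length in bytes, and a NUL byte) followed by the content itself. The sketch below reproduces the two keys used in this article without calling Git at all; it assumes the sha1sum utility from GNU coreutils is available (on macOS, shasum would be the closest equivalent).

```shell
# "echo hello" emits 6 bytes: the five letters plus a trailing newline.
# Git hashes the header "blob 6" + NUL, followed by those 6 bytes.
hello_key=$(printf 'blob 6\000hello\n' | sha1sum | cut -d' ' -f1)
echo "$hello_key"

# Same recipe for the second value stored later in the article.
world_key=$(printf 'blob 6\000world\n' | sha1sum | cut -d' ' -f1)
echo "$world_key"
```

Since the key depends only on the type and the bytes, storing the same content twice always yields the same key, which is what makes the store content-addressable.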
hash-object stores the data in the .git/objects/ folder (AKA the object database). Let’s verify:
$ tree .git/objects/
.git/objects/
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack
The file name under the “ce” directory is the key we got back from hash-object, minus its first two characters. Why? Because the parent folder’s name holds the first two characters of our key. And why split the key like that? Because some file systems cope poorly with a very large number of entries in a single directory. Spreading the objects across up to 256 subdirectories (one per two-character hex prefix) mitigates that problem.
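The mapping from a key to its location on disk can be sketched in a few lines of shell. The variable names here are illustrative only. Note that the file at that path is zlib-compressed, so printing it with cat will not show readable text; use git cat-file for that.

```shell
key=ce013625030ba8dba906f756967f9e9ca394464a

# The first two hex characters name the directory...
dir=$(echo "$key" | cut -c1-2)
# ...and the remaining 38 characters name the file inside it.
file=$(echo "$key" | cut -c3-)

object_path=".git/objects/$dir/$file"
echo "$object_path"
```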
Let’s save another object:
$ echo world | git hash-object --stdin -w
cc628ccd10742baea8241c5924df992b5c019f71
As expected, we’ll now have two directories under .git/objects/:
$ tree .git/objects/
.git/objects/
├── cc
│   └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack
And again, the cc folder, named after the key’s prefix, contains a file named after the rest of the key.
Tree Objects 🌲
The next Git object we’ll investigate is the tree. This type solves the problem of storing the filename and allows storing a group of files together.
A tree object contains entries. Each entry holds the SHA-1 of a blob or of a subtree, together with its associated mode, type, and filename. Let’s check the mktree documentation:
Reads standard input in non-recursive ls-tree output format, and creates a tree object. The order of the tree entries is normalized by mktree so pre-sorting the input is not required. The object name of the tree object built is written to the standard output.
If you’re wondering about the ls-tree output format, it looks like:
<mode> SP <type> SP <object> TAB <file>
Let’s now associate the two blobs above:
$ printf '%s %s %s\t%s\n' \
    100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt \
    100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt | git mktree
88e38705fdbd3608cddbe904b67c731f3234c45b
mktree returns a key for the newly created tree object.
At this point, we can visualize our tree as follows:
    88e38705fdbd3608cddbe904b67c731f3234c45b
                      |
        +-------------|------------+
        |                          |
        |                          |
        |                          |
      hello                      world
  ce013625030b                cc628ccd1074
Let’s view the tree’s content:
$ git cat-file -p 88e38705fdbd3608cddbe904b67c731f3234c45b
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71    world.txt
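The whole tree-building step can be replayed in a scratch repository. The sketch below assumes git is installed and uses illustrative variable names. It also shows the “subtree” case mentioned earlier, by wrapping the first tree in a parent tree under a hypothetical directory name, greetings.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Store the two blobs and keep their keys.
hello=$(echo hello | git hash-object --stdin -w)
world=$(echo world | git hash-object --stdin -w)

# Feed mktree two entries in ls-tree format: <mode> SP <type> SP <object> TAB <file>
tree=$(printf '100644 blob %s\thello.txt\n100644 blob %s\tworld.txt\n' \
    "$hello" "$world" | git mktree)

tree_type=$(git cat-file -t "$tree")
echo "$tree_type"
git ls-tree "$tree"

# A tree can itself be an entry of another tree, with mode 040000 and type "tree".
parent=$(printf '040000 tree %s\tgreetings\n' "$tree" | git mktree)
git ls-tree -r "$parent"
```

Because tree keys are derived from the entries alone, running this with the same file names and contents should print the same tree key every time.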
And of course, .git/objects was updated accordingly:
$ tree .git/objects/
.git/objects/
├── 88
│   └── e38705fdbd3608cddbe904b67c731f3234c45b
├── cc
│   └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack
So far we have not updated our index. To do so, we use the git read-tree command:
Reads the tree information given by <tree-ish> into the index, but does not actually update any of the files it “caches”. (see: git-checkout-index)
$ git read-tree 88e38705fdbd3608cddbe904b67c731f3234c45b
$ git ls-files -s
100644 ce013625030ba8dba906f756967f9e9ca394464a 0    hello.txt
100644 cc628ccd10742baea8241c5924df992b5c019f71 0    world.txt
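The index and tree objects are two views of the same information, and the conversion works in both directions. As a small sketch (assuming git is installed; variable names are illustrative), the following loads a tree into the index with read-tree and then uses git write-tree, a command not covered above, to turn the index back into a tree object.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

hello=$(echo hello | git hash-object --stdin -w)
world=$(echo world | git hash-object --stdin -w)
tree=$(printf '100644 blob %s\thello.txt\n100644 blob %s\tworld.txt\n' \
    "$hello" "$world" | git mktree)

# Load the tree into the index, then list the staged entries.
git read-tree "$tree"
git ls-files -s

# Going the other way: write-tree serializes the current index as a tree object.
roundtrip=$(git write-tree)
echo "$roundtrip"
```

The key printed by write-tree should match the one returned by mktree, since both describe exactly the same entries.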
Note, however, that we still don’t have the files on our file system, since we’ve been writing values directly to the Git datastore. To check out the files, we’ll use the git-checkout-index command, which copies files from the index to the working tree:
$ git checkout-index -a
-a stands for “all”. Now we should be able to see our files:
$ ls
hello.txt world.txt
$ cat hello.txt
hello
$ cat world.txt
world
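checkout-index does not have to be all-or-nothing. The sketch below (assuming git is installed; the export/ directory name is hypothetical) first checks out a single file by name, then uses -a together with the --prefix option to write every index entry under a separate directory.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet

hello=$(echo hello | git hash-object --stdin -w)
world=$(echo world | git hash-object --stdin -w)
tree=$(printf '100644 blob %s\thello.txt\n100644 blob %s\tworld.txt\n' \
    "$hello" "$world" | git mktree)
git read-tree "$tree"

# Check out one named file from the index; world.txt stays index-only for now.
git checkout-index -- hello.txt
ls

# -a checks out every entry; --prefix prepends a path (note the trailing slash).
git checkout-index -a --prefix=export/
ls export
```

Writing to a prefixed directory is a convenient way to export a snapshot of the index without disturbing the files already in the working tree.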
In this article, we stored two files directly in the Git datastore. At that point, the files weren’t yet visible in our working directory. We then created a tree and associated the two “blobs” with it, loaded that tree into the index, and finally brought the files into our working directory using the git-checkout-index command.