Maroun Maroun

Posted on Sep 28, 2020

A Short Journey to Git Internals

#git

In this article, we’ll dive into Git internals by going through a real example. If you don’t have your terminal open already, do so, fasten your seatbelts, and let’s go! 💨

Initializing a Git repository

You have probably already initialized an empty Git project using git init, but have you ever wondered what this command actually does?

Let’s create an empty folder, and initialize an empty Git project. From the official Git documentation, git init:

This command creates an empty Git repository — basically a .git directory with subdirectories for objects, refs/heads, refs/tags, and template files. An initial HEAD file that references the HEAD of the master branch is also created.

If we inspect the folder’s content, we’ll see the following structure:

$ tree -L 1 .git/
.git/
├── HEAD
├── config
├── description
├── hooks
├── info
├── objects
└── refs

Git is a key-value datastore

At its core, Git is a content-addressable filesystem. Huh? 🤔 OK, put simply, Git is a key-value database. You insert any kind of content into a Git repository, and Git gives you back a unique identifier (a key) that you can use to retrieve that content later.

Git uses the hash-object command to store values into the database:

Computes the object ID value for an object with specified type with the contents of the named file (which can be outside of the work tree), and optionally writes the resulting object into the object database. Reports its object ID to its standard output. When <type> is not specified, it defaults to “blob”.

A “blob” is nothing but a sequence of bytes. A Git blob contains exactly the same data as a file, but it lives in the Git key-value data store, whereas the “actual” file lives on the file system. Note that the blob holds only the content, not the file name.

Let’s create a blob:

$ echo hello | git hash-object --stdin -w
ce013625030ba8dba906f756967f9e9ca394464a

The --stdin flag tells hash-object to read the content from standard input instead of from a file. The -w flag actually writes the object into the object database; without it, the command only computes and prints the key.
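
To see where the key comes from, here is a minimal Python sketch, assuming the repository uses Git's default SHA-1 object format. Git hashes a short header (the object type, a space, the content length in bytes, and a NUL byte) followed by the content itself. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git would assign to a blob holding `content` (SHA-1 repos)."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# `echo hello` writes the five letters plus a trailing newline.
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

The same content always yields the same key, which is what “content-addressable” means in practice.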

The string “hello” (plus the trailing newline that echo appends) is the value in the Git data store, and the hash printed by the hash-object command is its key. We can now do the opposite operation and read the value by its key, using the git cat-file command:

$ git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
hello

We can check its type using the -t flag:

$ git cat-file -t ce013625030ba8dba906f756967f9e9ca394464a
blob

git hash-object stores the data in the .git/objects/ folder (AKA the object database). Let’s verify:

$ tree .git/objects/
.git/objects/
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

The file name under the “ce” directory is the key we got back from hash-object minus its first two characters; those two characters became the name of the parent directory. Why? The commonly cited reason is that some file systems limit how many entries a single directory can hold, or slow down when a directory gets very large. Spreading the objects over subdirectories mitigates that problem.

Let’s save another object:

$ echo world | git hash-object --stdin -w
cc628ccd10742baea8241c5924df992b5c019f71

As expected, we’ll now have two directories under .git/objects/:

$ tree .git/objects/
.git/objects/
├── cc
│   └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

Again, the cc folder is named after the key’s first two characters, and the file inside it is named after the rest of the key.
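
The layout above can be imitated in a few lines of Python. This is a sketch under stated assumptions, not Git's implementation: it assumes SHA-1 keys and “loose” (unpacked) objects, which Git stores as the zlib-compressed header plus content. The helper names write_loose_object and read_loose_object are hypothetical, and the sketch writes to a temporary directory rather than a real .git folder.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store `content` the way a loose object is laid out and return its key."""
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    key = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    path = objects_dir / key[:2] / key[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return key


def read_loose_object(objects_dir: Path, key: str) -> Tuple[str, bytes]:
    """Return (type, content), roughly what cat-file -t and -p report."""
    raw = zlib.decompress((objects_dir / key[:2] / key[2:]).read_bytes())
    header, _, content = raw.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, content


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    key = write_loose_object(objects, "blob", b"hello\n")
    print(key[:2], key[2:])                  # directory name, file name
    print(read_loose_object(objects, key))   # ('blob', b'hello\n')
```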

Tree Objects 🌲

The next Git object we’ll investigate is the tree. Blobs store only content, so this type solves the problem of storing file names, and it allows storing a group of files together.
A tree object contains entries. Each entry holds the SHA-1 of a blob or of another tree (a subtree), along with its mode, type, and file name. Let’s check the git-mktree documentation:

Reads standard input in non-recursive ls-tree output format, and creates a tree object. The order of the tree entries is normalized by mktree so pre-sorting the input is not required. The object name of the tree object built is written to the standard output.

If you’re wondering about ls-tree output format, it looks like:

<mode> SP <type> SP <object> TAB <file>
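
As a quick illustration of that format, the following Python sketch splits one such line into its four fields. The TreeEntry type and the parse_ls_tree_line function are names invented for this example; it assumes the plain (unquoted) form of the line shown above.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    kind: str        # "blob" or "tree"
    object_id: str
    name: str


def parse_ls_tree_line(line: str) -> TreeEntry:
    """Split '<mode> SP <type> SP <object> TAB <file>' into its fields."""
    # Everything after the single TAB is the file name, which may contain spaces.
    meta, name = line.rstrip("\n").split("\t", 1)
    mode, kind, object_id = meta.split(" ")
    return TreeEntry(mode, kind, object_id, name)


entry = parse_ls_tree_line(
    "100644 blob ce013625030ba8dba906f756967f9e9ca394464a\thello.txt\n"
)
print(entry.mode, entry.kind, entry.name)  # 100644 blob hello.txt
```

Splitting on the TAB first is the important design choice: it keeps file names with spaces intact.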

Let’s now associate the two blobs above:

$ printf '%s %s %s\t%s\n' \
    100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt \
    100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt |
  git mktree
88e38705fdbd3608cddbe904b67c731f3234c45b

mktree returns a key for the newly created tree object.
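
Under the hood, a tree is stored in a compact binary form rather than as the text lines we fed to mktree. The Python sketch below builds that form for a flat list of files, assuming SHA-1 object IDs: each entry is the mode, a space, the file name, a NUL byte, and the 20 raw bytes of the object ID. The function names are hypothetical, and the simple sort by name is only adequate when there are no subtrees (Git orders directory names slightly differently). If the sketch follows the format correctly, the ID it prints should match the one mktree printed above.

```python
import hashlib
from typing import List, Tuple


def build_tree_body(entries: List[Tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex object ID) entries into a tree object's body."""
    body = b""
    # Entries are stored sorted by name; a byte-wise sort is enough for files only.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return body


def object_id(obj_type: str, body: bytes) -> str:
    """Hash the '<type> <size>' header, a NUL byte, and the body with SHA-1."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


body = build_tree_body([
    ("100644", "world.txt", "cc628ccd10742baea8241c5924df992b5c019f71"),
    ("100644", "hello.txt", "ce013625030ba8dba906f756967f9e9ca394464a"),
])
print(object_id("tree", body))
```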

At this point, we can visualize our tree as follows:

        88e38705fdbd3608cddbe904b67c731f3234c45b
                          |
             +------------+------------+
             |                         |
         hello.txt                 world.txt
        ce013625030b              cc628ccd1074

Let’s view the tree’s content:

$ git cat-file -p 88e38705fdbd3608cddbe904b67c731f3234c45b
100644 blob ce013625030ba8dba906f756967f9e9ca394464a hello.txt
100644 blob cc628ccd10742baea8241c5924df992b5c019f71 world.txt

And of course, the .git/objects directory was updated accordingly:

$ tree .git/objects/
.git/objects/
├── 88
│   └── e38705fdbd3608cddbe904b67c731f3234c45b
├── cc
│   └── 628ccd10742baea8241c5924df992b5c019f71
├── ce
│   └── 013625030ba8dba906f756967f9e9ca394464a
├── info
└── pack

So far we have not updated our index. To do so, we use the git-read-tree command:

Reads the tree information given by <tree-ish> into the index, but does not actually update any of the files it “caches”. (see: git-checkout-index[1])

$ git read-tree 88e38705fdbd3608cddbe904b67c731f3234c45b
$ git ls-files -s
100644 ce013625030ba8dba906f756967f9e9ca394464a 0 hello.txt
100644 cc628ccd10742baea8241c5924df992b5c019f71 0 world.txt

Note, however, that we still don’t have the files on our file system, since we’ve been writing values directly to the Git datastore. To “check out” the files, we’ll use the git-checkout-index command, which copies files from the index to the working tree:

$ git checkout-index -a

The -a flag stands for “all”: it checks out every file in the index. Now we should be able to see our files:

$ ls
hello.txt world.txt
$ cat hello.txt
hello
$ cat world.txt
world
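
To tie the pieces together, here is a toy model of the whole flow in Python. It is only a sketch: the object database is a plain dictionary, the index is a dictionary from path to (mode, key), and the names store_blob and checkout_index are invented for this example. Real Git keeps the index in a binary file and also applies file modes, both of which the sketch ignores.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Tuple

objects: Dict[str, bytes] = {}  # toy object database: key -> blob content


def store_blob(content: bytes) -> str:
    """Add a blob to the toy database and return its key (like hash-object -w)."""
    header = f"blob {len(content)}".encode("ascii") + b"\0"
    key = hashlib.sha1(header + content).hexdigest()
    objects[key] = content
    return key


def checkout_index(index: Dict[str, Tuple[str, str]], worktree: Path) -> None:
    """Copy every indexed blob into the working tree (like checkout-index -a)."""
    for path, (_mode, key) in index.items():  # the mode is ignored in this sketch
        target = worktree / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(objects[key])


# The index maps each path to its mode and blob key, as `ls-files -s` showed.
index = {
    "hello.txt": ("100644", store_blob(b"hello\n")),
    "world.txt": ("100644", store_blob(b"world\n")),
}

with tempfile.TemporaryDirectory() as tmp:
    checkout_index(index, Path(tmp))
    print(sorted(p.name for p in Path(tmp).iterdir()))  # ['hello.txt', 'world.txt']
    print((Path(tmp) / "hello.txt").read_bytes())       # b'hello\n'
```

Until checkout_index runs, the content exists only in the objects dictionary, which mirrors why the files were not visible on disk before the checkout step.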

Summary 📝

In this article, we stored two files directly in the Git datastore as blobs. At that point, the files weren’t yet visible on our local file system. We then created a tree that references the two blobs, read that tree into the index with git-read-tree, and finally brought the files into our working directory using the git-checkout-index command.
