konrad_126

Posted on Apr 23, 2020 • Edited on Apr 27, 2020 • Originally published at Medium

Understanding Git - Data Model

#git #begginers

As software developers, we use a lot of tools. Many of them come with an intuitive User Interface which abstracts the internal complexity of the domain they are operating on. This allows us to go fast and learn those tools by using them.

For better or worse, git is not one of those tools. The abstraction git's User Interface gives you is very leaky, so to become better with git you must spend (some) time learning how it works internally.

In this Understanding Git series, we will cover git's internals (don't worry, we will not go into git's source code). First on that list is git's heart and soul: the data model.

It's all about .git

We'll start by initializing a git repository:

git init

Git tells us it has created a .git directory in our project’s directory so let’s take a quick peek into it:

$ tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 14 files

Some of these files and directories may sound familiar (particularly HEAD), but for now focus on the .git/objects directory. Right now it contains no objects, only the empty info and pack subdirectories, but we will change that in a moment.

Let’s add an index.php file

touch index.php

fill it with some content

<?php
echo "Hello World";

and a README.md file

touch README.md

with some content as well:

# Description
This is my hello world project

Now let’s stage and commit them:

git add .
git commit -m "Initial Commit"

OK, nothing special here, adding and committing — we’ve all “been there, done that”.

If we look back at the .git directory we can see that the .git/objects directory now contains some files and subdirectories (bear in mind they will have different names on your computer!):

├── objects
│   ├── 5d
│   │   └── 92c127156d3d86b70ae41c73973434bf4bf341
│   ├── a6
│   │   └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
│   ├── cf
│   │   └── 59e02c3d2a2413e2da9e535d3c116af1077906
│   ├── f8
│   │   └── 9e64bdfcc08a8b371ee76a74775cfe096655ce
│   ├── info
│   └── pack

Every object in git has a so-called checksum header (the unique identifier of an object, by default a SHA-1 hash written as 40 hexadecimal characters). The first two characters of that checksum are used as a directory name, while the remaining 38 are used as the file (object) name. Let's look at what these objects are.
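
The naming convention can be sketched in a few lines of Python. This is only an illustration: loose_object_path is a hypothetical helper name, not part of git, and it only builds the path without checking that the object exists.

```python
from pathlib import Path


def loose_object_path(git_dir: str, object_id: str) -> Path:
    """Return the path where a loose object with this ID would be stored."""
    # The first two hex characters name the directory, the rest name the file.
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


# The commit ID from this post.
print(loose_object_path(".git", "a6dbf05551541dc86b7a49212b62cfe1e9bb14f2").as_posix())
```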

Blobs, trees, and ...

The first kind of object that git creates when we commit some file(s) are blob objects. Git uses them to represent the content of files. In our case there are two of them, one for each file we committed.

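They contain the full content of our files, so you can think of them as snapshots of our files (at the time of the commit). To generate the checksum header, git takes the content of an object, prefixes it with the object's type and size, and feeds the result to a hashing function (SHA-1 by default); the output is the checksum header. Because the checksum is derived from the content, identical content always yields the same checksum, which is why it can also serve as the unique identifier of an object.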
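
To make the hashing step concrete, here is a minimal Python sketch of how a blob ID can be computed. It assumes a repository using the default SHA-1 object format; hash_object is just an illustrative function name that mirrors what the git hash-object command reports.

```python
import hashlib


def hash_object(content: bytes, obj_type: str = "blob") -> str:
    """Compute the ID git would assign to an object with this content."""
    # Git hashes a short header ("<type> <size>" plus a NUL byte)
    # followed by the raw content.
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


# The same content always produces the same ID.
print(hash_object(b"test content\n"))
print(hash_object(b"test content\n") == hash_object(b"test content\n"))
```

If git is installed, the result can be compared with the output of git hash-object for a file with the same content.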

The next kind of object git creates is the tree object. Tree objects represent the project's folder structure, and in our simple example git needs only one. It contains a list of all files in our project with pointers to their blob objects.
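
As a rough sketch of what such a list looks like on disk: each tree entry is stored as the file mode, the file name, a NUL byte, and the raw 20-byte ID of the object it points to. The function names below are made up for illustration, SHA-1 is assumed, and the simple sort is only correct when every entry is a file (git orders directory entries slightly differently).

```python
import hashlib


def tree_payload(entries):
    """Serialize (mode, name, object_id_hex) tuples into a tree body."""
    body = b""
    # Sorting by raw name bytes is enough here because all entries are files.
    for mode, name, object_id in sorted(entries, key=lambda e: e[1].encode()):
        body += f"{mode} {name}\0".encode() + bytes.fromhex(object_id)
    return body


def hash_tree(entries):
    """Compute the tree ID: same header-plus-content hashing as for blobs."""
    body = tree_payload(entries)
    return hashlib.sha1(f"tree {len(body)}\0".encode() + body).hexdigest()


# The two blob IDs from this post.
entries = [
    ("100644", "index.php", "5d92c127156d3d86b70ae41c73973434bf4bf341"),
    ("100644", "README.md", "cf59e02c3d2a2413e2da9e535d3c116af1077906"),
]
print(hash_tree(entries))
```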

Lastly, git creates a commit object. It contains some metadata (author, time, ...) and a pointer to its tree object.
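
A commit object's content is plain text: a tree line, optional parent lines, author and committer lines, a blank line, and the message. The sketch below assembles such a payload; commit_payload is a hypothetical helper, and it simplifies things by using the same identity and time for author and committer and by ignoring optional headers such as signatures.

```python
def commit_payload(tree_id, author, timestamp, tz, message, parents=()):
    """Build the text content of a commit object (simplified)."""
    lines = [f"tree {tree_id}"]
    # The first commit has no parent line; later commits have one or more.
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


payload = commit_payload(
    "f89e64bdfcc08a8b371ee76a74775cfe096655ce",
    "zspajich <zspajich@gmail.com>",
    1516710703,
    "+0100",
    "Initial Commit",
)
print(payload.decode())
```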

The content of our .git/objects directory should make more sense now:

├── objects
│   ├── 5d
│   │   └── 92c127156d3d86b70ae41c73973434bf4bf341
│   ├── a6
│   │   └── dbf05551541dc86b7a49212b62cfe1e9bb14f2
│   ├── cf
│   │   └── 59e02c3d2a2413e2da9e535d3c116af1077906
│   ├── f8
│   │   └── 9e64bdfcc08a8b371ee76a74775cfe096655ce
│   ├── info
│   └── pack

Using git log we can see our commit history:

commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2
Author: zspajich <zspajich@gmail.com>
Date:   Tue Jan 23 13:31:43 2018 +0100

    Initial Commit

By knowing the naming convention we mentioned earlier, we can locate this commit object in .git/objects:

├── objects
│   ├── a6
│   │   └── dbf05551541dc86b7a49212b62cfe1e9bb14f2

To display its content we can't simply use the cat command, since these files are compressed rather than plain text, but git has a cat-file command we can use instead:
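
For the curious, loose object files are zlib-compressed, so they can also be decoded by hand. The sketch below is an illustration rather than a replacement for cat-file: decode_loose_object is a made-up name, and the example builds a compressed blob in memory so it runs without a repository. In a real repository you would pass in the bytes read from a file under .git/objects.

```python
import zlib


def decode_loose_object(raw: bytes):
    """Split a zlib-compressed loose object into (type, size, content)."""
    data = zlib.decompress(raw)
    # The header and the content are separated by the first NUL byte.
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    return obj_type, int(size), content


# Build a compressed blob in memory, the way git would store index.php.
content = b'<?php\necho "Hello World";\n'
raw = zlib.compress(f"blob {len(content)}\0".encode() + content)

obj_type, size, decoded = decode_loose_object(raw)
print(obj_type, size)
print(decoded.decode())
```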

git cat-file commit a6dbf05551541dc86b7a49212b62cfe1e9bb14f2

Now we see the content of the commit object:

tree f89e64bdfcc08a8b371ee76a74775cfe096655ce
author zspajich <zspajich@gmail.com> 1516710703 +0100
committer zspajich <zspajich@gmail.com> 1516710703 +0100

Initial Commit

The first line is a pointer to a tree object, and to examine its content we can use the git ls-tree command:

git ls-tree f89e64bdfcc08a8b371ee76a74775cfe096655ce

As expected it does contain a list of our files with pointers to blob objects:

100644 blob cf59e02c3d2a2413e2da9e535d3c116af1077906 README.md
100644 blob 5d92c127156d3d86b70ae41c73973434bf4bf341 index.php
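
Under the hood the tree is stored in a compact binary form, and ls-tree turns it into the readable listing above. Here is a hedged sketch of that decoding step, assuming 20-byte SHA-1 IDs; parse_tree is an illustrative name, and the body is built in memory from the two entries in this post.

```python
def parse_tree(body: bytes):
    """Decode a raw tree body into (mode, name, object_id_hex) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)   # end of the mode
        nul = body.index(b"\0", space)  # end of the file name
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        object_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes follow the NUL
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries


# A tree body assembled from the listing in this post.
body = (
    b"100644 README.md\0" + bytes.fromhex("cf59e02c3d2a2413e2da9e535d3c116af1077906")
    + b"100644 index.php\0" + bytes.fromhex("5d92c127156d3d86b70ae41c73973434bf4bf341")
)
for mode, name, object_id in parse_tree(body):
    print(mode, object_id, name)
```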

Let's look into the blob object representing index.php using the cat-file command:

git cat-file blob 5d92c127156d3d86b70ae41c73973434bf4bf341

Sure enough, it has the same content as our index.php file:

<?php
echo "Hello World";

There you go. Now you know what happens when committing files.

Let's see what happens if we now edit (add some code magic) and commit our index.php file.

Git creates another blob object (a new snapshot) to represent the new content of index.php. As for README.md, since it didn't change, git can reuse the existing blob for it (we'll see it in a moment).

When creating the tree object, git updates the pointer for index.php to its new blob while the README.md pointer stays the same.

As before, the commit object has a pointer to the tree object, but it also has a pointer to its parent commit object, because every commit except the first one has at least one parent.
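
The blob reuse and the parent pointer can be imitated with a tiny in-memory object store. This is a toy model, not git's real format: the tree and commit payloads are simplified text, and store, put and snapshot are names invented for this sketch. Only the content-addressing idea (ID = hash of type, size and content) is taken from git.

```python
import hashlib

store = {}  # object ID -> (type, payload), standing in for .git/objects


def put(obj_type: str, payload: bytes) -> str:
    header = f"{obj_type} {len(payload)}\0".encode()
    object_id = hashlib.sha1(header + payload).hexdigest()
    store[object_id] = (obj_type, payload)  # identical content, identical key
    return object_id


def snapshot(files, parent=None):
    """Store one blob per file, one flat tree and one commit."""
    tree = {name: put("blob", data) for name, data in files.items()}
    tree_text = "\n".join(f"{name} {oid}" for name, oid in sorted(tree.items()))
    tree_id = put("tree", tree_text.encode())
    commit_text = f"tree {tree_id}\n" + (f"parent {parent}\n" if parent else "")
    return put("commit", commit_text.encode()), tree


first, tree1 = snapshot({"README.md": b"# Description\n", "index.php": b"<?php\n"})
second, tree2 = snapshot(
    {"README.md": b"# Description\n", "index.php": b"<?php // magic\n"},
    parent=first,
)

print(tree1["README.md"] == tree2["README.md"])  # unchanged file: same blob ID
print(tree1["index.php"] == tree2["index.php"])  # edited file: different blob ID
print(store[second][1].decode())                 # second commit names its parent
```

After both snapshots the store holds seven objects: three blobs (one README.md, two versions of index.php), two trees and two commits.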

Now that we know how git handles file adding and editing, the only thing remaining is file deletion. What if we delete our index.php file?

It's rather simple: the new commit's tree object no longer contains the file entry (the filename with a pointer to its blob object) for index.php. Existing objects are never modified, so the blob object representing index.php is still there in .git/objects; it is just no longer referenced by the latest tree.
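
In the same toy style, deletion can be pictured like this. Plain dictionaries stand in for trees and for the object store, and add_blob is a made-up helper; the point is only that the old tree and the old blob stay untouched.

```python
import hashlib

blobs = {}  # object ID -> content, standing in for .git/objects


def add_blob(content: bytes) -> str:
    object_id = hashlib.sha1(f"blob {len(content)}\0".encode() + content).hexdigest()
    blobs[object_id] = content
    return object_id


old_tree = {
    "README.md": add_blob(b"# Description\n"),
    "index.php": add_blob(b'<?php\necho "Hello World";\n'),
}

# The commit that deletes index.php gets a new tree without that entry;
# the old tree is left as it was.
new_tree = {name: oid for name, oid in old_tree.items() if name != "index.php"}

print(sorted(new_tree))                # entries in the new tree
print(old_tree["index.php"] in blobs)  # is the old blob still stored?
```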

Nested folders

In real-life projects, the folder structure is much more elaborate than in this simple example. As mentioned, tree objects represent the folder structure of your project, and just as folders can be nested, tree objects can also be nested (that is, point to other tree objects). For example:

Here, our project's base folder has one README.md file and one subdirectory, app, which has two files (app.php and app_dev.php).
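
One way to picture nested trees is as nested dictionaries, where an entry points either to a blob or to another tree. The sketch below walks such a structure recursively; the blob IDs are placeholders and list_paths is a hypothetical helper, so treat it as a model of the idea rather than of git's storage format.

```python
def list_paths(tree, prefix=""):
    """Recursively collect the file paths reachable from a tree."""
    paths = []
    for name, (kind, target) in sorted(tree.items()):
        if kind == "tree":
            # A tree entry points to another tree: descend into it.
            paths += list_paths(target, prefix + name + "/")
        else:
            paths.append(prefix + name)
    return paths


# The example project: README.md plus an app subdirectory with two files.
project = {
    "README.md": ("blob", "placeholder-blob-id-1"),
    "app": ("tree", {
        "app.php": ("blob", "placeholder-blob-id-2"),
        "app_dev.php": ("blob", "placeholder-blob-id-3"),
    }),
}

print(list_paths(project))
```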

So there you have it - git's data model. Blob objects represent the content of files, tree objects represent the folder structure of the project, while commit objects contain metadata and have pointers to their tree and to their parents.

In the next post, we'll take a look at branching - what branches are and why having a bunch of them is very cheap in git.

Latest comments (3)

datadr1ven • Nov 22 '24 • Edited

This is a very clear and useful description of rudimentary but fundamental git internals, thank you

Madis Nõmme • May 13 '20

This is pretty much copy-pasted from medium.com/hackernoon/https-medium... have some decency man.

konrad_126 • May 13 '20

Yes, it is because I am the author of that blog post man :)

I have refreshed the text a bit and images also, but yes it is the same post. I added a canonical_url property to the post, I don't know if it is a custom here to further emphasize the fact it is a re-post?