Make original Git by Rust! (Analyze section)

Hello Dev community!

I'm noshishi, an apprentice engineer in Tokyo.
This article is about understanding Git from the inside by creating a simple program that can add and commit.

But it's a very long story, so I'll post the development section separately!


The starting point is 'If I could understand git, I could make it?!!'

I took this opportunity to try out a new programming language, so I decided to try Rust this time. The repository I actually created is My original git nss. The quality of the code isn't quite there yet and many parts are still incomplete, but it can already handle a straightforward local development flow!

If you give me a star, I'll be happy to fly, and of course I'll be waiting for your contributions! Feel free to touch this repository any way you like!

Please forgive us for not being able to explain some of the details in this article alone. Also, we use Rust for development, but Python for the stage where we uncover Git's internals!


Git Inside

First, we will unpack how Git handles data, based on the official documentation.
The Git command system is very complex.
But Git's data structure is very simple!

Where is repository

A repository is the directory under the control of Git, and the folder .git in the directory created by init or clone is the actual state of the repository.

Let's put an empty folder called project under Git's control.

$ pwd
$ ls -a
# nothing yet
$ git init
Initialized empty Git repository in /home/noshishi/project/.git/
$ ls -a
This .git directory consists of the following.

├── HEAD
├── (index)  // Not created by `init`!
├── config
├── objects/
└── refs/
    ├── heads/
    └── tags/
The path types of Git repositories are difficult to understand at first glance. We have added / to the directory path so that you can refer to it. Also, we have omitted parts that are not explained in this article.


Git manages versions with file data called objects.

Objects are stored in .git/objects.


There are four types of objects: blob, tree, commit, and tag.

The contents of each and the corresponding data will be as follows.

  • blob ... File data
  • tree ... Directory data
  • commit ... Metadata to manage the tree of the repository
  • tag ... Metadata for a specific commit * Not explained in this article.

Image with first.txt in the project repository


The Object is FILE DATA, so it has a file name (path) and the data stored in it, just like a normal file.

File name (path)
The file name (path) is a 40-character string. This is a hash (SHA-1, see footnote 2) of the object data.

Actually, the first two are the directory path and the remaining 38 are the file path.

Object data is compressed by zlib (see footnote 1). The decompressed data consists of two parts: header and content. The two parts are separated by \0 (null byte).

header is a combination of the object type and the size of content.

content contains the corresponding data in an easy-to-handle format, as indicated by the type. (Later we will see the details).
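The layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not Git's actual implementation, and `hash_object` and `object_path` are helper names I made up for this sketch.

```python
import hashlib
import zlib


def hash_object(content: bytes, obj_type: str = "blob") -> tuple[str, bytes]:
    """Return (SHA-1 hex digest, zlib-compressed object data)."""
    # header = "<type> <size of content>", then a null byte, then the content
    header = f"{obj_type} {len(content)}".encode()
    store = header + b"\0" + content
    # the file name is the SHA-1 of the *uncompressed* header + content
    sha = hashlib.sha1(store).hexdigest()
    return sha, zlib.compress(store)


def object_path(sha: str) -> str:
    """The first 2 hex characters are the directory, the remaining 38 the file."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


sha, compressed = hash_object(b"Hello World!\nThis is first.txt.")
print(len(sha))                     # 40
print(zlib.decompress(compressed))  # b'blob 31\x00Hello World!\nThis is first.txt.'
print(object_path(sha))
```

The hash is computed before compression, so the name of an object depends only on its type and content, not on the zlib settings.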

How to Create blob Object

Index (staging area)

The actual index used when you add is a file .git/index.


The index stores data of files marked by add with meta information. The stored data contains the latest file data at the time of add.

It is important to note that all data recorded in the index is in file data units.
I will describe meta information in detail later, but the storage format is exactly defined as shown in index-format.

Hmm.... feel sleepy....


Let's actually analyze the object and the index!

Analyze Object

Before starting the analysis work, create all of the blob, tree, and commit.
Just add the files in project and commit.

Create the following two files...


Hello World!
This is first.txt.
def second():
    print("This is")
next, add and commit.

git add -A
git commit -m 'initial'

Then the contents of .git/objects are now as follows.

└── objects/
    ├── 48/
    |   └── c972ae2bb5652ada48573daf6d27c74db5a13f
    ├── af/
    |   └── 22102d62f1c8e6df5217b4cba99907580b51af
    ├── da/
    |   └── f3f26f3fa03da346999c3e02d5268cb9abc5c5
    └── f7/
        └── f18b17881d80bb87f281c2881f9a4663cfcf84
*From now on, hash values in the text are abbreviated to their first 7 characters (see footnote 3).

The corresponding data and hash values for each are summarized below.

hash value   object   corresponding data
f7f18b1      blob     first.txt
af22102      blob
daf3f26      tree     project directory
48c972a      commit   commit version 1

*The analysis work will be conducted interactively using Python, an interpreted language.


blob is an object corresponding to file data.
The image looks like this.


First, let's look at f7f18b1, which corresponds to first.txt.

...Oops, I failed.

% python
>>> with open('.git/objects/f7/f18b17881d80bb87f281c2881f9a4663cfcf84', 'r') as f:
...     content = f.read()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 3: invalid continuation byte
Since the content is compressed, attempting to read it as-is as a string (see footnote 4) fails.
Therefore, we read the content as binary.

>>> with open('.git/objects/f7/f18b17881d80bb87f281c2881f9a4663cfcf84', 'rb') as f: # read binary!
...     content = f.read()
>>> content
This time the read succeeds and we get a byte string.

Now, decompress the content with zlib, as described in the official documentation.

>>> import zlib
>>> decompressed = zlib.decompress(content)
>>> decompressed
b'blob 31\x00Hello World!\nThis is first.txt.'
>>> decompressed.split(b'\0')
[b'blob 31', b'Hello World!\nThis is first.txt.']
We found that a blob consists of the following elements

  • header ... blob 31
  • Null byte ... \x00 ※hex notation
  • content ... Hello World!\nThis is first.txt.
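The steps so far (read as binary, decompress, split into header and content) can be wrapped into one small function. This is a sketch under assumptions: `read_object` is a name I chose, and it only handles loose objects stored as individual files under objects/, not objects that Git has packed.

```python
import zlib
from pathlib import Path


def read_object(git_dir: str, sha: str) -> tuple[str, int, bytes]:
    """Read a loose object and return (type, size, content)."""
    path = Path(git_dir) / "objects" / sha[:2] / sha[2:]
    raw = zlib.decompress(path.read_bytes())
    # split only at the FIRST null byte: the content itself may contain \0
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    return obj_type, int(size), content
```

In the repository of this article, `read_object('.git', 'f7f18b17881d80bb87f281c2881f9a4663cfcf84')` should give back the type blob, the size 31, and the text of first.txt.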

File name

We should check whether the hash value of the object is indeed correct.

The file name of the object should be the value obtained by hashing decompressed with the hash function sha1, so check it.

>>> import hashlib
>>> blob = b'blob 31\x00Hello World!\nThis is first.txt.'
>>> sha1 = hashlib.sha1(blob).hexdigest()
>>> sha1
Great, exact match!

How about another file

Let's also look at af22102, which corresponds to the other file.

>>> with open('.git/objects/af/22102d62f1c8e6df5217b4cba99907580b51af', 'rb') as f:
...     content = f.read()
>>> decompressed = zlib.decompress(content)
>>> decompressed
b'blob 44\x00def second():\n    print("This is")'

>>> blob = b'blob 44\x00def second():\n    print("This is")'
>>> sha1 = hashlib.sha1(blob).hexdigest()
>>> sha1
It can be summarized as follows

  • header ... blob 44
  • Null byte ... \x00
  • content ... def second():\n    print("This is")

And the sha1 values (hash values) derived from the data also matched.

The blob itself does not hold the filename of the corresponding file data.

Instead of blob, the object that manages its name is tree.


tree is an object corresponding to directory data.
The image looks like this.

We will analyze it in the same way as for blob.

>>> with open('.git/objects/da/f3f26f3fa03da346999c3e02d5268cb9abc5c5', 'rb') as f:
...     content = f.read()
>>> decompressed = zlib.decompress(content)
>>> decompressed
b'tree 74\x00100644 first.txt\x00\xf7\xf1\x8b\x17\x88\x1d\x80\xbb\x87\xf2\x81\xc2\x88\x1f\x9aFc\xcf\xcf\x84100644\x00\xaf"\x10-b\xf1\xc8\xe6\xdfR\x17\xb4\xcb\xa9\x99\x07X\x0bQ\xaf'
>>> decompressed.split(b'\0')
[b'tree 74',
 b'100644 first.txt',
A tree has multiple entries, so it looks a bit more complicated.

The tree content is a repetition of mode (see footnote 5), path, and hash, which are meta information about the data in the directory.

If you simply split on \0, the hash value of one entry and the meta information (mode and path) of the next entry end up stuck together.

This is because \0 only separates an entry's meta information from its hash. Nothing separates the hash from the entry that follows it.

First, we will check the data stored in the first one.
Looking at the split, it seems first.txt is stored here, right?

>>> temp = decompressed.split(b'\0')
>>> temp[1]
b'100644 first.txt'
>>> temp[2]

To split temp[2] correctly, let's take out the first 20 bytes.
A byte string can be sliced by byte position.

>>> temp[2][0:20]
>>> temp[2][0:20].hex()
>>> temp[2][20:]
Repeating the same process revealed the following.

  • header ... tree 74
  • Null byte ... \x00
  • content1 ... 100644 first.txt\x00f7f18b1...
  • content2 ... 100644\x00af22102...

The management of tree hashes is described in (Digression) deciphering Tree bytes!
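Instead of splitting on \0 and fixing things up by hand, the tree content can be walked entry by entry. The following is a rough sketch (the function name `parse_tree` is my own): find the next null byte, then take exactly 20 bytes for the hash.

```python
def parse_tree(content: bytes) -> list[tuple[str, str, str]]:
    """Split tree content (without the header) into (mode, path, hash) entries."""
    entries = []
    pos = 0
    while pos < len(content):
        # "<mode> <path>" ends at the next null byte
        nul = content.index(b"\0", pos)
        mode, path = content[pos:nul].split(b" ", 1)
        # the hash is the next 20 raw bytes, with no separator after it
        raw_hash = content[nul + 1:nul + 21]
        entries.append((mode.decode(), path.decode(), raw_hash.hex()))
        pos = nul + 21
    return entries


first = b"100644 first.txt\x00" + bytes.fromhex("f7f18b17881d80bb87f281c2881f9a4663cfcf84")
print(parse_tree(first))
# [('100644', 'first.txt', 'f7f18b17881d80bb87f281c2881f9a4663cfcf84')]
```

Reading the hash by length rather than by separator also avoids trouble when a raw hash happens to contain a 0x00 byte.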

A tree may contain not only blobs but also other trees.
That is the case when there is a directory within a directory.
This is because a tree, like a blob, does not keep its own name. The name of a directory is recorded in the tree that contains it.


commit contains the tree of the repository directory with meta information.
The image looks like this.

Let's analyze!

>>> with open('.git/objects/48/c972ae2bb5652ada48573daf6d27c74db5a13f', 'rb') as f:
...     content = f.read()
>>> decompressed = zlib.decompress(content)
>>> decompressed
b'commit 188\x00tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5\nauthor nopeNoshishi <> 1674995860 +0900\ncommitter nopeNoshishi <> 1674995860 +0900\n\ninitial\n'
>>> decompressed.split(b'\0')
[b'commit 188',
 b'tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5\nauthor nopeNoshishi <> 1674995860 +0900\ncommitter nopeNoshishi <> 1674995860 +0900\n\ninitial\n']

# a little bit more
>>> header, content = decompressed.split(b'\0')
>>> header
b'commit 188'
>>> content
b'tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5\nauthor nopeNoshishi <> 1674995860 +0900\ncommitter nopeNoshishi <> 1674995860 +0900\n\ninitial\n'
>>> content.split(b'\n')
[b'tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5',
 b'author nopeNoshishi <> 1674995860 +0900', 
 b'committer nopeNoshishi <> 1674995860 +0900', 
The stored data are as follows.

  • header ... commit 188
  • Null byte ... \x00
  • tree ... tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5
  • author ... author nopeNoshishi <> 167...
  • committer ... committer nopeNoshishi <> 167...
  • message ... initial

You can see that it contains the tree hash value that you saw in the tree chapter earlier, information about the repository owner and the person who made the commit, and the message.
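The manual splitting above can be turned into a tiny parser. This is a sketch with a function name of my own (`parse_commit`), and it assumes a plain commit like the one in this article: every header field fits on one line, which is not true for signed commits, for example.

```python
def parse_commit(content: bytes) -> tuple[list[tuple[str, ...]], str]:
    """Split commit content (without the header) into fields and the message."""
    # a blank line separates the header fields from the message
    head, _, message = content.decode().partition("\n\n")
    # each field line is "<key> <value>"; a list keeps repeated keys,
    # such as several parent lines on a merge commit
    fields = [tuple(line.split(" ", 1)) for line in head.split("\n")]
    return fields, message


content = (
    b"tree daf3f26f3fa03da346999c3e02d5268cb9abc5c5\n"
    b"author nopeNoshishi <> 1674995860 +0900\n"
    b"committer nopeNoshishi <> 1674995860 +0900\n"
    b"\ninitial\n"
)
fields, message = parse_commit(content)
print(fields[0])      # ('tree', 'daf3f26f3fa03da346999c3e02d5268cb9abc5c5')
print(repr(message))  # 'initial\n'
```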

I will go ahead with the commit and analyze it again.
Edit first.txt as follows and add and commit again.


Hello World!
This is first.txt.
git add first.txt
git commit -m 'second'
Then the contents of .git/objects are now as follows.

└── objects/
    ├── 3f/
    |   └── f934272  # new tree .. project repo version 2
    ├── 37/
    |   └── 349c9b0  # new commit .. "second"
    ├── 48/
    |   └── c972ae2  # old commit .. "initial"
    ├── af/
    |   └── 22102d6  # old blob .. version 1
    ├── c8/
    |   └── 843b4db  # new blob .. first.txt version 2
    ├── da/
    |   └── f3f26f3  # old tree .. project repo version 1
    └── f7/
        └── f18b178  # old blob .. first.txt version 1
See the new commit...

>>> with open('.git/objects/37/349c9b05c73281008e7b6b7453b595bb034a52', 'rb') as f:
...     content = f.read()
>>> decompressed = zlib.decompress(content)
>>> decompressed
b'commit 235\x00tree 3ff9342727caf81397740327aa406c1cc6d4408e\nparent 48c972ae2bb5652ada48573daf6d27c74db5a13f\nauthor nopeNoshishi <> 1675174139 +0900\ncommitter nopeNoshishi <> 1675174139 +0900\n\nsecond\n'

The stored data are as follows.

  • header ... commit 235
  • Null byte ... \x00
  • tree ... tree 3ff9342727caf81397740327aa406c1cc6d4408e
  • parent ... parent 48c972ae2bb5652ada48573daf6d27c74db5a13f
  • author ... author nopeNoshishi <> 167...
  • committer ... committer nopeNoshishi <> 167...
  • message ... second

The new commit stored the hash value of the previous version of commit.


Unlike a blob or a tree, a commit does not store the actual data in the repository. Instead, it holds metadata, and everything else can be reached starting from its tree.

Key-Value Store

Some of you may have an idea of what I'm talking about.

If you unravel a commit, you can get a tree, and if you unravel a tree, you can get a blob.


The version flow shows the history because commit knows the hash value of the previous commit.
This image shows the history of the current commit.


So Git manages file versions from the starting point, which is the hash value of the object.

Officially, Git is described as a content-addressable filesystem.
A hash function is a one-way (non-invertible) transformation, so the original data cannot be restored from the hash value. Still, since the hash value (the key) depends entirely on the contents of the object (the value), it might even be called a value-value store.
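To make the idea concrete, here is a toy in-memory store in which the key is derived from the value. It is only an analogy I wrote for this article (`ObjectStore` is a hypothetical class, and there is no header or compression), but it shows why the same content always lands at the same address.

```python
import hashlib


class ObjectStore:
    """A toy in-memory store whose keys are derived from the values."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()  # the key depends only on the value
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()
key = store.put(b"Hello World!")
print(store.get(key))                     # b'Hello World!'
print(store.put(b"Hello World!") == key)  # True: same content, same key
```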


In a world without version control systems like Git, what do you do when you want to keep your current files and work on something new with the same files?
Perhaps one way you might think of doing this is to copy the file and put it in another folder.
In fact, this seemingly crude management method is the closest thing to the form of version control that Git is built on.

Git is a storage system that makes clever use of the OS file system.

Analyze Index

The index (staging area) may seem mysterious, but like the objects, its design is very simple.
(On the other hand, it is a bit quirky to analyze. Dismantling the index sucked up dozens of hours...)

I'm going to analyze .git/index as it stands after the second commit.


In order to analyze, we need to understand the design specification of index.

Referring to Index format in the official document, we found the following specifications.

Index Format
    - 4 bytes   Index header                * DIRC
    - 4 bytes   Index version            * basic version 2
    - 32 bits   number of entries in index  * Entries are the meta information for each file.

    - 32 bits   ctime seconds               * last time the file's metadata changed
    - 32 bits   ctime nanoseconds
    - 32 bits   mtime seconds               * last time the file's data changed
    - 32 bits   mtime nanoseconds
    - 32 bits   device id
    - 32 bits   inode
    - 32 bits   Permission (mode)
    - 32 bits   user id
    - 32 bits   group id
    - 32 bits   file size
    - 160 bits  `blob` hash value
    - 16 bits   flags                       * lower 12 bits hold the filename size (number of bytes in filename string)
    - ?  bytes  filename                    * Variable depending on file name
    - 1-8 bytes padding                     * Variable depending on entry

... The same thing continues by number of entries ....
Now that we have the specifications, we will read them again in python.

The index is not compressed, but we read it in binary mode just like the objects, because all meta information is stored as bytes.

>>> with open('.git/index', 'rb') as f:
...     index = f.read()
>>> index
b'DIRC\x00\x00\x00\x02\x00\x00\x00\x02c\xd9 \xf4\x05\xeb\x80\xb2c\xd9 \xf4\x05\xeb\x80\xb2\x01\x00\x00\x06\x00\xb8\'\x07\x00\x00\x81\xa4\x00\x00\x01\xf5\x00\x00\x00\x14\x00\x00\x00(\xc8\x84;M\xb8\x06\xe5\xd6Z\x12\xefV\xbfK\xeeQ\xe7\x15\'\x93\x00\tfirst.txt\x00c\xd6hv\x17\xa5\x05nc\xd6hv\x17\xa5\x05n\x01\x00\x00\x06\x00\xb8\'\x14\x00\x00\x81\xa4\x00\x00\x01\xf5\x00\x00\x00\x14\x00\x00\x00,\xaf"\x10-b\xf1\xc8\xe6\xdfR\x17\xb4\xcb\xa9\x99\x07X\x0bQ\xaf\x00\\x00TREE\x00\x00\x00\x19\x002 0\n?\xf94\'\'\xca\xf8\x13\x97t\x03\'\xaa@l\x1c\xc6\xd4@\x8e\xf2\xe4\xd7:\x95\xc1?\x18\xd3\xe9\x7f\x8fp\x9c$N\xc9dX\xa4'
It looks readable in places.
You can see the original DIRC, first.txt and!

Since 32bits is 4bytes, it can be easily pulled out.

>>> index[0:4]
b'DIRC' # Index header -> DIRC
>>> index[4:8]
b'\x00\x00\x00\x02' # Index version => 2
>>> index[8:12]
b'\x00\x00\x00\x02' # number of entries => 2
The index manages metadata per file, so you will have two entries, first.txt and

For the purpose of this article, I will just take a quick look at the meta information from the next creation time to the group ID, which is not very important except for the mode.

>>> index[12:16]
b'c\xd9 \xf4' # ctime
>>> index[16:20]
b'\x05\xeb\x80\xb2' # ctime nano
>>> index[20:24]
b'c\xd9 \xf4' # mtime
>>> index[24:28]
b'\x05\xeb\x80\xb2'  # mtime nano
>>> index[28:32]
b'\x01\x00\x00\x06' # dev id
>>> index[32:36]
b"\x00\xb8'\x07" # inode
>>> index[36:40]
b'\x00\x00\x81\xa4' # mode
>>> index[40:44]
b'\x00\x00\x01\xf5' # user id
>>> index[44:48]
b'\x00\x00\x00\x14' # group id
Here are the key points to look at.
First is the file size.

# file size
>>> index[48:52]
>>> index[48:52][0]
>>> index[48:52][1]
>>> index[48:52][2]
>>> index[48:52][3]
The file size of this entry's file turns out to be 40 bytes.

Next is the hash value.

# hash
>>> index[52:72]
>>> index[52:72].hex()

We see the hash value matches the one in version 2 first.txt!

And the size of the filename.

# filename size
>>> index[72:74]
>>> index[72:74][0]
>>> index[72:74][1]

This size (in bytes) is very important. Without it, you would have to guess where the file name ends.

Now that we know the filename is 9 bytes, we can...

>>> index[74:83]
We can extract the file name without missing anything.

Finally, the padding depends on the number of bytes used to represent the entry.
It is the number of null bytes X (between 1 and 8) that makes the length of the entry, including the padding, a multiple of 8.

Expressed as a formula with X (padding), y (filename size), and a (remainder):

a = (62 + y) mod 8,  X = 8 - a

In this case, the fields from the creation time through the filename size take 62 bytes, and the file name is 9 bytes.

a = (62 + 9) mod 8 = 7,  X = 8 - 7 = 1

We found that the padding is 1 byte.

>>> index[83:84]
>>> index[83:85]
b'\x00c' # The second byte is not a null byte, so it already belongs to the next entry!
>>> index[83:86]
The padding up to the creation time of the next entry matched the calculation.
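The slicing we just did by hand can be automated with the struct module. The sketch below is my own reading of the format, not a complete parser: `parse_index` is a made-up name, it assumes a version 2 index, it ignores the extensions (such as TREE) and the checksum at the end, and it assumes file names shorter than 4095 bytes.

```python
import struct


def parse_index(index: bytes) -> list[dict]:
    """Parse the header and entries of a version 2 index (extensions ignored)."""
    signature, version, count = struct.unpack(">4sII", index[:12])
    assert signature == b"DIRC" and version == 2
    entries = []
    pos = 12
    for _ in range(count):
        # ten 32-bit fields, the 20-byte hash, and the 16-bit flags field
        fields = struct.unpack(">10I20sH", index[pos:pos + 62])
        name_len = fields[11] & 0x0FFF  # lower 12 bits hold the name length
        name = index[pos + 62:pos + 62 + name_len]
        entries.append({
            "mode": oct(fields[6]),
            "size": fields[9],
            "hash": fields[10].hex(),
            "path": name.decode(),
        })
        # 1-8 null bytes pad the entry to a multiple of 8 bytes
        pos += 62 + name_len + (8 - (62 + name_len) % 8)
    return entries


# Usage, assuming the current directory is the root of a repository:
# with open('.git/index', 'rb') as f:
#     for entry in parse_index(f.read()):
#         print(entry)
```

The `>` in the format strings means big-endian without alignment, which is how the index stores its numbers.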


In fact, add does not create a tree.
The tree is generated from the index when you commit.

The index has the important role of linking added file data to blobs and managing which versions of files are committed.

You may have heard that Git deals in snapshots, not differences.
In other words, a file whose index entry has not been updated stays in every later commit as-is, unless it is explicitly removed.
And that means that everything you commit can be restored through the index.

The index is an important entity that holds the key to whether or not a file is subject to version control in Git.

Background of Command

Now that we know how Git handles data, let's take a quick look at how the commands behave.

The command has many options, so more complex behavior can be achieved, but I only describe a basic role.


add is responsible for adding, deleting, and updating the target file data in the index.
When a file is added, Git creates a blob from the file data as it is at that moment (the latest).

The plumbing commands that make this happen are hash-object and update-index.
※I describe the details in the Plumbing commands chapter.
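As a rough picture of the blob half of add, here is how storing a blob could look in Python. It is a sketch, not what Git literally does: `write_blob` is a name I chose, and updating the index is left out entirely.

```python
import hashlib
import zlib
from pathlib import Path


def write_blob(git_dir: str, data: bytes) -> str:
    """Store data as a loose blob object and return its hash."""
    store = b"blob " + str(len(data)).encode() + b"\0" + data
    sha = hashlib.sha1(store).hexdigest()
    path = Path(git_dir) / "objects" / sha[:2] / sha[2:]
    if not path.exists():  # identical content is stored only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(store))
    return sha
```

Because the path is derived from the content, adding the same file data twice does not create a second object.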


Git creates a tree corresponding to the repository directory based on the current index, and then creates a commit.
After the commit is successfully created, Git updates the branch that HEAD points to so that it holds the hash value of the new commit.

The plumbing commands that accomplish this are write-tree, commit-tree, and update-ref.
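The first step, turning index entries into a tree, can be sketched like this. It is a simplified illustration with a function name of my own (`build_tree`): it assumes a single flat directory that contains only files, so nested trees and Git's special sorting rule for directory names are not covered.

```python
import hashlib


def build_tree(entries: list[tuple[str, str, str]]) -> tuple[str, bytes]:
    """Build a tree object from (mode, path, blob hash) entries of one directory."""
    content = b""
    for mode, path, sha in sorted(entries, key=lambda e: e[1]):
        # "<mode> <path>\0" followed by the hash as 20 raw bytes
        content += f"{mode} {path}".encode() + b"\0" + bytes.fromhex(sha)
    store = b"tree " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), store


sha, store = build_tree([
    ("100644", "first.txt", "f7f18b17881d80bb87f281c2881f9a4663cfcf84"),
])
print(store[:8])  # b'tree 37\x00'
```

The returned bytes are the uncompressed object, so they would still need zlib compression before being written under objects/.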


Deciphering Tree

Let's look into bytes a little.

What is the maximum value that can be represented by a single (unsigned) byte?
2^8 - 1 = 255. This is also the maximum value that can be represented by two hexadecimal digits (ff).

>>> temp[2][0]
247  # = `\xf7`
I used the hex() function quickly above, but if you look at it one byte at a time...

>>> hex_hash = ''
>>> for byte in temp[2][0:20]:
...     hex_hash += format(byte, '02x')  # zero-pad so bytes below 0x10 keep two digits
>>> hex_hash
I can get the hash value of the blob corresponding to first.txt as a string!

A hash is 40 characters long, and each character is one hexadecimal digit. So the trick is that one byte can represent two characters.

commit stores the hash value as a string, but for some reason the tree stores the hash value directly as bytes, not as a string.
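Python's built-in bytes type can convert between the two forms directly, which makes the relation between 40 characters and 20 bytes easy to check:

```python
hex_hash = "f7f18b17881d80bb87f281c2881f9a4663cfcf84"

raw = bytes.fromhex(hex_hash)   # 40 hex characters -> 20 raw bytes (as in a tree)
print(len(hex_hash), len(raw))  # 40 20
print(raw[0])                   # 247, i.e. 0xf7
print(raw.hex() == hex_hash)    # True (back to the string form used in a commit)
```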

There was some discussion on stackoverflow as to why.

HEAD and Branch

The Branch is responsible for marking specific commit objects.
It is stored under .git/refs/heads/.
You can easily see the contents with the Linux command cat.

Since we were working on the master branch earlier, we can look at .git/refs/heads/master and see ...

% cat .git/refs/heads/master
The hash value of the last committed commit object was stored.

HEAD indicates which commit object you are basing your file edits on.
HEAD can point directly to a commit object, but it normally goes through a branch.
It is stored in .git/HEAD.

The data is stored as follows.

% cat .git/HEAD
ref: refs/heads/master
It contained the path about where the master branch is stored.

If you want to point directly to a commit (detached head), use checkout to move HEAD.

% git checkout 37349c9b05c73281008e7b6b7453b595bb034a52
% cat .git/HEAD
37349c9b05c73281008e7b6b7453b595bb034a52
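Following HEAD to a commit hash can be sketched as below. The function name `resolve_head` is my own, and I assume the branch is stored as a plain file under refs/ (Git can also keep refs in a packed-refs file, which this sketch does not read).

```python
from pathlib import Path
from typing import Optional


def resolve_head(git_dir: str) -> Optional[str]:
    """Return the commit hash HEAD points to, or None if the branch has no commit yet."""
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds the hash itself
    ref_file = Path(git_dir) / head[len("ref: "):]
    # assumption: the ref is a loose file, not packed into packed-refs
    return ref_file.read_text().strip() if ref_file.exists() else None
```

With the files shown above, `resolve_head('.git')` would first read `ref: refs/heads/master` and then return the hash stored in .git/refs/heads/master.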
Plumbing commands

To further manipulate Git at a low level, there is a command for every single action.
(These are god-like commands created by Mr. Linus for ordinary people like me.)


git cat-file allows you to see the contents of an object.
We worked hard earlier to analyze the object, but this single command is the solution.

# See object type
% git cat-file -t af22102d62f1c8e6df5217b4cba99907580b51af #

# Output object content
% git cat-file -p af22102d62f1c8e6df5217b4cba99907580b51af #
def second():
    print("This is")
git hash-object computes the hash of file data and, with the -w option, also stores it as a blob in .git/objects.

Let's create

struct Third {
    message: String   
# calculate hash value
% git hash-object

# Create blob object
% git hash-object -w
% ls .git/objects/4a
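git update-index adds the target file to the index.
Note, however, that when it is given a hash with --cacheinfo it only records that hash and creates no object, so the blob must already exist.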


git ls-files provides a concise view of the contents of the index.

# see the latest index
% git ls-files

# add index cache
% git update-index --add 
% git ls-files
% git ls-files -s
100644 c8843b4db806e5d65a12ef56bf4bee51e7152793 0       first.txt
100644 af22102d62f1c8e6df5217b4cba99907580b51af 0
100644 4aa58eed341d5134f73f2e9378b4895e216a5cd5 0
git write-tree creates trees based on the contents of the index.
It creates one for every directory, not just the repository directory.

% git write-tree
% ls .git/objects/10 
git commit-tree creates a commit from the hash of the (repository directory) tree.

# Enter the hash value of the parent `commit` and the 
# hash value of the `tree` you just created
% git commit-tree -p 37349c9b05c73281008e7b6b7453b595bb034a52 -m 'third commit' 109e41a859caa3e3b87e8f59744b0b1845efe275
% ls .git/objects/dd 
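What commit-tree assembles can be sketched in Python as follows. This is an illustration under assumptions: `build_commit` is a made-up name, and the author and committer strings are passed in already formatted as name, email in angle brackets, timestamp, and time zone.

```python
import hashlib


def build_commit(tree: str, parents: list[str], author: str,
                 committer: str, message: str) -> tuple[str, bytes]:
    """Build a commit object and return (hash, uncompressed object data)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # no parent line on the first commit
    lines += [f"author {author}", f"committer {committer}", "", message, ""]
    content = "\n".join(lines).encode()
    store = b"commit " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), store
```

The resulting hash depends on the timestamps too, so running the real command twice with the same tree and message gives two different commits unless the dates are identical.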


Running commit-tree alone does not make the new commit part of the history, because nothing refers to it yet.
git update-ref changes the commit that a branch points to.

# Because the git log follows the history sequentially 
# from the commit pointed to by HEAD, the commit you
# just created is not yet referenced.
% git log
commit 37349c9b05c73281008e7b6b7453b595bb034a52 (HEAD -> master)
Author: nopeNoshishi <>
Date:   Tue Jan 31 23:08:59 2023 +0900


commit 48c972ae2bb5652ada48573daf6d27c74db5a13f
Author: nopeNoshishi <>
Date:   Sun Jan 29 21:37:40 2023 +0900


# Change the branch's references.
% git update-ref refs/heads/master ddb3c0d 37349c9 # new-hash old-hash
% git log
commit ddb3c0d94d860ff657e2cdb82f5513f7db2924f1 (HEAD -> master)
Author: nopeNoshishi <>
Date:   Thu Feb 2 21:17:24 2023 +0900

    third commit

commit 37349c9b05c73281008e7b6b7453b595bb034a52
Author: nopeNoshishi <>
Date:   Tue Jan 31 23:08:59 2023 +0900


When creating Git, it is difficult to build something as sophisticated as add or commit right away.
Therefore, in the development section we will implement the plumbing commands first, and then create add and commit by going through the functionality of those commands.


Thank you for reading all the way to the end!!!
This is still a rough explanation, but I hope it contributes to your understanding.
If you like it, please star my repository!

Reference Site

Official Documentation

What you need

Listed here are the key elements in making git.



Bitwise operation

n-decimal system and character strings


Compression algorithms

Hash function

File system



1: zlib is free software for compressing data losslessly. Its main compression algorithm, called Deflate, is very interesting. (Official Site)


2: SHA-1 is one of the very famous SHA family of hash functions, characterized by the generation of a 160-bit (20-byte) hash value. Incidentally, the probability of an accidental collision between SHA-1 hash values is said to be astronomically small. (The Reality of SHA1)

3: When you specify a hash value directly in a Git command, you can abbreviate it to a prefix such as the first 7 characters. As mentioned in footnote 2, there are almost no hash collisions, so even a short prefix is enough to find a specific object. It is similar to the feeling of pressing tab in a shell to receive input assistance.

4: Compressed data is stored in a form that does not correspond to any character encoding. Therefore, it cannot be read as text in a specific encoding.


5: The mode (file type and permission) is written as an octal number, and Git uses only a handful of values, such as 100644 for a regular file, 100755 for an executable file, and 40000 for a directory.


