Wassim Chegham

Posted on Feb 20, 2022 • Edited on Mar 19, 2022

Understanding Git Internals

#git #programming #beginners #tutorial

Let's explore some common Git commands, and dive into Git internals to understand what happens when you run Git commands.

But first, let's talk about Git itself.

What is Git?

Put simply, Git is an open source distributed version control system. It was designed by Linus Torvalds, creator of the Linux kernel, to manage the source code of the kernel. Git was designed from the start to be as fast and efficient as possible.

Git's Principles

In other version control systems such as CVS, Subversion, and ClearCase, the server is centralized — there's a clear separation between the server and clients.

When developers work on projects that use these systems, they first send a checkout request to the server, then retrieve a snapshot of the current version — usually the most recent one. Everyone has to go through the central server in order to work on the same project, sending commits or creating branches.

With Git, things are different. When we want to work on a project that uses Git, we clone it locally, onto our machine. In other words, Git copies all project files, along with their full history, to our local drive, and then we can work on the project. All operations run locally on our machine. We don't even need a network connection, except to synchronize with other team members, when pushing our changes or pulling new changes.

That's what makes Git so quick.

With Git, we can:

commit changes.
change and create branches.
merge or rebase branches.
retrieve a diff or apply a patch.
recover different versions of the same file.
access the change history of any file.

And we can do most of this without even being connected to the Internet. Amazing, right?

Let's talk about some of the very basic Git commands when it comes to staging and committing changes to Git.

A brief intro to Git basic commands

When we open a folder that contains the source code of an application we are working on, but that doesn't use Git yet, and run git status, Git responds that the current directory is not a Git repository:

That's because we haven't initialized Git in this project. We need to first run git init in the root directory in order to initialize a new Git repository.

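As we can see from the screenshot, we created an empty Git repository, and we are currently on its default branch. That branch is called main here; its name comes from the init.defaultBranch setting, and Git's built-in default is master.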
We can also notice that Git creates a .git/ folder at the root of the project. This hidden directory is Git's internal database. If you wish to make a backup of your project's Git history, simply make a copy of this directory.

Let's run git status again to see the status of our new project:

Git tells us that we haven't added anything to our commit. What Git means here is that we haven't put any content under version control yet. In Git, we can do this in 2 steps:

mark the change to be committed using git add.
commit the change to Git using git commit.

Let's add the content of the current root directory with the git add . command:

The next step is to commit these files using the command git commit -m "my first commit":

Let's change the title of our app in src/index.html and then run the command git diff:

Git shows which changes are waiting to be committed. We can also list the modified files using git status. Once we are sure about our changes, we can commit them using the commands git add . followed by git commit -m "add thundr emoji". Or, we can combine these two commands into one using git commit -am "add thundr emoji" (note the -am options; -a only stages files that Git already tracks, so brand-new files still need git add):

This was a brief intro to some of the basic Git commands.

You can learn more about Git in this Git tutorial.

Now let's look at how Git does all this.

Git Objects

When we create a local repository with git init (or git clone), Git initializes its database, and saves it in a hidden directory called .git/:

If we examine this folder, we'll find several subfolders and files. The most interesting ones are:

HEAD: this file contains the path to the reference of the current branch.
config: the repository configuration file.
objects/: this directory contains every object in our repository (file contents, trees, and commits). Their content is encoded and compressed.
refs/heads/: this directory contains one file per branch. Each file is named after a branch, and its content is the SHA-1 of the last commit.
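To make the relationship between HEAD and refs/heads/ concrete, here is a minimal Python sketch that follows HEAD to the current branch and then to its latest commit. It builds a small stand-in .git directory in a temporary folder so it can run anywhere. The function name resolve_head is hypothetical, and the sketch assumes each branch has its own file under refs/heads/ (real repositories may also keep refs in a packed-refs file, which is ignored here).

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_head(git_dir: Path) -> Tuple[Optional[str], str]:
    """Return (branch name or None, commit hash) for a .git directory.

    Simplification: only loose ref files are read, packed-refs is ignored.
    """
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]  # e.g. "refs/heads/main"
        commit = (git_dir / ref).read_text().strip()
        prefix = "refs/heads/"
        branch = ref[len(prefix):] if ref.startswith(prefix) else ref
        return branch, commit
    # Detached HEAD: the file holds a commit hash directly.
    return None, head


# Build a tiny stand-in for a .git directory (illustration only).
demo_git = Path(tempfile.mkdtemp()) / ".git"
(demo_git / "refs" / "heads").mkdir(parents=True)
(demo_git / "HEAD").write_text("ref: refs/heads/main\n")
(demo_git / "refs" / "heads" / "main").write_text(
    "cc5e291471deb4f264cbad64d8a6570ca1487e83\n"
)

branch, commit = resolve_head(demo_git)
print(branch, commit)
```

Pointing resolve_head at the .git/ folder of a real project should give the same kind of result, as long as the branch has not been moved into packed-refs.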

When we create or edit files and run git add ., Git adds a snapshot of the files to its database. It reads the current content of each edited file, computes a hash with the SHA-1 algorithm, and creates an entry in its database for that version of the file. The key of this entry is the SHA-1 hash, and its value is the content of the file, prefixed with a short header and compressed. Yes, the whole content, not just the lines that changed!



.git/objects/
➜ tree -L 2
.
├── 61
│   └── cef767aa6b95d09a46848c9305d51a7d2ffbdf
├── c4
│   └── 581986199f973ea6b006a7a754486db97104c5
├── ...
├── f5
│   ├── 31992d6edce8355d131c31723398305ee08364
│   └── 6ff47022c7a46eac3aee615d6d9150338524dc
└── ...

Git uses a technique called Content Addressable Storage to store the content of files, as objects, inside the .git/objects/ folder. Basically, Git uses the first 2 characters of the object's name as a subfolder name, i.e. objects/xx/, and the remaining 38 characters as the file name, i.e. objects/xx/yy.

Note that when referring to or inspecting these objects, we can simply use the first 8 characters of the SHA-1 hash, as long as that prefix matches only one object. Those are usually enough!
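Shortened hashes work because a prefix only has to single out one object. Here is a small Python sketch of that lookup, using the four object names from the listing above. The function resolve_prefix is hypothetical; real Git also refuses prefixes shorter than four characters, a rule this sketch leaves out so the ambiguous case can be shown with the hashes at hand.

```python
from typing import Iterable

# The four object names visible in the .git/objects/ listing above.
KNOWN_OBJECTS = [
    "61cef767aa6b95d09a46848c9305d51a7d2ffbdf",
    "c4581986199f973ea6b006a7a754486db97104c5",
    "f531992d6edce8355d131c31723398305ee08364",
    "f56ff47022c7a46eac3aee615d6d9150338524dc",
]


def resolve_prefix(prefix: str, object_ids: Iterable[str]) -> str:
    """Expand an abbreviated hash to the single full hash it matches."""
    matches = [oid for oid in object_ids if oid.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"{prefix!r} is ambiguous: {len(matches)} objects match")
    return matches[0]


print(resolve_prefix("61cef767", KNOWN_OBJECTS))  # unique, expands fine

try:
    resolve_prefix("f5", KNOWN_OBJECTS)  # two objects start with f5
except ValueError as error:
    print(error)
```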

Tree objects

Git uses a simplified UNIX-filesystem-like format to store its content. Content is stored as tree and blob objects. Tree objects are similar to UNIX directory entries, and blob objects are more or less inodes or file contents. A single tree object contains one or more entries, each of which is the SHA-1 hash of a blob or subtree with its associated mode, type, and filename.

We can inspect the current branch using git ls-tree, which prints the entries of the top-level tree of the branch's latest commit. Git resolves the branch name to that commit, and from there to its top-level tree:



@manekinekko/thundr
➜ git ls-tree main
100644 blob 4f9ac26980c156a3d525267010d5f78144b43519    .browserslistrc
100644 blob 0711527ef9d5c117396e6c03290a76658e6384ed    .gitignore
040000 tree 572a53f73bb332280d19f2f7cb0bcac8b32faab5    .vscode
100644 blob 7c582e3fcc8e8d02e8e018eb43959020c2aef847    README.md
100644 blob a1486ac82746b7019361a1758f00dc6fd7a29efd    angular.json
100644 blob c6f92b3680a069987a4c1236ec12c7b377eda1ea    package.json
040000 tree 61cef767aa6b95d09a46848c9305d51a7d2ffbdf    src
100644 blob 82d91dc4a4de57f380b66c59cdd16ff6cd5798e4    tsconfig.app.json
100644 blob f531992d6edce8355d131c31723398305ee08364    tsconfig.json

Notice that the src subdirectory is a tree that points to another tree:



040000 tree 61cef767aa6b95d09a46848c9305d51a7d2ffbdf    src

We can inspect the src tree using its SHA-1 hash:



@manekinekko/thundr
➜ git ls-tree 61cef767aa6b95d09a46848c9305d51a7d2ffbdf
040000 tree dbfd18df3078aaf0f03edff8c237e4248a84759c    app
040000 tree d564d0bc3dd917926892c55e3706cc116d5b165e    assets
040000 tree 5825478d74b5576b0f8330bdf5179b3ec6503fb8    environments
100644 blob 997406ad22c29aae95893fb3d666c30258a09537    favicon.ico
100644 blob c4581986199f973ea6b006a7a754486db97104c5    index.html
100644 blob c7b673cf44b388e9989fe908b78d7d73cd2e1409    main.ts
100644 blob 429bb9ef2d3400363c014fa434775b0f482f6bea    polyfills.ts
100644 blob 90d4ee0072ce3fc41812f8af910219f9eea3c3de    styles.css

Tree objects list one or multiple blobs and sub-trees, with their mode/permissions, the type, the SHA-1 hash, and the file/folder name.
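To see what a tree object looks like on the inside, here is a hedged Python sketch that serializes a few of the entries listed above into Git's binary tree format and parses them back. Each entry is stored as the mode, a space, the name, a NUL byte, and the 20 raw bytes of the SHA-1; the type column printed by git ls-tree is derived from the mode, and directories use mode 40000 internally (ls-tree pads it to 040000). The helper names are made up for this example, and the sort rule is simplified to "directories compare as if their name ended with a slash".

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex SHA-1)


def serialize_tree(entries: List[Entry]) -> bytes:
    """Encode entries in the binary layout of a tree object body."""
    def sort_key(entry: Entry) -> bytes:
        mode, name, _ = entry
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(sha)  # 20 raw bytes, not 40 hex characters
    return body


def hash_tree(entries: List[Entry]) -> str:
    """Compute the SHA-1 name of the tree made of these entries."""
    body = serialize_tree(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def parse_tree(body: bytes) -> List[Entry]:
    """Decode a tree object body back into (mode, name, sha) entries."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        entries.append((mode, name, body[nul + 1:nul + 21].hex()))
        pos = nul + 21
    return entries


src_entries = [
    ("100644", "index.html", "c4581986199f973ea6b006a7a754486db97104c5"),
    ("40000", "app", "dbfd18df3078aaf0f03edff8c237e4248a84759c"),
    ("100644", "main.ts", "c7b673cf44b388e9989fe908b78d7d73cd2e1409"),
]
for mode, name, sha in parse_tree(serialize_tree(src_entries)):
    print(mode, sha, name)
print(hash_tree(src_entries))
```

Since only three of the src entries are used here, the printed tree hash will not match the one in the listing; it only shows that the hash depends on every mode, name, and child hash in the tree.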

Looking at the index.html entry, we notice that the content is stored as a blob:



100644 blob c4581986199f973ea6b006a7a754486db97104c5    index.html

Blob objects

For blobs, we can read their content using the git cat-file command:



@manekinekko/thundr 
➜ git cat-file -p c4581986199f973ea6b006a7a754486db97104c5
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Thundr</title>
  <base href="/">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
</head>
<body>
  <app-root></app-root>
</body>
</html>

Let's update the content of the src/index.html file, and run git add . again. Git performs the same process as before. It creates a new entry in its database, and because the file content changed, the SHA-1 hash has also changed. When we run git commit, Git creates a new tree structure with a new SHA-1 hash:



@manekinekko/thundr
➜ git ls-tree main
...
040000 tree 2d56d5d1aa7fc5f2b4133f5c19088625a2a508db    src

@manekinekko/thundr
➜ git ls-tree 2d56d5d1aa7fc5f2b4133f5c19088625a2a508db
...
100644 blob 83a8d59dbac0603a0b108fadb20af24cb575ecca    index.html

Note that the SHA-1 hashes of both the src tree and the index.html blob have changed because Git created a new tree structure that stores the new file content.

We can again inspect the content of the newly stored blob and see the change we made (we added a thunder ⚡️ emoji in the title):



@manekinekko/thundr
➜ git cat-file -p 83a8d59dbac0603a0b108fadb20af24cb575ecca
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Thundr ⚡️</title>
  <base href="/">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
</head>
<body>
  <app-root></app-root>
</body>
</html>

Next, let's rename that file to, e.g., src/index-2.html, and then commit that change. Interestingly, Git does not create a new entry in the database for this file because the content (and hence the SHA-1 hash) has not changed. However, the SHA-1 of the src tree is now different because Git created a new tree structure:



@manekinekko/thundr
➜ git ls-tree main | grep src
040000 tree c561e2cc7ad896d6880f47df384b19ddb0c602e6    src

@manekinekko/thundr
➜ git ls-tree c561e2cc | grep index-2.html
100644 blob 83a8d59dbac0603a0b108fadb20af24cb575ecca    index-2.html

As you can see, the blob entries for index.html and index-2.html have the same SHA-1 hash, and that's because their content is identical!



100644 blob 83a8d59dbac0603a0b108fadb20af24cb575ecca    index.html
100644 blob 83a8d59dbac0603a0b108fadb20af24cb575ecca    index-2.html

In order to track who saved the snapshots and where they are saved (which exact tree), Git stores this information in Commit objects.

Commit objects

A commit object keeps track of:

the hash of the top-level tree where the file snapshot is located.
the hash of the parent commit (if any).
the author's and committer's names and email addresses, taken from the user.name and user.email configuration settings.
the timestamp of the commit.
the GPG signature of the commit (if any).
a blank line.
the commit message.

Git stores commit objects in the same object database as blobs and trees, as objects of type commit. We can inspect the content of a commit object using the git cat-file command:



@manekinekko/thundr
➜ git cat-file -p cc5e291471deb4f264cbad64d8a6570ca1487e83
tree 9ac02461fe7c94c834d28efa54952e8953002ef0
author Wassim Chegham <github@wassim.dev> 1645222639 +0100
committer Wassim Chegham <github@wassim.dev> 1645222639 +0100
gpgsig -----BEGIN PGP SIGNATURE-----

 iQIzBAA...
 -----END PGP SIGNATURE-----

my first commit
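As a sketch of how these fields fit together, the Python below assembles an unsigned commit body from the values shown above and hashes it the same way as other objects. The function names build_commit and commit_id are hypothetical. Because the GPG signature is left out, the resulting hash is not expected to match cc5e2914...; the point is only that the tree hash, the parent, the identities, the timestamp, and the message all feed into the commit's name.

```python
import hashlib
from typing import Optional


def build_commit(tree: str, parent: Optional[str], identity: str,
                 timestamp: int, tz: str, message: str) -> bytes:
    """Assemble the text of an unsigned commit object."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines.append(f"author {identity} {timestamp} {tz}")
    lines.append(f"committer {identity} {timestamp} {tz}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def commit_id(body: bytes) -> str:
    """SHA-1 of a commit object: 'commit <size>' header, NUL, then the body."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


me = "Wassim Chegham <github@wassim.dev>"
first = build_commit("9ac02461fe7c94c834d28efa54952e8953002ef0", None,
                     me, 1645222639, "+0100", "my first commit")
first_id = commit_id(first)

# A follow-up commit names the first one as its parent (same tree reused
# here purely for illustration).
second = build_commit("9ac02461fe7c94c834d28efa54952e8953002ef0", first_id,
                      me, 1645222700, "+0100", "add thundr emoji")

print(first.decode("utf-8"))
print(first_id, "<-", commit_id(second))
```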

Here's what's actually happening in Git's internal database:

Note that in this diagram, we are using the first 8 characters of the SHA-1 hashes to identify Tree, Blob and Commit objects. Those are enough!

As you may have guessed, Git does not really care about file names. It cares more about their content. Even if we copy a file, Git will not create a new blob in its database; only the tree that lists the file changes. It's just a matter of content and SHA-1 hashes.

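And if you're wondering what Git does when we run git push: Git works out which objects the server is missing, bundles them into a compressed pack in which similar objects can be stored as deltas against one another, and sends that pack to the server. So no, Git does not usually send the whole content of every file.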
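One part of a push can be pictured as a graph walk: Git only needs to send objects the server cannot already reach. The Python sketch below is a deliberately simplified model of that idea, not Git's actual protocol. The object names, the graph, and the function names are all invented for illustration, and the delta compression applied inside the pack is not modeled at all.

```python
from typing import Dict, List, Optional, Set

# Hypothetical object graph: each object lists the objects it points to.
GRAPH: Dict[str, List[str]] = {
    "commit-1": ["tree-1"],
    "tree-1": ["blob-index-v1", "blob-main"],
    "commit-2": ["commit-1", "tree-2"],
    "tree-2": ["blob-index-v2", "blob-main"],
}


def reachable(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Collect every object reachable from start (commits, trees, blobs)."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(graph.get(oid, []))
    return seen


def objects_to_send(graph: Dict[str, List[str]], local_tip: str,
                    remote_tip: Optional[str]) -> Set[str]:
    """Objects reachable locally that the remote does not already have."""
    have = reachable(graph, remote_tip) if remote_tip else set()
    return reachable(graph, local_tip) - have


# The server already has commit-1; we push commit-2.
print(sorted(objects_to_send(GRAPH, "commit-2", "commit-1")))
# -> ['blob-index-v2', 'commit-2', 'tree-2']
```

Note that blob-main is not sent again: it is unchanged, so the server already has an object with that name.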

Voilà!

That was it! Now we know what Git does behind the scenes when we create, edit, commit, and push files.

Follow me on Twitter at @manekinekko for more content!

DEV Community