Git Internals: How does Git store your data?

#git #repository #project

by Pragati Verma

In this article, we’ll be learning about the basics of the data storage mechanism for git.

The most fundamental term we know regarding git and data storage is repositories. Let’s first understand what a git repository is and where it stands in terms of data storage in git.

Repositories
A git repository can be seen as a database containing all the information needed to retain and manage the revisions and history of a project. In git, repositories are used to retain a complete copy of the entire project throughout its lifetime.

Git maintains a set of configuration values within each repository such as the repository user’s name and email address. Unlike the file data or other repository metadata, configuration settings are not propagated from one repository to another during a clone, or fork, or any other duplication operation. Instead of this, git manages and stores configuration settings on a per-site, per-user, and per-repository basis.

Inside a git repository, there are two data structures – the object store and the index. All of this repository data is stored at the root of your working directory inside a hidden folder named .git. You can read more about what’s inside your .git folder here.

As part of the system that allows a fully distributed VCS, the object store is intended to be effectively replicated during a cloning process. The index is temporary data that is private to a repository and may be produced or edited as needed.

Let’s discuss object storage and index in further depth in the next section.

Git Object Types
Object store lies at the heart of the git’s data storage mechanism. It contains your original data files, all the log messages, author information, and other information required to rebuild any version or branch of the project.

Git places the following 4 types of objects in its object store which form the foundation of git’s higher-level data structures:

blobs
trees
commits
tags
Let’s look a bit more about these object types:

Blobs
A blob represents each version of a file. “Blob” is an abbreviation for “binary big object,” a phrase used in computers to refer to a variable or file that may contain any data and whose underlying structure is disregarded by the application.

A blob is considered opaque it contains the data of a file but no metadata or even the file’s name.

Trees
A tree object represents a single level of directory data. It saves blob IDs, pathnames, and some metadata for all files in a directory. It may also recursively reference other (sub)tree objects, allowing it to construct a whole hierarchy of files and subdirectories.

Commits
Each change made into the repository is represented by a commit object, which contains metadata such as the author, commit date, and log message.

Each commit links to a tree object that records the state of the repository at the moment the commit was executed in a single full snapshot. The initial commit, also known as the root commit, has no parents and the following most of the commits have single parents.

A Directed Acyclic Graph is used to arrange commits. For those who missed it in Data Structures, it simply implies that commits “flow” in one way. This is usually just the trail of history for your repository, which might be very basic or rather complicated if you have branches.

Tags
A tag object gives a given object, generally a commit, an arbitrary but presumably human-readable name such as Ver-1.0-Alpha.

All of the information in the object store evolves and changes over time, monitoring and modeling your project’s updates, additions, and deletions. Git compresses and saves items in pack files, which are also stored in the object store, to make better use of disc space and network traffic.

Index
The index is a transient and dynamic binary file that describes the whole repository’s directory structure. More specifically, the index captures a version of the general structure of the project at some point in time. The state of the project might be represented by a commit and a tree at any point in its history, or it could be a future state toward which you are actively building.

One of the primary characteristics of Git is the ability to change the contents of the index in logical, well-defined phases. The indicator distinguishes between gradual development stages and committal of such improvements.

How does git monitor object history?
The Git object store is organized and implemented as a storage system with content addresses. Specifically, each item in the object store has a unique name that is generated by applying SHA1 to the object’s contents, returning a SHA1 hash value.

Because the whole contents of an object contribute to the hash value, and because the hash value is thought to be functionally unique to that specific content, the SHA1 hash is a suitable index or identifier for that item in the object database. Any little modification to a file causes the SHA1 hash to change, resulting in the new version of the file being indexed separately.

For monitoring history, Git keeps only the contents of the file, not the differences between separate files for each modification. The contents are then referenced by a 40-character SHA1 hash of the contents, which ensures that it is almost certainly unique.

The fact that the SHA1 hash algorithm always computes the same ID for identical material, regardless of where that content resides, is a significant feature. In other words, the same file content in multiple folders or even on separate machines produces the same SHA1 hash ID. As a result, a file’s SHA1 hash ID is a globally unique identifier.

Every object has an SHA, whether it’s a commit, tree, or blob, so get to know them. Fortunately, they are easily identified by the first seven characters, which are generally enough to identify the entire string.

One fantastic benefit of saving only the content is that if you have two or more copies of the same file in your repository, Git will only save one internally.

Conclusion
In this article, we learned about the two primary data structures used by git to enable data storage, management, and tracking history. We also discussed the 4 types of object types and the different roles played by them in git’s data storage mechanism.

This was all for this article, I hope you find it helpful. These are the fundamental components of Git as we know it today and use on a regular basis.

Already a developer interested in new tools and technologies? Take the Developer Nation survey, share your views for 2023 trends and shape the future of the Developer Ecosystem. You will get a virtual goody bag with free resources, plus a chance to win an iPhone 13, a Samsung Galaxy S22, Amazon vouchers and more. Start here

About the author, Deborah Graham
Deborah Graham is a professional developer. She has over 20 years of experience with various programming languages.
She was asked by Developer Nation to document her journey from WWDC 2022 to an updated version of the app, totally re-written in SwiftUI, into the AppStore. Follow along and see what works, what doesn’t work, and let’s see if she can get the app ready for version 2.0! You can email the author here FromInterpretedBasicToSwiftUI@gmail.com

DEV Community

Git Internals: How does Git store your data?

Top comments (0)

Join the Community