Vahid Ghadiri

Posted on Jul 7

A Deep Dive into Git Internals: Blobs, Trees, and Commits

1. Introduction

Have you ever wondered what’s happening under the hood when you run git commit? Git is far more than a version control system—it’s a content-addressable filesystem built on a robust object model. At its core, Git manages your codebase using four primary object types: blobs, trees, commits, and tags. This article dives deep into the three most critical components of Git’s commit history—commit objects, tree objects, and blob objects—to give you a clear, hands-on understanding of how Git organizes and stores your code.

Whether you’re a beginner curious about Git’s magic or a seasoned developer looking to master its internals, this guide will demystify Git’s architecture with practical examples and insights. Let’s explore how Git transforms your files into a structured, efficient, and resilient database.

2. Git as a Content-Addressable Filesystem

At its heart, Git is a content-addressable filesystem, meaning it identifies and stores data based on its content rather than its location or name. Every object (a file’s contents, a directory listing, or a commit) is identified by a SHA-1 hash (or SHA-256 in repositories created with the newer, opt-in SHA-256 object format) computed from a short header plus the object’s contents. These hashes act as fingerprints: even a tiny change in content produces a completely different hash.

Git stores these objects in the .git/objects directory, where they are compressed using zlib to save space. The hash serves as both the object’s identifier and its address in the filesystem, making Git’s storage system efficient and deduplicated.

Try It Out

To see this in action, let’s create a simple object:

echo "hello world" | git hash-object -w --stdin

This command computes the SHA-1 hash of the input (the string "hello world" plus the trailing newline that echo appends), stores it as a blob in .git/objects, and outputs the hash (3b18e51...). The -w flag tells Git to write the object to its database. You’ll find the object in a subdirectory named after the first two characters of the hash (.git/objects/3b/18e512...).

This hash-based system ensures that identical content is stored only once, regardless of how many times it appears in your repository’s history.
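The deduplication claim is easy to check by hand. The sketch below assumes a POSIX shell with git on the PATH and works in a throwaway repository created with mktemp; the variable names are only illustrative. It writes the same content twice and confirms that both writes yield one hash and one file on disk.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

# Write the same content twice; each call prints the object's hash.
first=$(echo "hello world" | git hash-object -w --stdin)
second=$(echo "hello world" | git hash-object -w --stdin)
echo "first:  $first"
echo "second: $second"

# The object lives at .git/objects/<first 2 hex chars>/<remaining chars>.
dir=$(echo "$first" | cut -c1-2)
file=$(echo "$first" | cut -c3-)
ls -l ".git/objects/$dir/$file"

# Count the object files on disk after writing twice.
count=$(find .git/objects -type f | wc -l)
echo "object files: $count"
```

Both variables should hold the same hash, and the object directory should contain a single file for it.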

3. Blob Internals: Storing File Content

A blob (binary large object) is the simplest Git object. It stores the raw content of a file—nothing more, nothing less. Blobs don’t care about file names, permissions, or directory structures; they’re just a snapshot of a file’s contents at a given moment.

Blob Format

A blob is stored with a simple header:

blob <content-length>\0<actual-content>

For example, a file containing exactly the 17 bytes console.log('Hi') (with no trailing newline) would be stored as:

blob 17\0console.log('Hi')
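To make the header concrete, one can reproduce Git’s blob hash without Git’s help. This sketch assumes the sha1sum utility from GNU coreutils is available (on macOS, shasum would be the rough equivalent) and that Git is using its default SHA-1 object format. It hashes the header, a NUL byte, and the content directly, then compares the result with git hash-object.

```shell
# "hello world" plus a newline is 12 bytes, so the header is "blob 12".
# printf emits the header, a NUL byte (\0), then the content.
manual=$(printf 'blob 12\0hello world\n' | sha1sum | cut -d' ' -f1)

# Ask Git for the hash of the same 12 bytes (no -w, so nothing is written).
via_git=$(printf 'hello world\n' | git hash-object --stdin)

echo "manual:  $manual"
echo "via git: $via_git"
```

If the two lines match, the object ID really is just the SHA-1 of the header plus the content.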

Example: Adding a File

Let’s create a file and see how Git handles it:

echo "console.log('Hi')" > index.js
git add index.js
git hash-object index.js

Running git hash-object index.js will output the SHA-1 hash of the blob. The commands below use the illustrative abbreviation 7b19fa88dd; substitute the hash you actually get. To inspect the blob, use:

git cat-file -t 7b19fa88dd    # Outputs: blob
git cat-file -s 7b19fa88dd    # Outputs: size in bytes (18: 17 characters plus the newline echo appends)
git cat-file -p 7b19fa88dd    # Outputs: console.log('Hi')

Why It Matters

Blobs are immutable and content-addressed. If you commit the same index.js file in multiple commits without changing its content, Git reuses the same blob, saving space. This deduplication is a cornerstone of Git’s efficiency.

Pro Tip: If you rename a file but don’t change its content, Git still references the same blob, as the file’s name is stored elsewhere (in tree objects).
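A small experiment can confirm the rename behaviour. This is a sketch in a temporary repository; main.js is a hypothetical new name chosen for the demo. It reads the staged blob hash from the index before and after git mv.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

echo "console.log('Hi')" > index.js
git add index.js
# The index records "<mode> <blob-hash> <stage><TAB><path>" for each staged file.
before=$(git ls-files -s index.js | cut -d' ' -f2)

# Rename the file without touching its content.
git mv index.js main.js
after=$(git ls-files -s main.js | cut -d' ' -f2)

echo "before rename: $before"
echo "after rename:  $after"

# Size of the blob in bytes: 17 characters plus the newline added by echo.
git cat-file -s "$after"
```

The hash is unchanged because the name lives in the index and, later, in a tree entry rather than in the blob.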

4. Tree Internals: Mapping the Directory Structure

While blobs store file content, tree objects represent the directory structure of your project. A tree is like a snapshot of a folder, listing its contents—files (blobs) and subdirectories (other trees)—along with metadata like file names and permissions.

Tree Format

Each entry in a tree object includes:

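File Mode: The entry kind and permissions, e.g., 100644 for a regular file, 100755 for an executable, 120000 for a symbolic link, or 040000 for a subdirectory.
Object Type: Either blob (for files) or tree (for subdirectories). The type is shown by git ls-tree but is not stored in the tree itself; Git derives it from the mode.
SHA-1 Hash: The hash of the referenced blob or tree, stored as 20 raw bytes rather than as hex text.
Filename: The name of the file or directory.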

For example, a tree might look like this:

100644 blob a3c1f80e3d README.md
100644 blob 7b19fa88dd index.js
040000 tree b12fc09b8d src

This tree represents a directory with two files (README.md and index.js) and a subdirectory (src).

Example: Inspecting a Tree

To explore a tree, first find the hash of a commit’s root tree (more on commits later), then use:

git ls-tree <tree-hash>
git cat-file -p <tree-hash>

Both commands print the same human-readable listing for a tree: git cat-file -p pretty-prints it in the git ls-tree format. The stored object itself is binary; git cat-file tree <tree-hash> writes those raw bytes.
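To have a tree to inspect without making a commit first, git write-tree can build one from whatever is staged. The following is a minimal sketch in a temporary repository; the file names and contents (README.md, index.js, src/app.js) are made up for the demo.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

# A small project: two files at the top level and one in a subdirectory.
echo "# Demo" > README.md
echo "console.log('Hi')" > index.js
mkdir src
echo "export {}" > src/app.js
git add .

# write-tree turns the staged content into tree objects
# and prints the hash of the root tree.
tree=$(git write-tree)

git ls-tree "$tree"        # mode, type, hash, and name for each entry
git cat-file -p "$tree"    # the same listing, pretty-printed
git ls-tree -r "$tree"     # recurse into src/ as well

# The stored form is binary: "<mode> <name>", a NUL byte,
# then 20 raw hash bytes per entry.
git cat-file tree "$tree" | od -c | head -n 5
```

The od output is mostly unprintable bytes, which is why the pretty-printed listing is what you normally look at.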

Why It Matters

Trees are the glue that connects blobs into a coherent project structure. They allow Git to track directories and their contents, enabling snapshots of your entire codebase at any point in time.

Pro Tip: Trees are also deduplicated. If two commits reference identical directory structures, Git reuses the same tree object, further optimizing storage.
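Tree reuse can be observed directly. This sketch assumes a temporary repository and a placeholder identity ("Example User"); it creates two commits with identical content and compares the root trees they point at.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello world" > hello.txt
git add hello.txt
git commit -q -m "Add hello"

# A second commit that changes nothing (allowed explicitly with --allow-empty).
git commit -q --allow-empty -m "No changes"

# <commit>^{tree} resolves to the root tree a commit points at.
tree_old=$(git rev-parse 'HEAD~1^{tree}')
tree_new=$(git rev-parse 'HEAD^{tree}')
echo "older commit tree: $tree_old"
echo "newer commit tree: $tree_new"
```

The two commits have different hashes, yet both reference one and the same tree object.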

5. Commit Internals: Capturing Snapshots

A commit object is the heart of Git’s history. It represents a snapshot of your project at a specific point in time, tying together the root tree, metadata, and references to previous commits.

Commit Structure

A commit object contains:

tree: The SHA-1 hash of the root tree, representing the entire project’s directory structure.
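parent(s): The hash(es) of the parent commit(s). The first commit in a repository has none; a merge commit has two or more.
author: The person who originally wrote the change (name, email, and timestamp).
committer: The person who created the commit object (name, email, and timestamp); may differ from the author after rebases or cherry-picks.
message: The commit message describing the changes, separated from the header lines by a blank line.
timestamp: Not a separate field. The author and committer lines each end with their own time, stored as seconds since the Unix epoch plus a time zone offset.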

Example: Anatomy of a Commit

Here’s what the content of a commit object looks like. The commit’s own hash is not stored inside it; it is computed from this content:

tree a8e23f1c0b...
parent 72ba9fc012...
author Vahid <vahid@example.com> 1697059200 +0000
committer Vahid <vahid@example.com> 1697059200 +0000

Fix broken user registration logic

To inspect a commit:

git cat-file -p <commit-hash>

Key Insight

Commits don’t store diffs! Instead, they reference a tree that represents the full state of your project. When you view a diff (e.g., with git diff), Git computes it on the fly by comparing trees.
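One way to see this for yourself is to print a raw commit and then ask Git to compare two root trees. The sketch below assumes a temporary repository with a placeholder identity; git diff-tree is the plumbing command that performs the tree comparison.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello world" > hello.txt
git add hello.txt
git commit -q -m "Add hello"

echo "hello again" > hello.txt
git commit -q -am "Change hello"

# The commit object holds only pointers and metadata, no patch text.
git cat-file -p HEAD

# The change list is derived by comparing the two root trees.
changes=$(git diff-tree -r --name-status 'HEAD~1^{tree}' 'HEAD^{tree}')
echo "$changes"
```

The comparison reports hello.txt as modified even though neither commit recorded a diff.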

Pro Tip: The distinction between author and committer is subtle but important. For example, during a git rebase, the committer might change (you, applying the rebase), while the author remains the original coder.
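The two identities can be set independently, which makes the distinction easy to demonstrate. In this sketch "Alice" and "Bob" are hypothetical people; Git reads the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables when they are set.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q

# Hypothetical identities, used only for this demo.
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Bob" GIT_COMMITTER_EMAIL="bob@example.com"

git commit -q --allow-empty -m "Fix broken user registration logic"

# The raw commit shows separate author and committer lines,
# each ending in its own timestamp and time zone offset.
git cat-file -p HEAD
```

The same information is available in a friendlier form through git log --pretty=fuller.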

6. The DAG: Directed Acyclic Graph

Git’s history is structured as a Directed Acyclic Graph (DAG), where each commit points to its parent(s). This structure enables powerful features like branching, merging, and history traversal.

Visualizing the DAG

Here’s a simple commit history:

* d3e45f2 (HEAD -> main) Update footer text
* c1b8fa2 Add new logo
* b5e4fa1 Initial commit

To see the DAG in action:

git log --oneline --graph --all

This command displays a visual representation of your commit history, showing branches and merges as a graph.

Why It Matters

The DAG makes Git’s history flexible and robust. A branch is just a movable pointer to a commit, and a merge creates a new commit with multiple parents. Understanding the DAG helps you navigate complex histories and resolve merge conflicts with confidence.

Pro Tip: Use git log --graph with --pretty=fuller to see detailed commit metadata alongside the graph.
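A linear history hides most of the graph structure, so here is a sketch that builds a tiny history with one branch and one merge. It assumes a temporary repository, a placeholder identity, and a hypothetical branch named feature; empty commits keep the example short.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.name "Example User"
git config user.email "user@example.com"

git commit -q --allow-empty -m "Initial commit"

# Diverge: one commit on a side branch, one on main.
git checkout -q -b feature
git commit -q --allow-empty -m "Add new logo"
git checkout -q main
git commit -q --allow-empty -m "Update footer text"

# Merge the side branch; --no-ff forces a real merge commit.
git merge -q --no-ff -m "Merge feature" feature

git log --oneline --graph --all

# A branch name resolves to nothing more than a commit hash.
git rev-parse main

# rev-list --parents prints the commit followed by its parents.
fields=$(git rev-list --parents -n 1 HEAD | wc -w)
echo "hashes on the merge commit line: $fields"
```

The merge commit line contains three hashes: the commit itself and its two parents.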

7. Compression and Packfiles: Optimizing Storage

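Initially, Git stores each object as an individual ("loose") file in .git/objects. As your repository grows, Git optimizes storage by combining objects into packfiles. This happens when git gc (garbage collection) runs, whether you invoke it manually or Git triggers it automatically, and also when objects are transferred during a push or fetch.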

Packfiles: Objects are delta-compressed into .pack files, storing only the differences between similar objects to save space.
Index Files: .idx files act as an index for quick access to objects in the packfile.

Example: Inspecting Packfiles

After running git gc, check the packfiles:

git verify-pack -v .git/objects/pack/pack-<hash>.idx

This command lists every object in the packfile with its type, size, packed size, and offset; deltified objects also show their delta depth and base object.
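The effect of packing is easiest to see with git count-objects before and after a garbage collection. This is a sketch in a temporary repository with a placeholder identity and a made-up notes.txt file.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Create a few commits so there is something to pack.
for i in 1 2 3; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "Update notes ($i)"
done
blob=$(git rev-parse HEAD:notes.txt)

git count-objects -v     # "count" is the number of loose objects
git gc --quiet
git count-objects -v     # loose objects are gone; "in-pack" has grown

# Packed objects are still readable by hash.
git cat-file -p "$blob"

# List what ended up in the pack.
for idx in .git/objects/pack/pack-*.idx; do
  git verify-pack -v "$idx"
done
```

Nothing about the object IDs changes; only the on-disk storage does.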

Why It Matters

Packfiles significantly reduce disk usage, especially in large repositories with many similar files or commits. They’re why Git can store years of project history efficiently.

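Pro Tip: You rarely need to run git gc by hand, because Git triggers it automatically when enough loose objects accumulate. Packed objects stay fully readable with git cat-file. The thing to be careful about is that gc also prunes unreachable objects once they are older than the grace period (two weeks by default), so recover lost commits before that happens.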

8. Inspecting Git Internals with Plumbing Commands

Git provides two types of commands: porcelain (user-friendly, like git add or git commit) and plumbing (low-level, for scripting and debugging). Plumbing commands let you peek into Git’s internals.

Key Plumbing Commands

git hash-object -w <file>: Computes the object hash of a file; the -w flag also writes the object to .git/objects.
git cat-file -p <hash>: Displays the content of an object (blob, tree, commit, or tag).
git ls-tree <tree-hash>: Lists the contents of a tree object.
git rev-list --objects --all: Lists all objects reachable from any ref in the repository.

Example: Exploring Your Repository

Find a commit hash with git log, then:

git cat-file -p <commit-hash>  # View commit details
git ls-tree <tree-hash>        # View the root tree
git cat-file -p <blob-hash>    # View a file’s content

Why It Matters

Plumbing commands are your toolkit for debugging and understanding Git’s behavior, especially during complex operations like rebases or recovering lost commits.

Pro Tip: Use git rev-parse HEAD to get the hash of the current commit, then explore its tree and blobs.

9. Putting It All Together

When you run git commit, Git performs a series of steps to create a snapshot of your project:

Creates Blobs: Each staged file is hashed and stored as a blob in .git/objects. This actually happens earlier, when you run git add; by commit time the blobs already exist.
Builds Trees: Git constructs tree objects to represent the directory structure, linking to blobs and subtrees.
Creates a Commit: A commit object is created, referencing the root tree, parent commits, and metadata like the author and message.
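The three steps above can be reproduced with plumbing commands alone, which shows there is nothing hidden inside git commit. This is a minimal sketch, not Git’s actual code path; it assumes a temporary repository and a placeholder identity, and it uses git update-index, git write-tree, git commit-tree, and git update-ref.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# 1. Blob: store the file content.
blob=$(echo "hello world" | git hash-object -w --stdin)

# 2. Tree: stage the blob under a mode and a name, then write the tree.
git update-index --add --cacheinfo 100644,"$blob",hello.txt
tree=$(git write-tree)

# 3. Commit: wrap the tree with metadata and a message read from stdin.
commit=$(echo "Add hello" | git commit-tree "$tree")

# Point the current branch at the new commit so porcelain commands see it.
git update-ref HEAD "$commit"

git log --oneline
git cat-file -p "$commit"
```

Note that no file was ever created in the working directory: the whole snapshot was assembled inside the object database.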

Visual Summary

commit ---> tree ---> blobs
  |           |-- README.md
  |           |-- index.js
  |           +-- src/ (tree) ---> blobs (file1.js, file2.js)

This structure ensures that every commit is a complete, immutable snapshot of your project.

Pro Tip: To visualize this, run git log --oneline --graph and then use git cat-file -p on a commit to trace its tree and blobs.

10. Bonus: Snapshot vs. Diff — A Practical Demonstration

To understand Git’s snapshot-based approach, let’s try a hands-on example:

Create and commit a file:

echo "hello world" > hello.txt
git add hello.txt
git commit -m "Add hello"

Modify the file and recommit:

echo "hello again" > hello.txt
git add hello.txt
git commit -m "Change hello"

Compare the blobs:

git log --oneline  # Find the commit hashes
git cat-file -p <first-commit-hash>  # Get the tree hash
git ls-tree <tree-hash>  # Get the blob hash for hello.txt
git cat-file -p <blob-hash>  # Outputs: hello world

Repeat for the second commit’s blob to see the new content (hello again). Notice that the two blobs have different hashes because their content changed, but unchanged files would reuse the same blob.
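The manual lookups can be shortened with the <commit>:<path> syntax, which resolves directly to the blob a commit records for a path. The sketch below replays the demonstration in a temporary repository with a placeholder identity and adds a hypothetical static.txt that never changes, so blob reuse is visible too.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello world" > hello.txt
echo "never changes" > static.txt
git add hello.txt static.txt
git commit -q -m "Add hello"

echo "hello again" > hello.txt
git add hello.txt
git commit -q -m "Change hello"

# Blob hashes recorded by the first commit (HEAD~1) and the second (HEAD).
hello_old=$(git rev-parse HEAD~1:hello.txt)
hello_new=$(git rev-parse HEAD:hello.txt)
static_old=$(git rev-parse HEAD~1:static.txt)
static_new=$(git rev-parse HEAD:static.txt)

echo "hello.txt:  $hello_old -> $hello_new"
echo "static.txt: $static_old -> $static_new"

git cat-file -p HEAD~1:hello.txt
git cat-file -p HEAD:hello.txt
```

The hello.txt hashes differ between the two commits, while static.txt keeps the same blob in both.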

Why It Matters

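Git’s snapshot-based model (not diff-based) means each commit logically records the full state of your project. Unchanged files and directories simply point at existing blobs and trees, and packfiles delta-compress what remains. Diffs are computed on demand by comparing trees, which keeps git diff flexible, although history-walking operations like git blame can become expensive on long histories.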

Pro Tip: Use git diff <commit1> <commit2> to see how Git computes differences between two snapshots.

11. Key Takeaways

Commits are snapshots, not diffs, referencing a root tree that captures the entire project state.
Trees organize the directory structure, linking to blobs and subtrees.
Blobs store raw file content, independent of names or permissions.
Git’s content-addressable storage and DAG enable efficient deduplication and history manipulation.
Plumbing commands like git cat-file, git ls-tree, and git rev-list unlock Git’s internals for debugging and exploration.
Understanding these concepts helps you tackle merge conflicts, optimize rebases, and recover lost data with confidence.

12. Final Thoughts

Git’s object model—blobs, trees, and commits—transforms it into a powerful, content-addressable database. By storing data as immutable, hash-addressed objects, Git ensures efficiency, resilience, and flexibility. The next time you run git commit, take a moment to appreciate the elegant machinery at work: a deduplicated, compressed, and perfectly organized snapshot of your project’s history.

Mastering Git’s internals not only makes you a better developer but also empowers you to wield Git’s full potential, from resolving complex merge conflicts to scripting custom workflows.

13. Further Exploration

Read the Git Internals chapter of the Pro Git book, starting with: Git Internals - Plumbing and Porcelain
Read the Pro Git Book: Pro Git - Scott Chacon, Ben Straub
