1. Introduction
Have you ever wondered what’s happening under the hood when you run git commit? Git is far more than a version control system: it’s a content-addressable filesystem built on a robust object model. At its core, Git manages your codebase using four primary object types: blobs, trees, commits, and tags. This article dives deep into the three that make up Git’s commit history (commit objects, tree objects, and blob objects) to give you a clear, hands-on understanding of how Git organizes and stores your code.
Whether you’re a beginner curious about Git’s magic or a seasoned developer looking to master its internals, this guide will demystify Git’s architecture with practical examples and insights. Let’s explore how Git transforms your files into a structured, efficient, and resilient database.
2. Git as a Content-Addressable Filesystem
At its heart, Git is a content-addressable filesystem, meaning it identifies and stores data based on its content rather than its location or name. Every piece of data, whether a file, directory, or commit, is assigned a SHA-1 hash (or a SHA-256 hash in repositories created with Git’s newer, opt-in SHA-256 object format) derived from its contents. These hashes act as fingerprints: even a tiny change in content produces a completely different hash.
Git stores these objects in the .git/objects directory, where they are compressed using zlib to save space. The hash serves as both the object’s identifier and its address in the filesystem, making Git’s storage system efficient and deduplicated.
Try It Out
To see this in action, let’s create a simple object:
echo "hello world" | git hash-object -w --stdin
This command computes the SHA-1 hash of the string "hello world" (plus the trailing newline that echo appends), stores the content as a blob in .git/objects, and outputs the hash (3b18e51...). The -w flag tells Git to write the object to its database. You’ll find the object in a subdirectory named after the first two characters of the hash (.git/objects/3b/18e51...).
This hash-based system ensures that identical content is stored only once, regardless of how many times it appears in your repository’s history.
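To make the deduplication concrete, here is a minimal sketch that writes the same content twice and a different content once, then counts the object files on disk. It assumes git is installed and works in a throwaway repository created with mktemp; the variable names are only illustrative.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

# Write the same content twice and a different content once
first=$(echo "hello world" | git hash-object -w --stdin)
second=$(echo "hello world" | git hash-object -w --stdin)
other=$(echo "hello world!" | git hash-object -w --stdin)

echo "first:  $first"
echo "second: $second"
echo "other:  $other"

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining chars>
loose=$(find .git/objects -type f -path '*/[0-9a-f][0-9a-f]/*' | wc -l | tr -d ' ')
echo "loose objects on disk: $loose"
```

The first two hashes should match, so three writes leave only two object files behind.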
3. Blob Internals: Storing File Content
A blob (binary large object) is the simplest Git object. It stores the raw content of a file—nothing more, nothing less. Blobs don’t care about file names, permissions, or directory structures; they’re just a snapshot of a file’s contents at a given moment.
Blob Format
A blob is stored with a simple header:
blob <content-length>\0<actual-content>
For example, a file containing console.log('Hi') (17 bytes, with no trailing newline) would be stored as:
blob 17\0console.log('Hi')
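As a rough check of this format, the sketch below hashes the header and content by hand and compares the result with what Git reports. It assumes the sha1sum utility from GNU coreutils is available (on macOS, shasum would be the closest equivalent) and that the repository uses the default SHA-1 object format.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

content="console.log('Hi')"  # 17 bytes, no trailing newline
size=${#content}

# Hash "blob <size>", a NUL byte, then the content, using sha1sum
by_hand=$(printf 'blob %s\000%s' "$size" "$content" | sha1sum | cut -d ' ' -f 1)

# Ask Git for the blob hash of the same bytes
by_git=$(printf '%s' "$content" | git hash-object --stdin)

echo "by hand: $by_hand"
echo "by git:  $by_git"
```

If the two lines print the same value, the object ID really is just the SHA-1 of the header plus the content.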
Example: Adding a File
Let’s create a file and see how Git handles it:
echo "console.log('Hi')" > index.js
git add index.js
git hash-object index.js
Running git hash-object index.js will output the SHA-1 hash of the blob (e.g., 7b19fa88dd...). To inspect the blob, use:
git cat-file -t 7b19fa88dd # Outputs: blob
git cat-file -s 7b19fa88dd # Outputs: size in bytes (18 here: 17 characters plus the newline echo appends)
git cat-file -p 7b19fa88dd # Outputs: console.log('Hi')
Why It Matters
Blobs are immutable and content-addressed. If you commit the same index.js file in multiple commits without changing its content, Git reuses the same blob, saving space. This deduplication is a cornerstone of Git’s efficiency.
Pro Tip: If you rename a file but don’t change its content, Git still references the same blob, as the file’s name is stored elsewhere (in tree objects).
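A small sketch of the same idea: two files with different names but identical content end up pointing at one blob. The file name main.js is made up for the demonstration, and the repository is a scratch one.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

echo "console.log('Hi')" > index.js
cp index.js main.js          # same content, different name (hypothetical file)
git add index.js main.js

# Each line shows: mode, blob hash, stage number, path
git ls-files -s

index_blob=$(git ls-files -s index.js | cut -d ' ' -f 2)
main_blob=$(git ls-files -s main.js | cut -d ' ' -f 2)
```

Both entries should list the same blob hash, because the name lives in the index and in tree objects, not in the blob.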
4. Tree Internals: Mapping the Directory Structure
While blobs store file content, tree objects represent the directory structure of your project. A tree is like a snapshot of a folder, listing its contents—files (blobs) and subdirectories (other trees)—along with metadata like file names and permissions.
Tree Format
Each entry in a tree object includes:
- File Mode: Permissions, e.g., 100644 for a regular file, 100755 for an executable, or 040000 for a subdirectory.
- Object Type: Either blob (for files) or tree (for subdirectories).
- SHA-1 Hash: The hash of the referenced blob or tree.
- Filename: The name of the file or directory.
For example, a tree might look like this:
100644 blob a3c1f80e3d README.md
100644 blob 7b19fa88dd index.js
040000 tree b12fc09b8d src
This tree represents a directory with two files (README.md and index.js) and a subdirectory (src).
Example: Inspecting a Tree
To explore a tree, first find the hash of a commit’s root tree (more on commits later), then use:
git ls-tree <tree-hash>
git cat-file -p <tree-hash>
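For a tree object, both commands print the same human-readable listing (mode, type, hash, and name). git ls-tree also accepts options such as -r to recurse into subdirectories. The object itself is stored in a compact binary form, which neither command prints byte for byte.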
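You do not need a commit to experiment with trees. The sketch below stages a few files and asks git write-tree to build tree objects from the staging area; the file names and contents are placeholders chosen to mirror the listing above.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

mkdir src
echo "# demo" > README.md
echo "console.log('Hi')" > index.js
echo "export {}" > src/app.js   # hypothetical file inside a subdirectory
git add .

# Build tree objects from the staging area and print the root tree's hash
tree=$(git write-tree)

listing=$(git ls-tree "$tree")
pretty=$(git cat-file -p "$tree")
echo "$listing"
```

The listing should contain two blob entries and one tree entry for src, and the output of the two inspection commands can be compared directly.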
Why It Matters
Trees are the glue that connects blobs into a coherent project structure. They allow Git to track directories and their contents, enabling snapshots of your entire codebase at any point in time.
Pro Tip: Trees are also deduplicated. If two commits reference identical directory structures, Git reuses the same tree object, further optimizing storage.
5. Commit Internals: Capturing Snapshots
A commit object is the heart of Git’s history. It represents a snapshot of your project at a specific point in time, tying together the root tree, metadata, and references to previous commits.
Commit Structure
A commit object contains:
- tree: The SHA-1 hash of the root tree, representing the entire project’s directory structure.
- parent(s): The hash(es) of the parent commit(s). A merge commit has multiple parents.
- author: The person who wrote the code (name and email), followed by a timestamp and timezone offset.
- committer: The person who created the commit, with a timestamp of its own (may differ from the author after rebases or cherry-picks).
- message: The commit message describing the changes, separated from the lines above by a blank line.
Example: Anatomy of a Commit
Here’s what a commit object looks like (the commit’s own hash is computed from this content and is not stored inside it):
tree a8e23f1c0b...
parent 72ba9fc012...
author Vahid <vahid@example.com> 1697059200 +0000
committer Vahid <vahid@example.com> 1697059200 +0000

Fix broken user registration logic
To inspect a commit:
git cat-file -p <commit-hash>
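The following sketch creates a first commit in a scratch repository and prints the resulting object. The author and committer identities are invented and deliberately different, set through Git's standard GIT_AUTHOR_* and GIT_COMMITTER_* environment variables, so the two header lines are easy to tell apart.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

# Demo identities (assumptions for this sketch, not real people)
export GIT_AUTHOR_NAME="Demo Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Demo Committer" GIT_COMMITTER_EMAIL="committer@example.com"

echo "hello world" > hello.txt
git add hello.txt
git commit -q -m "Add hello"

# Print the commit object: headers, a blank line, then the message
raw=$(git cat-file -p HEAD)
echo "$raw"

tree_line=$(echo "$raw" | sed -n '1p')
```

Because this is the first commit in the repository, the output has no parent line; later commits would add one.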
Key Insight
Commits don’t store diffs! Instead, they reference a tree that represents the full state of your project. When you view a diff (e.g., with git diff), Git computes it on the fly by comparing trees.
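To see that a diff needs nothing more than two trees, this sketch builds two tree objects with git write-tree and compares them with the git diff-tree plumbing command, without creating any commits. The file names are placeholders.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q

echo "one" > a.txt
echo "same" > b.txt
git add a.txt b.txt
before=$(git write-tree)     # snapshot 1

echo "two" > a.txt
git add a.txt
after=$(git write-tree)      # snapshot 2

# Compute a patch from the two snapshots on demand
git diff-tree -p "$before" "$after"

changed=$(git diff-tree --name-only "$before" "$after")
echo "changed paths: $changed"
```

Only a.txt should be reported, since b.txt maps to the same blob in both trees.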
Pro Tip: The distinction between author and committer is subtle but important. For example, during a git rebase, the committer might change (you, applying the rebase), while the author remains the original coder.
6. The DAG: Directed Acyclic Graph
Git’s history is structured as a Directed Acyclic Graph (DAG), where each commit points to its parent(s). This structure enables powerful features like branching, merging, and history traversal.
Visualizing the DAG
Here’s a simple commit history:
* d3e45f2 (HEAD -> main) Update footer text
* c1b8fa2 Add new logo
* b5e4fa1 Initial commit
To see the DAG in action:
git log --oneline --graph --all
This command displays a visual representation of your commit history, showing branches and merges as a graph.
Why It Matters
The DAG makes Git’s history flexible and robust. A branch is just a movable pointer to a commit, and a merge creates a new commit with multiple parents. Understanding the DAG helps you navigate complex histories and resolve merge conflicts with confidence.
Pro Tip: Use git log --graph with --pretty=fuller to see detailed commit metadata alongside the graph.
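A linear history hides the graph, so here is a sketch that forks a branch and merges it back. It uses empty commits to keep the focus on parent pointers; the branch name feature, the merge message, and the identity are assumptions made for the demo.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

git commit -q --allow-empty -m "Initial commit"
trunk=$(git symbolic-ref --short HEAD)   # default branch name varies by setup

git checkout -q -b feature
git commit -q --allow-empty -m "Add new logo"

git checkout -q "$trunk"
git commit -q --allow-empty -m "Update footer text"
git merge -q --no-ff -m "Merge feature" feature

git log --oneline --graph --all

# rev-list --parents prints the commit hash followed by its parents' hashes
words=$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
parent_count=$((words - 1))
echo "parents of the merge commit: $parent_count"
```

The merge commit should report two parents, which is what turns the history from a line into a graph.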
7. Compression and Packfiles: Optimizing Storage
Initially, Git stores objects as individual (“loose”) files in .git/objects. Over time, as your repository grows, Git optimizes storage by packing them into packfiles, either automatically or when you run the git gc (garbage collection) command.
- Packfiles: Objects are delta-compressed into .pack files, storing only the differences between similar objects to save space.
- Index Files: .idx files act as an index for quick access to objects in the packfile.
Example: Inspecting Packfiles
After running git gc, check the packfiles:
git verify-pack -v .git/objects/pack/pack-<hash>.idx
This command lists the objects in the packfile, showing their relationships and compression details.
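Here is a small end-to-end sketch of packing, assuming a scratch repository with two commits and a made-up identity. It uses git count-objects -v to compare loose and packed object counts before and after git gc, then reads an object back by hash.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

echo "hello world" > hello.txt
git add hello.txt
git commit -q -m "Add hello"
echo "hello again" > hello.txt
git add hello.txt
git commit -q -m "Change hello"

git count-objects -v         # "count" is the number of loose objects
git gc -q
git count-objects -v         # packed objects are reported under "in-pack"

for idx in .git/objects/pack/pack-*.idx; do
  git verify-pack -v "$idx"
done

packed=$(git count-objects -v | sed -n 's/^in-pack: //p')

# Packed objects are still readable through the usual commands
content=$(git cat-file -p HEAD~1:hello.txt)
echo "$content"
```

The last command reads the first version of hello.txt out of the packfile, which shows that packing changes where objects live, not how you address them.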
Why It Matters
Packfiles significantly reduce disk usage, especially in large repositories with many similar files or commits. They’re why Git can store years of project history efficiently.
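Pro Tip: Git runs garbage collection automatically when needed, but you can run git gc manually to optimize your repository. Packed objects no longer appear as individual files under .git/objects, yet git cat-file reads them exactly as before. Be aware that git gc can also delete unreachable objects once they are older than the grace period (two weeks by default).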
8. Inspecting Git Internals with Plumbing Commands
Git provides two types of commands: porcelain (user-friendly, like git add or git commit) and plumbing (low-level, for scripting and debugging). Plumbing commands let you peek into Git’s internals.
Key Plumbing Commands
- git hash-object -w <file>: Computes the hash of a file and, because of -w, also writes it to .git/objects as a blob.
- git cat-file -p <hash>: Displays the content of an object (blob, tree, commit, or tag).
- git ls-tree <tree-hash>: Lists the contents of a tree object.
- git rev-list --objects --all: Lists every object reachable from any ref in the repository.
Example: Exploring Your Repository
Find a commit hash with git log, then:
git cat-file -p <commit-hash> # View commit details
git ls-tree <tree-hash> # View the root tree
git cat-file -p <blob-hash> # View a file’s content
Why It Matters
Plumbing commands are your toolkit for debugging and understanding Git’s behavior, especially during complex operations like rebases or recovering lost commits.
Pro Tip: Use git rev-parse HEAD to get the hash of the current commit, then explore its tree and blobs.
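If you would rather not copy hashes by hand, git rev-parse can resolve the tree and blob directly through its revision syntax. The sketch below walks from commit to tree to blob in a scratch repository; the file name and identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

echo "hello world" > hello.txt
git add hello.txt
git commit -q -m "Add hello"

commit=$(git rev-parse HEAD)
tree=$(git rev-parse 'HEAD^{tree}')     # the commit's root tree
blob=$(git rev-parse HEAD:hello.txt)    # the blob stored at that path

# Print the type of each object next to its hash
for obj in "$commit" "$tree" "$blob"; do
  printf '%s %s\n' "$(git cat-file -t "$obj")" "$obj"
done

git cat-file -p "$blob"
```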
9. Putting It All Together
By the time git commit finishes, Git has performed a series of steps to create a snapshot of your project:
- Creates Blobs: Each new or modified file is hashed and stored as a blob in .git/objects. This happens when you stage the file with git add, before the commit itself.
- Builds Trees: git commit constructs tree objects from the staging area (the index) to represent the directory structure, linking to blobs and subtrees.
- Creates a Commit: A commit object is created, referencing the root tree, parent commits, and metadata like the author and message. The current branch is then moved to point at it.
Visual Summary
commit ---> tree ---> blobs
| |-- README.md
| |-- index.js
| +-- src/ (tree) ---> blobs (file1.js, file2.js)
This structure ensures that every commit is a complete, immutable snapshot of your project.
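The same chain can be built by hand with plumbing commands, which is a useful way to convince yourself that the porcelain has no hidden magic. This is a sketch of roughly what happens, not a description of Git's internal code path; the file name, message, and identity are placeholders.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# 1. Blob: store the content (no file on disk is needed)
blob=$(echo "hello world" | git hash-object -w --stdin)

# 2. Index: register the blob under a path and mode
git update-index --add --cacheinfo 100644,"$blob",hello.txt

# 3. Tree: turn the index into a tree object
tree=$(git write-tree)

# 4. Commit: wrap the tree with metadata and a message read from stdin
commit=$(echo "Add hello" | git commit-tree "$tree")

# 5. Ref: point the current branch at the new commit
git update-ref HEAD "$commit"

git log --oneline
```

Note that the working directory is never touched here, so git status would report hello.txt as deleted until you check it out.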
Pro Tip: To visualize this, run git log --oneline --graph and then use git cat-file -p on a commit to trace its tree and blobs.
10. Bonus: Snapshot vs. Diff — A Practical Demonstration
To understand Git’s snapshot-based approach, let’s try a hands-on example:
- Create and commit a file:
echo "hello world" > hello.txt
git add hello.txt
git commit -m "Add hello"
- Modify the file and recommit:
echo "hello again" > hello.txt
git add hello.txt
git commit -m "Change hello"
- Compare the blobs:
git log --oneline # Find the commit hashes
git cat-file -p <first-commit-hash> # Get the tree hash
git ls-tree <tree-hash> # Get the blob hash for hello.txt
git cat-file -p <blob-hash> # Outputs: hello world
Repeat for the second commit’s blob to see the new content (hello again). Notice that the two blobs have different hashes because their content changed, but unchanged files would reuse the same blob.
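The claim about unchanged files is easy to verify with a second file that is never modified. The sketch below repeats the walkthrough with an extra file, notes.txt (a name invented for this demo), and compares blob hashes across the two commits.

```shell
set -eu
repo=$(mktemp -d)            # throwaway repository in a temp directory
cd "$repo"
git init -q
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

echo "hello world" > hello.txt
echo "stays the same" > notes.txt      # hypothetical file that never changes
git add hello.txt notes.txt
git commit -q -m "Add hello"

echo "hello again" > hello.txt
git add hello.txt
git commit -q -m "Change hello"

# Blob hashes for each path in the first and second commit
old_hello=$(git rev-parse HEAD~1:hello.txt)
new_hello=$(git rev-parse HEAD:hello.txt)
old_notes=$(git rev-parse HEAD~1:notes.txt)
new_notes=$(git rev-parse HEAD:notes.txt)

echo "hello.txt: $old_hello -> $new_hello"
echo "notes.txt: $old_notes -> $new_notes"
```

hello.txt should show two different hashes, while notes.txt should show the same hash twice: both snapshots share one blob for it.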
Why It Matters
Git’s snapshot-based model (not diff-based) means each commit refers to the full state of your project. Diffs are computed on demand by comparing snapshots, so git diff can compare any two commits directly, while history-walking commands like git blame have to do more work on long histories.
Pro Tip: Use git diff <commit1> <commit2> to see how Git computes differences between two snapshots.
11. Key Takeaways
- Commits are snapshots, not diffs, referencing a root tree that captures the entire project state.
- Trees organize the directory structure, linking to blobs and subtrees.
- Blobs store raw file content, independent of names or permissions.
- Git’s content-addressable storage and DAG enable efficient deduplication and history manipulation.
- Plumbing commands like git cat-file, git ls-tree, and git rev-list unlock Git’s internals for debugging and exploration.
- Understanding these concepts helps you tackle merge conflicts, optimize rebases, and recover lost data with confidence.
12. Final Thoughts
Git’s object model (blobs, trees, and commits) transforms it into a powerful, content-addressable database. By storing data as immutable, hash-addressed objects, Git ensures efficiency, resilience, and flexibility. The next time you run git commit, take a moment to appreciate the elegant machinery at work: a deduplicated, compressed, and perfectly organized snapshot of your project’s history.
Mastering Git’s internals not only makes you a better developer but also empowers you to wield Git’s full potential, from resolving complex merge conflicts to scripting custom workflows.
13. Further Exploration
- Read the official Git Internals Book: Git Internals - Plumbing and Porcelain
- Read the Pro Git Book: Pro Git - Scott Chacon, Ben Straub