Bhagirath

Posted on Jan 15

How Git Stores Files Internally to Save Space in Your Repository

#git #github #versioncontrol #productivity

Learn how Git stores files internally using snapshots, blobs, trees, and hashing to avoid duplication and save repository space efficiently.

Git is the most widely used version control system in the world, and one of the key reasons for its popularity is its highly efficient storage model. At first glance, Git appears to store a complete copy of your project every time you commit. Surprisingly, repositories remain compact even after thousands of commits.

So how does Git appear to copy every file at every commit while still saving disk space?

In this article, we will explore how Git stores files internally, how it avoids unnecessary duplication, and why its storage mechanism is both fast and space-efficient. By the end, you will clearly understand how Git manages file data under the hood and why it scales so well for large projects.

Overview: How Git Stores Data Efficiently

Unlike traditional version control systems such as Subversion (SVN), which store file differences between versions, Git takes a fundamentally different approach.

Git stores snapshots of the entire project state at every commit.

However, Git is smart enough not to duplicate unchanged data. If a file has not changed between commits, Git simply reuses the previously stored version instead of saving a new copy. This design enables Git to deliver:

Faster operations (branching, merging, checkout)
Reduced disk usage
Strong data integrity and reliability

1. How Git Stores Data Using Snapshots Instead of File Differences

Many older version control systems model history as a series of per-file changes over time. Git's object model does not.

Every time you create a commit, Git records a snapshot of the entire file structure at that moment.

What Happens When Files Don’t Change?

If a file remains unchanged between commits:

Git does not store the file again
Git simply creates a reference to the existing stored content

This means Git behaves like a content-addressable filesystem, where identical content is stored once and referenced many times.
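The content-addressable idea can be sketched in a few lines of Python. This is a toy illustration, not Git's implementation: the `ContentStore` class and its method names are hypothetical, an in-memory dictionary stands in for the object database, and hashing the raw bytes directly is a simplification.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the content."""

    def __init__(self):
        self.objects = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content always maps to the same key,
        # so a second put() of the same bytes stores nothing new.
        self.objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = ContentStore()
first = store.put(b"print('hello')\n")
second = store.put(b"print('hello')\n")  # same content, stored "again"
third = store.put(b"print('bye')\n")
print(first == second, len(store.objects))  # True 2
```

Three writes leave only two entries in the store, because the key depends on the content and nothing else.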

Why This Matters

This snapshot model allows Git to:

Instantly switch between branches
Perform fast merges
Avoid recalculating diffs repeatedly

2. Git Object Model: How Files Are Stored Internally

Git stores file contents, directory listings, and commit history as objects inside the .git/objects directory. Each object is identified by a cryptographic hash of its content.

There are four primary object types in Git (tag objects are created only for annotated tags; lightweight tags are plain references):

Blob — File contents
Tree — Directory structure
Commit — A snapshot with metadata
Tag — Named references to commits

2.1 Blob Objects: File Content Storage

A blob (Binary Large Object) represents the raw content of a file.

Key characteristics of blobs:

Store file data only (no filename or permissions)
Identical file contents result in identical blob hashes
Stored only once, regardless of how many commits reference them

Why Blobs Enable De-duplication

If two files — or the same file across commits — have identical content:

Git stores one blob
Multiple commits point to the same blob

This is the foundation of Git’s space-saving mechanism.
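As a rough sketch of why identical content yields an identical blob, the snippet below computes a blob ID the way Git does with SHA-1: it hashes a short header (`blob`, the content length, a NUL byte) followed by the content. The function name `git_blob_id` and the file names are assumptions made for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two differently named files with the same bytes (hypothetical paths).
files = {
    "docs/notes.txt": b"test content\n",
    "backup/notes-copy.txt": b"test content\n",
}
ids = {name: git_blob_id(data) for name, data in files.items()}

print(ids["docs/notes.txt"])
print(len(set(ids.values())))  # number of distinct blobs needed: 1
```

The file name never enters the hash, which is why two paths with the same bytes share one blob. In a SHA-1 repository, the printed ID should match what `git hash-object` reports for a file with the same content.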

You can list the blobs that a commit's root tree references using:

git ls-tree <commit-hash>

2.2 Tree Objects: Directory Structures

A tree object represents a directory in your project.

It contains:

File names
File permissions
References to blob objects
References to other tree objects (subdirectories)

Each directory in your project maps to a tree object, allowing Git to recreate the complete filesystem structure for any commit.
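To make the tree format more concrete, here is a simplified sketch of how a tree's ID can be derived from its entries. Each entry is serialized as the mode, a space, the name, a NUL byte, and the 20-byte binary object ID. The helper name `git_tree_id` is hypothetical, and the sort is a simplification: Git's real ordering compares directory names as if they ended in `/`.

```python
import hashlib


def git_tree_id(entries):
    """entries: list of (mode, name, hex_object_id) tuples for one directory."""
    body = b""
    # Simplification: plain byte-order sort by name.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # blob ID of an empty file

# A subdirectory is just another tree, referenced by its ID from the parent.
src_tree = git_tree_id([("100644", "main.py", EMPTY_BLOB)])
root_tree = git_tree_id([
    ("100644", "README.md", EMPTY_BLOB),
    ("40000", "src", src_tree),
])
print(root_tree)
```

Names and modes live in the tree, not in the blob, so renaming a file changes the tree's ID while the blob it points to is reused unchanged.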

2.3 Commit Objects: Snapshots in Time

A commit object ties everything together.

It contains:

A reference to the root tree
Author and committer information
Commit message
Parent commit(s)

Commit Structure Example

Commit
└── Tree (Root Directory)
    ├── Blob (File 1)
    ├── Blob (File 2)
    └── Tree (Subdirectory)
        ├── Blob (File 3)
        └── Blob (File 4)

Each commit represents a complete snapshot, but most data is reused from earlier commits.
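A commit object is a small text record, which the following sketch approximates. It assumes the common layout (a `tree` line, zero or more `parent` lines, `author` and `committer` lines, a blank line, then the message) and ignores optional headers such as signatures. The function name and the identity string are hypothetical.

```python
import hashlib


def git_commit_id(tree, parents, author, committer, message):
    """Serialize a commit in Git's text layout and return its SHA-1 ID."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    body = ("\n".join(lines) + "\n\n" + message).encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of the empty tree
who = "Example Author <author@example.com> 1700000000 +0000"  # hypothetical identity

first = git_commit_id(tree, [], who, who, "Initial commit\n")
second = git_commit_id(tree, [first], who, who, "Second commit\n")
print(first != second)  # True
```

Both commits here point at the same tree ID. A commit adds only a few lines of metadata on top of whatever trees and blobs already exist.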

3. Inside the `.git` Directory: Git’s Internal Storage and Control System

The .git directory is the core of every Git repository. It stores all metadata, objects, and references.

3.1 `.git/objects/`

This directory stores all Git objects (blobs, trees, commits, and annotated tags) in compressed form, either as individual "loose" files or combined into packfiles. Objects are named using their hash values.
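For objects stored as individual files, the hash also determines the file's location: the first two hex characters name a subdirectory and the remaining characters name the file. The helper below is a hypothetical illustration of that layout. It does not cover objects that Git has packed under `.git/objects/pack`.

```python
def loose_object_path(object_id: str, git_dir: str = ".git") -> str:
    """Return the path where a loose object with this ID would live."""
    return f"{git_dir}/objects/{object_id[:2]}/{object_id[2:]}"


print(loose_object_path("d670460b4b4aece5915caf5c68d12f560a9fe3e4"))
# .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```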

3.2 `.git/refs/`

References to branches and tags live here. Each branch is simply a pointer to a commit.

3.3 `.git/index` (Staging Area)

The index tracks what will be included in the next commit. It bridges the gap between your working directory and the repository.

3.4 `.git/HEAD`

The HEAD file points to the currently checked-out branch or commit.
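The relationship between HEAD and the files under refs can be sketched as a two-step lookup. This is a simplified illustration: the function name `resolve_head` is hypothetical, the directory is a throwaway stand-in for `.git`, and references stored in `packed-refs` are not handled.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Return (ref_name_or_None, commit_id) for the current HEAD."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        # Symbolic HEAD: follow the pointer to the branch file.
        ref = head[len("ref: "):]
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            return ref, f.read().strip()
    # Detached HEAD: the file holds a commit ID directly.
    return None, head


# Build a fake, minimal layout to exercise the lookup.
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    fake_commit = "d670460b4b4aece5915caf5c68d12f560a9fe3e4"  # placeholder ID
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write(fake_commit + "\n")
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    result = resolve_head(git_dir)

print(result)
```

A branch is a small file containing one hash, which is why creating or switching branches does not copy any project data.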

4. How Git Uses Hashing, Compression, and De-duplication to Save Space

Git’s efficiency comes from three core techniques.

4.1 Content-Addressable Hashing

Git computes a hash (SHA-1 by default, SHA-256 supported) for every object based on its content.

Same content → same hash
Different content → different hash

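Because an object's name is derived from its content, Git can detect corruption by recomputing the hash, and identical content is never stored under two different names.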

4.2 Object Compression

Git compresses objects using zlib, reducing disk usage while maintaining fast access.
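The effect of zlib on a typical text file can be seen with Python's standard `zlib` module. This sketch builds the bytes Git would store for a loose blob (header plus content) and compresses them. The sample content is made up, and real savings depend on the data.

```python
import zlib

content = b"line of text\n" * 200  # repetitive sample content
stored = b"blob " + str(len(content)).encode("ascii") + b"\0" + content

compressed = zlib.compress(stored)
restored = zlib.decompress(compressed)

print(len(stored), len(compressed))  # uncompressed vs. compressed size in bytes
print(restored == stored)            # compression is lossless
```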

4.3 Automatic De-duplication

Git never stores the same content twice. If a file hasn’t changed:

No new blob is created
Existing blobs are reused

This is how Git duplicates files logically without duplicating data physically.

5. From Working Directory to Commits: How Git Builds and Stores Snapshots

To fully understand how Git duplicates files while saving space, it is essential to understand the three logical areas through which every change flows: the working directory, the staging area, and the commit history. These are not just conceptual layers — they directly influence how Git creates objects and reuses existing data.

5.1 Working Directory

The working directory is the actual project folder on your local machine. It contains real files that you edit using your editor or IDE.

Key characteristics:

Files here exist outside of Git’s object database
Changes are not tracked automatically
Git does not store anything permanently at this stage

When you modify a file in the working directory:

Git can detect the change (for example, when you run git status)
No new blob is created yet
No disk space inside .git/objects is used

This design allows Git to remain fast and lightweight while you experiment with changes.

5.2 Staging Area (Index)

The staging area, also called the index, is where Git begins its internal storage optimization.

When you run:

git add <file>

Git performs the following actions:

Reads the file content from the working directory
Computes a hash based on the content
Checks whether an identical blob already exists
Reuses the existing blob or creates a new one if needed
Records the blob reference in .git/index

Important details:

The staging area stores references, not copies
Unchanged files reuse existing blob objects
Partial staging is supported, allowing fine-grained commits

This is where Git’s de-duplication logic begins to take effect.
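The steps that `git add` performs can be sketched as follows. This is a loose approximation under stated assumptions: `stage` is a hypothetical function, two dictionaries stand in for the object database and the index, and the real index records more than a path and a hash (file mode and stat data, for example).

```python
import hashlib
import os
import tempfile


def stage(path, objects, index):
    """Mimic the staging steps for one file; return True if a new blob was written."""
    with open(path, "rb") as f:                           # read the file content
        content = f.read()
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    blob_id = hashlib.sha1(header + content).hexdigest()  # hash the content
    is_new = blob_id not in objects                       # does the blob exist already?
    if is_new:
        objects[blob_id] = content                        # create it only if needed
    index[os.path.basename(path)] = blob_id               # record the reference
    return is_new


objects, index = {}, {}
with tempfile.TemporaryDirectory() as workdir:
    a = os.path.join(workdir, "a.txt")
    b = os.path.join(workdir, "b.txt")
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(b"same content\n")
    created = [stage(a, objects, index), stage(b, objects, index)]
    with open(a, "wb") as f:
        f.write(b"edited content\n")
    created.append(stage(a, objects, index))

print(created, len(objects))  # [True, False, True] 2
```

Staging the second file writes nothing because its content is already present. Only the edit to `a.txt` produces a second blob.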

5.3 Commit History

When you run:

git commit

Git creates a commit object, which includes:

A reference to a tree object
Metadata (author, timestamp, message)
A reference to the parent commit (or commits, in the case of a merge)

Crucially:

Git does not duplicate file content
The new tree references existing blobs whenever possible
Only changed files produce new blobs

Each commit represents a complete snapshot, but internally, most data is shared across commits. This allows Git to maintain a full project history without ballooning repository size.
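To see how much two consecutive snapshots share, the sketch below models each commit as a mapping from path to blob ID and counts the distinct blobs. The file names and contents are invented, and trees and commit metadata are left out to keep the focus on blob reuse.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def snapshot(files):
    """Map each path to the ID of the blob holding its content."""
    return {path: blob_id(data) for path, data in files.items()}


commit_1 = snapshot({"README.md": b"# Demo\n", "app.py": b"print(1)\n", "LICENSE": b"MIT\n"})
commit_2 = snapshot({"README.md": b"# Demo\n", "app.py": b"print(2)\n", "LICENSE": b"MIT\n"})

all_blobs = set(commit_1.values()) | set(commit_2.values())
new_in_commit_2 = set(commit_2.values()) - set(commit_1.values())

print(len(all_blobs), len(new_in_commit_2))  # 4 1
```

Two full snapshots of three files need four blobs rather than six, and the second commit contributes only the one blob for the changed file.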

6. Exploring Git’s Internals Using Low-Level Git Commands

One of Git’s strengths is transparency. Git provides low-level commands that allow you to inspect its internal object database, making it easier to understand how files are stored and reused.

These commands are especially valuable for developers who want to understand Git beyond everyday workflows.

6.1 `git cat-file`: Viewing Raw Git Objects

The git cat-file command allows you to inspect any Git object directly.

To view a commit object:

git cat-file -p <object-hash>

This displays:

The referenced tree
Parent commit
Author and committer details
Commit message

You can also inspect blob objects to see the file content they hold. The same blob hash appearing in several commits confirms that identical content is reused rather than stored again.
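What such an inspection involves for a loose object can be approximated in Python: decompress the bytes, split the header from the body at the first NUL byte, and read the type and size from the header. The function name `read_object` is hypothetical, and the sample object is built in memory rather than read from a real repository.

```python
import zlib


def read_object(compressed: bytes):
    """Return (type, size, body) from the zlib-compressed bytes of a loose object."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, int(size), body


# Build a sample object in memory instead of reading one from .git/objects.
sample = zlib.compress(b"blob 13\0test content\n")
print(read_object(sample))  # ('blob', 13, b'test content\n')
```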

6.2 `git ls-tree`: Exploring Tree Structures

The git ls-tree command shows how a commit or tree maps to files and directories.

git ls-tree <commit-hash>

Output includes:

File permissions
Object type (blob or tree)
Object hash
File or directory name

This command clearly demonstrates how Git builds directory snapshots using tree objects that reference blob objects, without duplicating data.
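The default output of `git ls-tree` puts the mode, type, and hash on each line separated by spaces, followed by a tab and the name. The sketch below parses lines in that shape. The `TreeEntry` type and the sample lines are made up for illustration, and quoting of unusual file names is not handled.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    kind: str       # "blob" or "tree"
    object_id: str
    name: str


def parse_ls_tree_line(line: str) -> TreeEntry:
    # The metadata and the name are separated by a single tab.
    meta, name = line.rstrip("\n").split("\t", 1)
    mode, kind, object_id = meta.split(" ")
    return TreeEntry(mode, kind, object_id, name)


# Hypothetical output lines; the hashes are placeholders.
sample_output = [
    "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tREADME.md",
    "040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\tsrc",
]
entries = [parse_ls_tree_line(line) for line in sample_output]
for entry in entries:
    print(entry.kind, entry.name)
```

Running the real command against two commits and comparing the hashes in the output is a quick way to see which blobs and subtrees are shared between them.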

6.3 `git rev-parse`: Resolving References to Hashes

The git rev-parse command helps resolve symbolic references into their actual object hashes.

git rev-parse HEAD

Use cases include:

Verifying which commit a branch points to
Debugging detached HEAD states
Understanding reference resolution

This reinforces the idea that branches and tags are lightweight pointers, not copies of data.

Conclusion: Why Git’s Storage Model Is So Powerful

Git’s ability to duplicate files logically without duplicating data physically is the cornerstone of its performance and scalability. By storing content as immutable, hashed objects and reusing them across commits, Git ensures that repositories remain fast and space-efficient — even with extensive histories.

Key Takeaways

Git stores snapshots, not file diffs
Identical file content is stored only once and reused
Blobs, trees, and commits form Git’s object model
The .git directory contains all internal data
Hashing and compression ensure integrity and efficiency

Understanding Git’s internal storage model gives you deeper confidence when working with branches, rebases, merges, and large repositories. It also helps explain why Git's everyday operations stay fast and dependable as a project's history grows.
