DEV Community

Bhagirath

How Git Stores Files Internally to Save Space in Your Repository

Learn how Git stores files internally using snapshots, blobs, trees, and hashing to avoid duplication and save repository space efficiently.

Git is the most widely used version control system in the world, and one of the key reasons for its popularity is its highly efficient storage model. At first glance, Git appears to store a complete copy of your project every time you commit. Surprisingly, repositories remain compact even after thousands of commits.

So how does Git record a full snapshot at every commit while still saving disk space?

In this article, we will explore how Git stores files internally, how it avoids unnecessary duplication, and why its storage mechanism is both fast and space-efficient. By the end, you will clearly understand how Git manages file data under the hood and why it scales so well for large projects.


Overview: How Git Stores Data Efficiently

Unlike traditional version control systems such as Subversion (SVN), which store file differences between versions, Git takes a fundamentally different approach.

Git stores snapshots of the entire project state at every commit.

However, Git is smart enough not to duplicate unchanged data. If a file has not changed between commits, Git simply reuses the previously stored version instead of saving a new copy. This design enables Git to deliver:

  • Faster operations (branching, merging, checkout)
  • Reduced disk usage
  • Strong data integrity and reliability

1. How Git Stores Data Using Snapshots Instead of File Differences

Many older version control systems model history as a series of changes to individual files. Git's object model does not.

Every time you create a commit, Git records a snapshot of the entire file structure at that moment.

What Happens When Files Don’t Change?

If a file remains unchanged between commits:

  • Git does not store the file again
  • Git simply creates a reference to the existing stored content

This means Git behaves like a content-addressable filesystem, where identical content is stored once and referenced many times.

Why This Matters

This snapshot model allows Git to:

  • Instantly switch between branches
  • Perform fast merges
  • Avoid recalculating diffs repeatedly

2. Git Object Model: How Files Are Stored Internally

Git stores all repository data as objects inside the .git/objects directory. Each object is identified by a cryptographic hash based on its content.

[Figure: Git internal objects]

There are four primary object types in Git:

  • Blob — File contents
  • Tree — Directory structure
  • Commit — A snapshot with metadata
  • Tag — An annotated tag: a named reference to another object, usually a commit

2.1 Blob Objects: File Content Storage

A blob (Binary Large Object) represents the raw content of a file.

Key characteristics of blobs:

  • Store file data only (no filename or permissions)
  • Identical file contents result in identical blob hashes
  • Stored only once, regardless of how many commits reference them

Why Blobs Enable De-duplication

If two files — or the same file across commits — have identical content:

  • Git stores one blob
  • Multiple commits point to the same blob

This is the foundation of Git’s space-saving mechanism.
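The de-duplication described above can be reproduced outside Git. The following is a minimal Python sketch, assuming the default SHA-1 object format; the function name git_blob_id and the sample file contents are illustrative, not part of Git. It hashes a short header followed by the raw bytes, which is how a blob's ID is derived, and shows that the file name plays no part in the result.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents: two identical copies and one edited version.
readme = b"hello world\n"
copy_of_readme = b"hello world\n"
edited_readme = b"hello world!\n"

print(git_blob_id(readme))
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True
print(git_blob_id(readme) == git_blob_id(edited_readme))   # False
```

Because the ID depends only on the bytes, any two files with the same content resolve to the same blob, no matter where they live in the project.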

You can list the blobs that a commit's root tree references using:

git ls-tree <commit-hash>

2.2 Tree Objects: Directory Structures

A tree object represents a directory in your project.

It contains:

  • File names
  • File modes (for example regular file, executable file, or symbolic link)
  • References to blob objects
  • References to other tree objects (subdirectories)

Each directory in your project maps to a tree object, allowing Git to recreate the complete filesystem structure for any commit.
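To make the contents of a tree concrete, here is a rough Python sketch of how a tree object can be serialized and hashed, assuming the SHA-1 object format. The helper names git_object_id and tree_body are made up for this example, and the sorting is simplified: real Git orders directory entries as if their names ended in "/".

```python
import hashlib


def git_object_id(obj_type: bytes, body: bytes) -> str:
    """Hash an object as '<type> <size>\\0' followed by its body."""
    header = obj_type + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, object_id_hex) entries into a tree body."""
    body = b""
    # Simplified ordering: plain byte order of the entry names.
    for mode, name, oid_hex in sorted(entries, key=lambda e: e[1].encode()):
        # Each entry is "<mode> <name>\0" plus the 20 raw bytes of the object ID.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid_hex)
    return body


# A hypothetical directory holding a single regular file.
readme_blob = git_object_id(b"blob", b"hello world\n")
root = tree_body([("100644", "README.md", readme_blob)])
print(git_object_id(b"tree", root))

# The empty tree has a well-known ID: 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(git_object_id(b"tree", tree_body([])))
```

Note that the file name and mode live in the tree entry, while the blob it points to holds only content. Renaming a file therefore changes the tree but not the blob.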

2.3 Commit Objects: Snapshots in Time

A commit object ties everything together.

It contains:

  • A reference to the root tree
  • Author and committer information
  • Commit message
  • Parent commit(s)

Commit Structure Example

Commit
└── Tree (Root Directory)
    ├── Blob (File 1)
    ├── Blob (File 2)
    └── Tree (Subdirectory)
        ├── Blob (File 3)
        └── Blob (File 4)

Each commit represents a complete snapshot, but most data is reused from earlier commits.
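A commit object is itself just a small piece of text that names a tree and, optionally, its parents. The sketch below builds that text in Python under the SHA-1 object format. The identity string, timestamp, and messages are hypothetical, and the helpers commit_body and commit_id exist only for this illustration.

```python
import hashlib


def commit_body(tree_id, parent_ids, author, committer, message) -> bytes:
    """Build the text of a commit object: headers, blank line, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


def commit_id(body: bytes) -> str:
    """Hash the commit the same way as any other object: header plus body."""
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of the empty tree
WHO = "Example Author <author@example.com> 1700000000 +0000"  # hypothetical identity

first = commit_body(EMPTY_TREE, [], WHO, WHO, "Initial commit")
first_id = commit_id(first)
# The second commit records the first one as its parent.
second = commit_body(EMPTY_TREE, [first_id], WHO, WHO, "Second commit")

print(second.decode())
```

The commit stores only a tree ID, never file content, which is why creating a commit is cheap even in a large project.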


3. Inside the .git Directory: Git’s Internal Storage and Control System

The .git directory is the core of every Git repository. It stores all metadata, objects, and references.

3.1 .git/objects/

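This directory stores all Git objects (blobs, trees, commits, and annotated tags) in zlib-compressed form. A loose object lives in a file named after its hash: the first two hex characters form a subdirectory name and the remaining characters form the file name. Objects that have been packed live in packfiles under .git/objects/pack/.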
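As a small illustration of how a hash maps to a file name, the following Python sketch computes where a loose (unpacked) object would be stored, relative to the repository root. The function name loose_object_path is an assumption made for this example.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str) -> PurePosixPath:
    """Return the path of a loose object, relative to the repository root."""
    # The first two hex characters name a subdirectory, the rest name the file.
    return PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]


print(loose_object_path("3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
# .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

An object that has been moved into a packfile will not be found at this path, so treat the result as a location for loose objects only.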

3.2 .git/refs/

References to branches and tags live here. Each branch is simply a pointer to a commit.

3.3 .git/index (Staging Area)

The index tracks what will be included in the next commit. It bridges the gap between your working directory and the repository.

3.4 .git/HEAD

The HEAD file points to the currently checked-out branch or commit.
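The two cases, a branch or a bare commit, are easy to tell apart from the file's text. Here is a minimal Python sketch that interprets the content of .git/HEAD; the function parse_head and its return labels are invented for illustration.

```python
def parse_head(head_text: str):
    """Classify the content of .git/HEAD as a branch reference or a commit ID."""
    text = head_text.strip()
    if text.startswith("ref: "):
        # Normal case: HEAD names a branch such as refs/heads/main.
        return ("branch", text[len("ref: "):])
    # Otherwise HEAD holds a raw commit ID (a "detached HEAD").
    return ("detached", text)


print(parse_head("ref: refs/heads/main\n"))
print(parse_head("3b18e512dba79e4c8300dd08aeb37f8e728b8dad\n"))
```

In a real repository you would pass in the text read from .git/HEAD; the strings above are sample inputs.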


4. How Git Uses Hashing, Compression, and De-duplication to Save Space

Git’s efficiency comes from three core techniques.

4.1 Content-Addressable Hashing

Git computes a hash (SHA-1 by default, SHA-256 supported) for every object from its type, size, and content.

  • Same content → same hash
  • Different content → different hash

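This protects data integrity, because an object whose content no longer matches its hash can be detected as corrupt, and it prevents duplication, because identical content always maps to the same object.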

[Figure: Content-addressable hashing]

4.2 Object Compression

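Git compresses every object with zlib before writing it to disk, reducing disk usage while keeping access fast. Over time, Git also packs loose objects into packfiles, where similar objects can be stored as deltas against one another.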
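To get a feel for the effect of zlib, this Python sketch encodes content in the loose-object layout (header plus body, then compressed) and reads it back. The helper names encode_loose_object and decode_loose_object and the sample content are assumptions for illustration; the exact compressed size depends on the content and the zlib settings.

```python
import zlib


def encode_loose_object(obj_type: bytes, content: bytes) -> bytes:
    """Compress '<type> <size>\\0<content>' the way a loose object is stored."""
    return zlib.compress(obj_type + b" %d\x00" % len(content) + content)


def decode_loose_object(data: bytes):
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(data)
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


# Highly repetitive sample content compresses very well.
content = b"the same line, repeated many times\n" * 200
stored = encode_loose_object(b"blob", content)

print(len(content), len(stored))
print(decode_loose_object(stored) == (b"blob", content))  # True
```

Text files such as source code tend to compress well, while formats that are already compressed (images, archives) gain little.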

4.3 Automatic De-duplication

Git never stores the same content twice. If a file hasn’t changed:

  • No new blob is created
  • Existing blobs are reused

This is how Git duplicates files logically without duplicating data physically.


5. From Working Directory to Commits: How Git Builds and Stores Snapshots

To fully understand how Git duplicates files while saving space, it is essential to understand the three logical areas through which every change flows: the working directory, the staging area, and the commit history. These are not just conceptual layers — they directly influence how Git creates objects and reuses existing data.

[Figure: Working directory]

5.1 Working Directory

The working directory is the actual project folder on your local machine. It contains real files that you edit using your editor or IDE.

Key characteristics:

  • Files here exist outside of Git’s object database
  • Changes are not recorded automatically
  • Git does not store anything permanently at this stage

When you modify a file in the working directory:

  • Git can detect the change (for example, when you run git status)
  • No new blob is created yet
  • No disk space inside .git/objects is used

This design allows Git to remain fast and lightweight while you experiment with changes.

5.2 Staging Area (Index)

The staging area, also called the index, is where Git begins its internal storage optimization.

When you run:

git add <file>

Git performs the following actions:

  • Reads the file content from the working directory
  • Computes a hash based on the content
  • Checks whether an identical blob already exists
  • Reuses the existing blob or creates a new one if needed
  • Records the blob reference in .git/index

Important details:

  • The staging area stores references, not copies
  • Unchanged files reuse existing blob objects
  • Partial staging is supported, allowing fine-grained commits

This is where Git’s de-duplication logic begins to take effect.
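The steps that git add performs can be imitated with two dictionaries. This is a simplified Python model, not Git's actual implementation: object_store stands in for .git/objects, index stands in for .git/index (the real index also records file modes and other metadata), and the stage function is hypothetical.

```python
import hashlib

object_store = {}  # blob ID -> content, a stand-in for .git/objects
index = {}         # path -> blob ID, a stand-in for .git/index


def stage(path: str, content: bytes) -> bool:
    """Stage a file; return True only if a new blob had to be written."""
    blob_id = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
    created = blob_id not in object_store
    if created:
        object_store[blob_id] = content  # new content: store it once
    index[path] = blob_id                # the index records a reference only
    return created


print(stage("README.md", b"hello\n"))       # True: first time this content is seen
print(stage("docs/README.md", b"hello\n"))  # False: identical content is reused
print(stage("README.md", b"hello again\n")) # True: changed content, new blob
print(len(object_store), len(index))        # 2 2
```

Two paths were staged across three calls, yet only two blobs exist, because one piece of content was shared.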

5.3 Commit History

When you run:

git commit

Git creates a commit object, which includes:

  • A reference to a tree object
  • Metadata (author, timestamp, message)
  • A reference to the parent commit

Crucially:

  • Git does not duplicate file content
  • The new tree references existing blobs whenever possible
  • Only changed files produce new blobs, and only the directories that contain them produce new trees

Each commit represents a complete snapshot, but internally, most data is shared across commits. This allows Git to maintain a full project history without ballooning repository size.
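The sharing between commits can be counted directly. In this Python sketch, each snapshot is modeled as a flat mapping from path to blob ID, which simplifies away nested tree objects; the file names and contents are made up for the example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def snapshot(files: dict, store: dict) -> dict:
    """Record every file in the store and return a path -> blob ID mapping."""
    tree = {}
    for path, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)  # written only if not already present
        tree[path] = oid
    return tree


store = {}
first = snapshot({"a.txt": b"alpha\n", "b.txt": b"beta\n", "c.txt": b"gamma\n"}, store)
# Only b.txt changes in the second snapshot.
second = snapshot({"a.txt": b"alpha\n", "b.txt": b"beta v2\n", "c.txt": b"gamma\n"}, store)

shared = [path for path in first if first[path] == second[path]]
print(len(store))  # 4 blobs in total, not 6
print(shared)      # ['a.txt', 'c.txt']
```

Both snapshots describe three files each, but the store grows by just one blob for the second snapshot.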


6. Exploring Git’s Internals Using Low-Level Git Commands

One of Git’s strengths is transparency. Git provides low-level commands that allow you to inspect its internal object database, making it easier to understand how files are stored and reused.

These commands are especially valuable for developers who want to understand Git beyond everyday workflows.

6.1 git cat-file: Viewing Raw Git Objects

The git cat-file command allows you to inspect any Git object directly.

To view a commit object:

git cat-file -p <object-hash>

This displays:

  • The referenced tree
  • Parent commit
  • Author and committer details
  • Commit message

You can also pass a blob hash to the same command to print the file content that Git stored. If the same blob hash appears in the trees of several commits, those commits share that one stored copy.

6.2 git ls-tree: Exploring Tree Structures

The git ls-tree command shows how a commit or tree maps to files and directories.

git ls-tree <commit-hash>

Output includes:

  • File mode
  • Object type (blob or tree)
  • Object hash
  • File or directory name

This command clearly demonstrates how Git builds directory snapshots using tree objects that reference blob objects, without duplicating data.
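If you want to process this output in a script, each line of the default format has three space-separated fields, then a tab, then the name. The Python sketch below parses one such line; the TreeEntry type, the parse_ls_tree_line function, and the sample lines are illustrative.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str
    kind: str       # "blob" or "tree"
    object_id: str
    name: str


def parse_ls_tree_line(line: str) -> TreeEntry:
    """Parse one line of default `git ls-tree` output."""
    # Format: "<mode> <type> <object>\t<name>"
    meta, name = line.rstrip("\n").split("\t", 1)
    mode, kind, object_id = meta.split(" ")
    return TreeEntry(mode, kind, object_id, name)


# Sample lines in the default output format.
sample = [
    "100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad\tREADME.md",
    "040000 tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\tdocs",
]
for line in sample:
    entry = parse_ls_tree_line(line)
    print(entry.kind, entry.name, entry.object_id[:7])
```

Comparing the object_id fields from two commits is a quick way to see which files and directories they share.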

6.3 git rev-parse: Resolving References to Hashes

The git rev-parse command helps resolve symbolic references into their actual object hashes.

git rev-parse HEAD

Use cases include:

  • Verifying which commit a branch points to
  • Debugging detached HEAD states
  • Understanding reference resolution

This reinforces the idea that branches and tags are lightweight pointers, not copies of data.


Conclusion: Why Git’s Storage Model Is So Powerful

Git’s ability to duplicate files logically without duplicating data physically is the cornerstone of its performance and scalability. By storing content as immutable, hashed objects and reusing them across commits, Git ensures that repositories remain fast and space-efficient — even with extensive histories.

Key Takeaways

  • Git stores snapshots, not file diffs
  • Identical file content is stored only once and reused
  • Blobs, trees, and commits form Git’s object model
  • The .git directory contains all internal data
  • Hashing and compression ensure integrity and efficiency

Understanding Git’s internal storage model gives you deeper confidence when working with branches, rebases, merges, and large repositories. It also helps explain why operations such as branching and switching between commits are fast: they mostly move lightweight pointers rather than copy data.
