Walter Hrad

Posted on Jul 5

How Git Actually Works Under the Hood

#computerscience #git #programming #tutorial

Most developers use Git every day and understand almost none of it. That's not an insult, it's just the reality of how most people learn tools. You pick up the commands that get you through the day, you memorize the ones that fix the situations you keep breaking, and you build a working mental model that is almost entirely wrong at the mechanical level.

The mental model most people carry looks something like this: Git tracks changes to files. When you commit, it saves a snapshot of what changed. Branches are pointers to different lines of work. That's roughly correct at a surface level, but it skips over the actual machinery in a way that leaves you confused every time something unexpected happens. Why does rebasing rewrite history? Why are commits immutable? Why does detached HEAD state exist? Why can you lose work in ways that feel impossible if Git is just tracking changes?

The answers are all in the object model, and the object model is surprisingly simple once you sit with it.

Git is a content-addressable filesystem

Before any of the version control concepts, Git is a key-value store. You put content in, you get a hash back. You use that hash later to retrieve the content. That's the entire foundation, and everything else is built on top of it.

By default, the hash Git uses is SHA-1, producing a 40-character hexadecimal string. (Newer versions of Git can also create SHA-256 repositories, but SHA-1 is what you'll find almost everywhere.) When you run git hash-object on a file, Git takes the content, prepends a small header describing the object type and size, and runs SHA-1 over the whole thing. The resulting hash is both the key and the identity of that content. Two files with identical content will always produce the same hash. A file whose content changes even slightly will produce a completely different hash.
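To make the header-plus-content idea concrete, here is a minimal Python sketch of what git hash-object computes for a blob in a SHA-1 repository. The function name git_blob_hash is made up for this example; only the "blob <size>\0" header format comes from Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Header: object type, a space, the content length in decimal, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same bytes in, same hash out, no matter what the file is called.
print(git_blob_hash(b"hello world\n"))   # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# A one-character change produces an unrelated hash.
print(git_blob_hash(b"hello world!\n"))
```

If you have Git installed, echo 'hello world' | git hash-object --stdin should print the same first hash.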

This is the first thing that breaks people's mental models. In most storage systems, identity is location: a file is "that file" because it lives at that path. In Git's object store, identity is content. The path a file lives at is separate metadata, not the file's identity.

All of Git's objects live in .git/objects. Go look at it sometime on a real repository. You'll find subdirectories named with two-character hex prefixes, and inside each one, files named with the remaining 38 characters of various hashes. Each of those files is a compressed Git object. (Older objects are usually bundled into packfiles under .git/objects/pack, covered below.) The entire history of your project, every version of every file that ever existed, every commit, every tree, is sitting right there in that directory as a pile of content-addressed objects.
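Here is a rough sketch of that directory layout in Python. It writes an object the way a loose object is laid out on disk (zlib-compressed, two-character directory plus 38-character filename) into a throwaway temp directory. The function names are invented for this example, and it leaves out everything real Git does around locking, permissions, and packfiles.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, obj_type, body):
    """Store an object under objects_dir/<first 2 hex chars>/<remaining 38>."""
    store = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))  # header and body are compressed together
    return oid


def read_loose_object(objects_dir, oid):
    """Look an object up by hash and return (type, body)."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, body


with tempfile.TemporaryDirectory() as objects_dir:
    oid = write_loose_object(objects_dir, "blob", b"hello world\n")
    stored_path_exists = os.path.isfile(os.path.join(objects_dir, oid[:2], oid[2:]))
    round_trip = read_loose_object(objects_dir, oid)
    print(oid[:2], oid[2:], round_trip)
```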

The four object types

Git has exactly four types of objects: blobs, trees, commits, and tags. That's it. The entire version control system is built from those four things.

Blobs

A blob is file content, nothing else. Not a filename, not a path, not permissions. Just the raw bytes of a file at a particular moment in time.

If you have two files in your repository with the same content, they share a single blob. Git doesn't store duplicates. The filename and location of those files are stored elsewhere. The blob itself is just content.

This is why Git is efficient in ways that surprise people. If you have a thousand files and you change one of them, Git only needs to store one new blob. The other 999 files haven't changed, so their blobs already exist in the object store, and the new tree structure just references them by hash. Nothing is copied.
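The thousand-files example is easy to simulate. The sketch below uses a toy in-memory store (a Python dict keyed by hash, with a made-up ObjectStore class) standing in for .git/objects, and counts how many objects exist before and after changing one file.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: hash -> content. Not Git's real storage."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        oid = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
        self.objects.setdefault(oid, content)  # storing existing content is a no-op
        return oid


store = ObjectStore()
files = {f"file{i}.txt": f"contents of file {i}\n".encode() for i in range(1000)}

# First snapshot: every file is new, so every blob is new.
snapshot1 = {path: store.put_blob(data) for path, data in files.items()}
count_after_first = len(store.objects)

# Change one file and snapshot again: 999 blobs are already there.
files["file7.txt"] = b"edited\n"
snapshot2 = {path: store.put_blob(data) for path, data in files.items()}
count_after_second = len(store.objects)

changed = [p for p in files if snapshot1[p] != snapshot2[p]]
print(count_after_first, count_after_second, changed)
```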

You can look at a blob directly using git cat-file -p <hash>. Run git ls-files --stage in any repository and you'll see the blob hashes for every file in your current index alongside their filenames and modes. Pick any hash from that list and cat-file it, and you'll see the raw file content.

Trees

A tree is Git's representation of a directory. It contains a list of entries, where each entry is a mode, a type, a hash, and a name. The type is either blob or tree, because directories can contain files and other directories.

A tree for a simple directory with two files and one subdirectory might look like this:

100644 blob a8c6a8d9...    README.md
100644 blob 3f1b2c4e...    main.go
040000 tree 9d2e1f7a...    internal

That tree has a hash. The subdirectory internal is itself another tree object with its own hash. Every file in it is a blob with a hash. The whole directory structure of your project at any given moment is represented as a tree of hashes pointing to other hashes.

This structure is a Merkle tree, the same data structure that shows up in Bitcoin and a lot of other systems where you need to verify large amounts of data efficiently. If any blob anywhere in your directory tree changes, its hash changes, which changes the hash of the tree containing it, which changes the hash of any parent tree, which changes the hash of the root tree. The root tree hash is a fingerprint of your entire directory structure at that moment. If two root tree hashes are equal, every file in every directory is byte-for-byte identical.
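The hash propagation is easier to believe once you see it computed. This sketch serializes tree objects the way Git does for SHA-1 repositories (each entry is mode, space, name, NUL, then the 20 raw hash bytes, sorted by name) and shows a change in a nested file bubbling up to the root. The file internal/util.go and the helper names are invented for the example.

```python
import hashlib


def hash_object(obj_type: bytes, body: bytes) -> str:
    return hashlib.sha1(obj_type + b" " + str(len(body)).encode() + b"\0" + body).hexdigest()


def tree_body(entries):
    """entries: (mode, name, hex_oid) tuples. Mode '40000' marks a subtree."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts directory names as if they ended with a slash.
        return (name + "/" if mode == "40000" else name).encode()

    out = b""
    for mode, name, oid in sorted(entries, key=sort_key):
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return out


def build_root(readme: bytes, main_go: bytes, util_go: bytes) -> str:
    """Hash a project with README.md, main.go and internal/util.go."""
    internal = hash_object(b"tree", tree_body([
        ("100644", "util.go", hash_object(b"blob", util_go)),
    ]))
    return hash_object(b"tree", tree_body([
        ("100644", "README.md", hash_object(b"blob", readme)),
        ("100644", "main.go", hash_object(b"blob", main_go)),
        ("40000", "internal", internal),
    ]))


before = build_root(b"# Project\n", b"package main\n", b"package internal\n")
after = build_root(b"# Project\n", b"package main\n", b"package internal\n// edit\n")
print(before != after)  # a change two levels down changes the root hash
```

Note that the mode is written as 40000 inside the object itself. The leading zero in 040000 is added when Git prints the tree for you.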

Commits

A commit object contains four things: a pointer to a root tree, zero or more pointers to parent commits, author and committer metadata, and a commit message.

The pointer to the root tree is what gives a commit its snapshot of the entire project. When you check out a commit, Git reads that commit's tree hash, then recursively resolves all the trees and blobs in it, and reconstructs your working directory from those objects. There's no concept of a "diff" in the object model. Every commit has a full snapshot, but because blobs and trees are deduplicated by content, the actual storage cost of a new commit is only the objects that didn't exist before.

The pointer to parent commits is what makes the history graph. A regular commit has one parent, the previous commit. A merge commit has two or more parents, one for each branch that was merged. The first commit in a repository has no parent.

This is the second thing that breaks mental models. The history graph is not a sequence of diffs. It is a directed acyclic graph of snapshots, where each node contains a full picture of the entire project, and edges point backwards in time to parent commits. When you ask Git to show you what changed between two commits, it reconstructs both snapshots from their respective trees and computes the diff on the fly. The diff is not stored anywhere. It's derived.

Commits are also immutable. Once a commit object exists with a given hash, it cannot be changed, because any change to its content would change its hash, making it a different object. When you amend a commit, you're not modifying the existing commit. You're creating a new commit object and moving the branch pointer to it. The old commit still exists in the object store until it gets garbage collected.
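You can see why amending has to produce a new object by hashing a commit yourself. The sketch below builds the text of a commit object (tree line, parent lines, author, committer, blank line, message) and hashes it. The identity string and messages are placeholders, and the function name commit_hash is invented for the example.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # hash of a tree with no entries


def commit_hash(tree, parents, author, committer, message):
    """Hash a commit the way Git does: header + the commit's plain-text body."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode("utf-8")
    return hashlib.sha1(b"commit " + str(len(body)).encode() + b"\0" + body).hexdigest()


who = "A U Thor <author@example.com> 1700000000 +0000"  # placeholder identity

original = commit_hash(EMPTY_TREE, [], who, who, "Initial commit\n")
amended = commit_hash(EMPTY_TREE, [], who, who, "Initial commit (typo fixed)\n")
child = commit_hash(EMPTY_TREE, [original], who, who, "Second commit\n")
child_of_amended = commit_hash(EMPTY_TREE, [amended], who, who, "Second commit\n")

print(original != amended)          # changing the message gives a different commit
print(child != child_of_amended)    # and so does changing only the parent
```

The second comparison matters later: a commit's hash depends on its parent's hash, so changing anything in history changes every commit after it.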

References: the human layer on top of hashes

Hashes are how Git thinks about objects internally. Humans are bad at hashes. References are the layer that makes Git usable for humans.

A reference is a file containing a hash. That's all it is. The file .git/refs/heads/main contains the SHA-1 hash of the commit that the main branch currently points to. The file .git/refs/tags/v1.0.0 contains a hash. The file .git/HEAD contains either a hash (when you're in detached HEAD state) or a symbolic reference to another ref file (when you're on a branch, it contains something like ref: refs/heads/main).

When you create a new commit on the main branch, Git creates the commit object, writes its hash into .git/refs/heads/main, and that's the entire operation of "advancing the branch." The branch didn't grow. The branch pointer moved.

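Branches in Git are not containers. They're not timelines. They're not parallel universes of code. They're a single file containing a single hash. When people say a branch is "just a pointer," they're being completely literal. Go look at .git/refs/heads in any repository. Every branch is a file. Open any of those files. It contains exactly one hash. (One housekeeping exception: Git sometimes consolidates refs into a single .git/packed-refs file, one line per ref, so a branch may be missing from refs/heads. The idea is the same.)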

This is why creating and deleting branches in Git is so cheap compared to other version control systems. There's no data to copy, no history to replicate. Creating a branch is creating a file. Deleting a branch is deleting a file. The commits the branch pointed to are unaffected; they still exist in the object store.
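As a sketch of how little is involved, here are branch creation, advancing, and deletion written as plain file operations in Python. This runs against a temp directory, not a real repository, the hashes are placeholders, and the helper names are invented. Real Git also takes lock files and may store refs in packed-refs, both of which are ignored here.

```python
import os
import tempfile


def ref_path(git_dir, ref):
    return os.path.join(git_dir, *ref.split("/"))


def update_ref(git_dir, ref, oid):
    """Point a ref at a commit: write one hash into one file."""
    path = ref_path(git_dir, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(oid + "\n")


def read_ref(git_dir, ref):
    with open(ref_path(git_dir, ref)) as f:
        return f.read().strip()


def delete_ref(git_dir, ref):
    os.remove(ref_path(git_dir, ref))


with tempfile.TemporaryDirectory() as git_dir:
    old_tip, new_tip = "a" * 40, "b" * 40                 # placeholder commit hashes
    update_ref(git_dir, "refs/heads/main", old_tip)
    update_ref(git_dir, "refs/heads/feature", old_tip)    # "create a branch"
    update_ref(git_dir, "refs/heads/main", new_tip)       # "commit on main"
    main_now = read_ref(git_dir, "refs/heads/main")
    feature_now = read_ref(git_dir, "refs/heads/feature")
    delete_ref(git_dir, "refs/heads/feature")             # "delete a branch"
    branches_left = sorted(os.listdir(os.path.join(git_dir, "refs", "heads")))

print(main_now, feature_now, branches_left)
```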

HEAD and what detached HEAD actually means

HEAD is a special reference that tells Git where you currently are. In the normal case, HEAD contains a symbolic reference to a branch, something like ref: refs/heads/main. Git calls this being "attached" to a branch. When you make a commit, Git creates the commit object, updates the branch ref to point to it, and HEAD follows because HEAD points to the branch, not directly to a commit.

Detached HEAD happens when HEAD contains a commit hash directly instead of a branch reference. This happens when you check out a specific commit, a tag, or a remote tracking branch. Git tells you about it because it has consequences: if you make commits in this state, those commits don't belong to any branch. They exist in the object store, and HEAD advances as you commit, but no named reference tracks them. When you switch back to a branch, HEAD moves to that branch, and your detached commits are now orphaned. They'll get garbage collected eventually unless you create a branch pointing to them first.
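The attached/detached distinction comes down to what the HEAD file contains. Here is a small sketch that reads a HEAD file and reports which state it describes. It works on a fake .git directory in a temp folder, the function name describe_head is invented, and it only handles loose ref files (no packed-refs).

```python
import os
import tempfile


def describe_head(git_dir):
    """Report whether HEAD names a branch or holds a commit hash directly."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            return {"state": "attached", "branch": ref, "commit": f.read().strip()}
    return {"state": "detached", "branch": None, "commit": head}


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write("a" * 40 + "\n")                 # placeholder commit hash

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")        # normal case: on a branch
    attached = describe_head(git_dir)

    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("b" * 40 + "\n")                 # e.g. after checking out a raw commit
    detached = describe_head(git_dir)

print(attached)
print(detached)
```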

Once you understand what HEAD actually is, detached HEAD goes from a scary warning to a completely sensible description of what's happening. You're not attached to anything. You're floating at a specific commit with no branch to record where you go from here.

How staging actually works

The staging area, also called the index, is one of the more misunderstood parts of Git. Most people think of it as a temporary holding area for changes on the way to a commit. That's not wrong, but it undersells what it actually is.

The index is a binary file at .git/index that represents a complete snapshot of your project. It contains an entry for every tracked file: the file's path, its mode, its blob hash, and some stat information from the filesystem that Git uses to detect changes without hashing every file on every status check.

When you run git add, Git takes the current content of that file, creates a blob object for it in the object store, and updates the index entry for that file to point to the new blob hash. Nothing else happens. No commit is created. Just a blob object and an updated index entry.

When you run git commit, Git takes the current state of the index, constructs a tree object hierarchy from it, creates a commit object pointing to the root tree and the current HEAD commit as parent, and writes the new commit hash to the current branch ref. HEAD itself doesn't change: it still points at the branch, and the branch now points at the new commit.

The index is essentially a proposed tree. It's the snapshot you're about to commit. That's why partial staging makes sense mechanically: you're selectively updating the index to contain some changes but not others, constructing the exact snapshot you want the commit to represent.
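Here is a toy version of that idea. The index is modeled as a flat dict from path to (mode, blob hash), git_add replaces one entry, and write_tree turns the flat list into nested tree objects. The real index is a binary file with stat data and more, and the function names and file contents below are invented for the sketch.

```python
import hashlib


def hash_object(obj_type: bytes, body: bytes) -> str:
    return hashlib.sha1(obj_type + b" " + str(len(body)).encode() + b"\0" + body).hexdigest()


def git_add(index, path, content):
    """Stage a file: hash its content as a blob and point the index entry at it."""
    index[path] = ("100644", hash_object(b"blob", content))


def write_tree(index):
    """Build nested tree objects from flat 'dir/file' paths; return the root hash."""
    files, subdirs = [], {}
    for path, (mode, oid) in index.items():
        head, sep, rest = path.partition("/")
        if sep:
            subdirs.setdefault(head, {})[rest] = (mode, oid)
        else:
            files.append((mode, head, oid))
    entries = files + [("40000", name, write_tree(sub)) for name, sub in subdirs.items()]
    entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode())
    body = b"".join(m.encode() + b" " + n.encode() + b"\0" + bytes.fromhex(o)
                    for m, n, o in entries)
    return hash_object(b"tree", body)


index = {}
git_add(index, "README.md", b"# Project\n")
git_add(index, "internal/util.go", b"package internal\n")
committed = write_tree(index)

# Suppose both files were edited on disk, but only one is staged.
git_add(index, "internal/util.go", b"package internal\n\n// edited\n")
staged = write_tree(index)

print(committed != staged)  # the proposed snapshot changed
print(index["README.md"])   # README.md still points at its old blob
```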

How merging works

A merge takes two commits and produces a new commit with both of them as parents. But to know what to put in that new commit's tree, Git has to figure out what changed in each branch relative to their common ancestor.

Finding that common ancestor is the job of the merge base algorithm. Git walks the commit graph backwards from both commits simultaneously, looking for the first commit that appears in both paths. Once it has the merge base, it diffs each branch tip against the merge base to find what changed in each, then applies both sets of changes to produce the merged result.

When the two branches changed different files, or different parts of the same file, the merge is automatic. When they changed overlapping parts of the same file, you get a conflict, because Git can't automatically decide whose change should win.
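The decision rule behind a three-way merge fits in a few lines. The sketch below works at whole-file granularity only: each snapshot is a dict from path to blob hash, and a file conflicts if both sides changed it differently. Real Git goes a step further and tries to merge within the file line by line before declaring a conflict. The function name and the short stand-in hashes are made up.

```python
def three_way_merge(base, ours, theirs):
    """Merge path -> blob-hash snapshots. Returns (merged, conflicting_paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o            # both sides agree (or neither touched it)
        elif o == b:
            result = t            # only theirs changed it
        elif t == b:
            result = o            # only ours changed it
        else:
            conflicts.append(path)  # both changed it, in different ways
            continue
        if result is not None:    # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "a1", "b.txt": "b1", "c.txt": "c1"}
ours = {"a.txt": "a2", "b.txt": "b1", "c.txt": "c2"}
theirs = {"a.txt": "a1", "b.txt": "b3", "c.txt": "c4", "d.txt": "d1"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # {'a.txt': 'a2', 'b.txt': 'b3', 'd.txt': 'd1'}
print(conflicts)  # ['c.txt']
```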

A fast-forward merge is a special case where one commit is a direct ancestor of the other. If you're on main and you merge a feature branch, and main hasn't moved since you branched off, there are no divergent changes to combine. Git can just move the main pointer forward to the feature branch's tip. No merge commit is created because no actual merging needed to happen.
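Both the merge base search and the fast-forward check are graph walks. Here is a simplified sketch over a toy commit graph, where each commit maps to its list of parents. Git's real implementation is more careful (it uses commit dates and handles histories with several candidate merge bases), so read this as an illustration of the idea only. All names here are invented.

```python
from collections import deque


def ancestors(graph, start):
    """Every commit reachable from start by following parent links, start included."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(graph[commit])
    return seen


def merge_base(graph, a, b):
    """Walk back from b breadth-first; return the first commit also reachable from a."""
    reachable_from_a = ancestors(graph, a)
    queue, seen = deque([b]), {b}
    while queue:
        commit = queue.popleft()
        if commit in reachable_from_a:
            return commit
        for parent in graph[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


def can_fast_forward(graph, current, other):
    """True if current is an ancestor of other, so the pointer can just move forward."""
    return current in ancestors(graph, other)


# A <- B <- C (main)   and   B <- D <- E (feature)
graph = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}

print(merge_base(graph, "C", "E"))        # B
print(can_fast_forward(graph, "B", "E"))  # True: main at B hasn't moved
print(can_fast_forward(graph, "C", "E"))  # False: main has its own commit
```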

How rebasing works and why it rewrites history

Rebasing is where a lot of people get confused, and the confusion usually comes from thinking about branches as containers rather than pointers.

When you rebase a branch onto another, Git takes the commits in your branch that aren't in the target, and replays them one by one on top of the target. For each commit, it computes the diff between that commit and its parent, then applies that diff on top of the current tip of the target, creating a new commit object.

The key word is "new." The replayed commits are new objects. They have different parent hashes (because their parents changed) and new committer timestamps, which means they have different hashes themselves. The original commits still exist in the object store; they're just no longer reachable from any branch. You haven't moved your commits. You've created copies of them in a new position and moved your branch pointer to the last copy.
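A toy replay shows the hash change directly. The sketch below uses a stripped-down commit format (tree, one parent, message, no author lines) and placeholder hashes, and it keeps each commit's tree identical before and after so that the parent is the only thing that differs. In a real rebase the trees usually change too, since the diff is applied on top of different files.

```python
import hashlib


def commit_id(tree, parent, message):
    """Simplified commit hash: real commits also carry author and committer lines."""
    body = f"tree {tree}\nparent {parent}\n\n{message}\n".encode()
    return hashlib.sha1(b"commit " + str(len(body)).encode() + b"\0" + body).hexdigest()


def replay(commits, base):
    """commits: (tree, message) pairs, oldest first. Chain them on top of base."""
    ids, parent = [], base
    for tree, message in commits:
        parent = commit_id(tree, parent, message)
        ids.append(parent)
    return ids


old_base, new_base = "a" * 40, "b" * 40             # placeholder commit hashes
work = [("1" * 40, "Add parser"), ("2" * 40, "Add tests")]

before = replay(work, old_base)   # the branch as originally committed
after = replay(work, new_base)    # the same work replayed onto a new base

for old, new in zip(before, after):
    print(old[:8], "->", new[:8])
```

Even the second commit, whose own content is untouched, gets a new hash, because its parent is now a different object.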

This is what "rewriting history" means. The commits that exist after a rebase are different objects than the commits that existed before, even if their content is identical. If someone else had pulled your branch before you rebased, they have references to the old commits. After your rebase, those old commits no longer appear in your branch history. Their local branch and your remote branch have diverged. This is why rebasing shared branches causes problems: you've replaced the objects other people are pointing at with new objects, and their local Git has no way of knowing the new ones are meant to replace the old ones.

How git gc and the reflog save you

Git's object store is append-only during normal operation. Every blob, tree, commit, and tag you ever create lives in .git/objects until Git decides to clean up. Objects that aren't reachable from any reference are called unreachable or dangling objects (not to be confused with loose objects, which are simply objects stored as individual files rather than in a packfile), and they accumulate over time from operations like amending commits, rebasing, and resetting branches.

The reflog is what gives you a window to recover from those operations before cleanup happens. Every time HEAD or a branch ref moves, Git appends an entry to the reflog recording where the ref was before and where it moved to. The reflog for HEAD lives at .git/logs/HEAD. Run git reflog and you'll see a timestamped history of every position HEAD has been at in that repository, going back as far as the reflog's expiry window.

When you do a hard reset and realize you needed those commits, or you rebase and want to get back to the pre-rebase state, the reflog is how you find the hash of the commit you want to return to. That commit still exists in the object store. You just need the hash to reach it.
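The reflog file is plain text, one line per move: old hash, new hash, who, when, a tab, and a message. Here is a small parser sketch plus a helper that mimics the HEAD@{n} lookup. The sample lines, identity, and hashes are fabricated for illustration, and the function names are invented.

```python
def parse_reflog_line(line):
    """Split one .git/logs/HEAD line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    ident, timestamp, tz = rest.rsplit(" ", 2)
    return {"old": old, "new": new, "who": ident,
            "time": int(timestamp), "tz": tz, "message": message}


def head_at(lines, n):
    """HEAD@{n}: where HEAD pointed n moves ago (the file is oldest-first)."""
    return parse_reflog_line(lines[-1 - n])["new"]


ZERO, A, B = "0" * 40, "a" * 40, "b" * 40   # placeholder hashes
who = "A U Thor <author@example.com>"
lines = [
    f"{ZERO} {A} {who} 1700000000 +0000\tcommit (initial): first",
    f"{A} {B} {who} 1700000100 +0000\tcommit: second",
    f"{B} {A} {who} 1700000200 +0000\treset: moving to HEAD~1",
]

print(head_at(lines, 0) == A)   # where HEAD is now, after the reset
print(head_at(lines, 1) == B)   # the commit the reset "lost"
print(parse_reflog_line(lines[2])["message"])
```

With the hash from HEAD@{1} in hand, getting the work back is just a matter of pointing a branch at it again.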

git gc is the garbage collector. It finds objects with no references pointing to them (directly or through the reflog, which has its own expiry) and deletes them. By default, reflog entries expire after 90 days, or after 30 days if they point to commits no longer reachable from the branch's current tip, and unreachable objects are pruned once they're more than two weeks old. This is the window you have to recover from mistakes. It's also why truly losing work in Git is harder than it feels in the moment: Git is quite conservative about actually deleting anything.
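Conceptually, deciding what can be collected is a mark phase over the commit graph. The sketch below marks everything reachable from branch refs and reflog entries and reports the rest. It only models commits (not trees or blobs) and ignores the age-based grace periods, so it shows the reachability rule and nothing more. All names are invented.

```python
def reachable(commits, tips):
    """All commits reachable from the given starting points via parent links."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit in commits and commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit])
    return seen


def prunable(commits, refs, reflog_hashes):
    """Commits that no ref and no reflog entry can reach."""
    keep = reachable(commits, list(refs.values()) + list(reflog_hashes))
    return sorted(set(commits) - keep)


# A <- B <- C, then C was amended into C2, so main now points at C2.
commits = {"A": [], "B": ["A"], "C": ["B"], "C2": ["B"]}
refs = {"refs/heads/main": "C2"}

while_reflog_remembers = prunable(commits, refs, reflog_hashes=["C", "C2"])
after_reflog_expires = prunable(commits, refs, reflog_hashes=["C2"])

print(while_reflog_remembers)  # []    : the reflog still protects C
print(after_reflog_expires)    # ['C'] : nothing points at C any more
```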

Packfiles and how Git stores history efficiently

If every object were stored as a separate compressed file, large repositories with long histories would be enormous. Git handles this with packfiles.

A packfile is a single file that contains many objects stored together with delta compression. Instead of storing each version of a file as a complete compressed blob, Git can store one version in full and then store other versions as deltas relative to it. For files that change incrementally over time, this is dramatically more space efficient than storing each version separately.

Packfiles get created by git gc, by git repack, and automatically by Git when the number of loose objects crosses a threshold. When Git reads from a packfile, it reconstructs the requested object on the fly from whatever deltas are needed. From your perspective as a user, this is invisible. You use the same commands, request objects by the same hashes, and Git handles the physical storage transparently.
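To get a feel for delta storage, here is a toy delta built from "copy this range from the base" and "insert these literal bytes" instructions, which is the same basic vocabulary Git's deltas use. Everything else is an assumption for the sake of illustration: the instructions are Python tuples, not Git's binary encoding, and the matching is done with difflib, not Git's own delta search.

```python
import difflib


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # bytes not found in base
        # "delete" emits nothing: those base bytes are simply never copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target on the fly from the base plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line 25, edited\n")

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(base, delta)

print(rebuilt == target, len(target), literal_bytes)
```

The delta only has to carry the few bytes that are actually new. Everything else is a pointer into a version that is already stored.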

The index file that accompanies each packfile (a .idx file, unrelated to the staging index) allows Git to binary search for any object hash in the pack without reading the entire file. This is how Git can efficiently access any object in a repository with millions of commits and hundreds of thousands of files.
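The lookup itself can be sketched in a few lines. A pack index keeps its hashes sorted and starts with a 256-entry fanout table keyed by the first byte, which narrows the binary search to a small slice. The version below models that in memory with Python lists. It returns a position in the sorted list where a real index would give you an offset into the packfile, and the function names are invented.

```python
import bisect
import hashlib


def build_index(oids):
    """Return (fanout, sorted_names). fanout[b] = count of hashes with first byte <= b."""
    names = sorted(bytes.fromhex(oid) for oid in oids)
    fanout = [0] * 256
    for name in names:
        fanout[name[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout, names


def lookup(fanout, names, oid):
    """Binary search for oid, restricted to the slice sharing its first byte."""
    key = bytes.fromhex(oid)
    lo = fanout[key[0] - 1] if key[0] > 0 else 0
    hi = fanout[key[0]]
    pos = bisect.bisect_left(names, key, lo, hi)
    return pos if pos < hi and names[pos] == key else None


# Fake object IDs, just to have something to search through.
oids = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(1000)]
fanout, names = build_index(oids)

found = lookup(fanout, names, oids[123])
missing = lookup(fanout, names, hashlib.sha1(b"not in the pack").hexdigest())
print(found is not None, missing)
```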

What this changes about how you use Git

Understanding the object model doesn't make you memorize fewer commands. It does make the commands make sense in a way that changes how you work.

You stop being afraid of rebasing once you understand it's just replaying commits in a new position, not shuffling some fragile linear history. You stop being afraid of resetting once you know the reflog has your back and objects don't disappear immediately. You stop being confused by detached HEAD because you know exactly what HEAD is and what "detached" describes. You understand why git push --force is dangerous on shared branches, because you understand that it replaces remote refs with hashes pointing to new objects, orphaning whatever the old hashes pointed to.

You also start reading error messages differently. When Git says "your branch and origin/main have diverged," you know that means the commit graphs have branched: the local ref and the remote ref point to different commits that don't have a simple ancestor relationship. When Git says a ref is not valid, you can go look in .git/refs and see for yourself what's there and what isn't. When you need to find a lost commit, you know to check the reflog rather than feeling like something is gone forever.

Git's design is not arbitrary. The object model, the content addressing, the immutability of objects, the lightweight references, the reflog, the packfiles: all of it fits together into a system that is remarkably consistent once you have the right mental model. The learning curve exists because most people learn Git from the outside, picking up commands without ever looking at what the commands are actually doing to the files in .git. Going the other direction, starting with the objects and working outward to the commands, is a slower way to start but a much more durable way to understand.

The next time something goes wrong in Git and your instinct is to blow away the repository and clone fresh, stop and think about what objects exist and what references point to what. The answer to what happened and how to fix it is almost always there, in plain text, sitting in .git.
