How Git Actually Stores Your Code: Blobs, Trees, and Commits

Most people picture Git as a tool that records changes — a stack of diffs layered on top of each other. That mental model is wrong, and it makes Git feel mysterious. Git is really a small key-value database that stores snapshots, and once you see the four object types it uses, commands like reset, checkout, and rebase stop being magic.

Git is a content-addressed object store

Everything Git tracks lives in .git/objects as an object, and every object has an ID that is the hash of its own content (prefixed with a short header giving the object's type and size). By default that hash is a 40-character SHA-1 digest; newer Git versions can also create repositories that use SHA-256. The same bytes always produce the same ID, so the ID is the content's address: change one byte and you get a completely different ID, and therefore a different object. This is why Git data is effectively immutable. You never edit an object in place; you create a new one with a new name.

You can look inside any object with git cat-file. The -t flag prints the type, -p pretty-prints the content:

$ git cat-file -t 3b18e512
blob
$ git cat-file -p 3b18e512
hello world
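The ID in that example is not arbitrary, and you can reproduce it without Git. The sketch below assumes the default SHA-1 object format; git_blob_id is a name made up for this post, not a Git API. It hashes the header "blob <size>", a NUL byte, and then the raw file bytes:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# "hello world" plus a trailing newline, as echo would write it.
print(git_blob_id(b"hello world\n"))   # starts with 3b18e512, the ID used above
# One extra byte produces an unrelated-looking ID.
print(git_blob_id(b"hello world!\n"))
```

If you have Git installed, echo 'hello world' | git hash-object --stdin should print the same first ID.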

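There are exactly four object types: blob, tree, commit, and tag. The first three carry your files and history; a tag object is an annotated, optionally signed label pointing at another object, usually a commit.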

A blob is just file contents — raw bytes, with no filename and no metadata. The blob for README.md knows nothing about being named README.md; it only knows what's inside.

A tree is a directory listing. It maps names to other objects: each entry has a mode (which marks it as a regular file, an executable, a symlink, or a subdirectory), a name, and the hash of either a blob (a file) or another tree (a subdirectory). Trees are how Git represents folder structure. Inspecting one shows exactly that:

$ git cat-file -p HEAD^{tree}
100644 blob a906cb...    README.md
040000 tree fe8e3b...    src
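Under the pretty-printed listing, a tree is a compact binary record. Here is a minimal sketch of that encoding, assuming SHA-1 and covering only regular files and subdirectories. The helper names object_id and tree_body are invented for illustration, and the empty tree stands in for a real src directory. Each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the child's ID. The stored mode for a directory is 40000; cat-file pads it to 040000 for display.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: '<kind> <size>', NUL, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries):
    """Serialize (mode, name, hex_id) tuples into the body of a tree object."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as if they ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return body

readme = object_id("blob", b"hello world\n")
src = object_id("tree", b"")  # an empty tree, used here only as a stand-in
root = tree_body([("40000", "src", src), ("100644", "README.md", readme)])
print(object_id("tree", root))
```

Notice that the filename lives in the tree entry, not in the blob, which is why renaming a file without changing it creates a new tree but no new blob.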

A commit ties it together. A commit object points to exactly one top-level tree (the full state of your project at that moment), plus the hash of its parent commit (none for the very first commit, two or more for a merge), the author and committer with timestamps, and the commit message. Running git cat-file -p HEAD shows these fields in plain text. Because each commit names its parent, the commits form a chain (more precisely, a directed acyclic graph), and that graph is your history.
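A commit is plain text, so it is easy to build one by hand. The sketch below assumes SHA-1 and an unsigned commit with no extra headers. The identity, timestamp, and helper names are made up, and the well-known empty tree stands in for a real project tree:

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: '<kind> <size>', NUL, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree, parents, author, committer, message):
    """Assemble the text of a commit object from its fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]   # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()

EMPTY_TREE = object_id("tree", b"")
who = "Ada Example <ada@example.com> 1700000000 +0000"  # made-up identity and time

first = object_id("commit", commit_body(EMPTY_TREE, [], who, who, "Initial commit"))
second = object_id("commit", commit_body(EMPTY_TREE, [first], who, who, "Second commit"))

print(commit_body(EMPTY_TREE, [first], who, who, "Second commit").decode())
print(first, second)
```

Because the parent's ID is part of the hashed text, changing any ancestor changes the ID of every commit after it.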

Snapshots, not diffs

Here is the part that surprises people: a commit stores a snapshot of your entire tree, not a diff against the previous commit. Each commit points to a complete tree describing every file in the project at that point.

That sounds wasteful, but it isn't, because of content addressing. If a file didn't change between two commits, its blob hash is identical, so both commits' trees point at the very same blob object. Git stores that blob once. The same goes for unchanged directories: an unchanged subdirectory yields an identical tree object, reused across commits. A commit that touches one file in a deep folder only creates new objects along that one path; everything else is shared by reference.
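To see the sharing in action, here is a toy in-memory object store. It is a sketch, not Git's implementation: store, put, and write_tree are invented names, the file contents are made up, and the tree encoding is simplified (plain name ordering, regular files only). It writes one snapshot, changes a single deeply nested file, writes a second snapshot, and counts what is new:

```python
import hashlib

store = {}  # object ID -> (kind, body): a stand-in for .git/objects

def put(kind: str, body: bytes) -> str:
    oid = hashlib.sha1(f"{kind} {len(body)}".encode() + b"\x00" + body).hexdigest()
    store[oid] = (kind, body)  # storing identical content again changes nothing
    return oid

def write_tree(node: dict) -> str:
    """Store a nested dict (name -> bytes or dict) and return its tree ID."""
    body = b""
    for name in sorted(node):  # simplified ordering, good enough for this demo
        child = node[name]
        if isinstance(child, dict):
            mode, oid = "40000", write_tree(child)
        else:
            mode, oid = "100644", put("blob", child)
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return put("tree", body)

v1 = {
    "README.md": b"docs\n",
    "docs": {"guide.md": b"guide\n"},
    "src": {"app.py": b"print(1)\n", "lib": {"util.py": b"x = 1\n"}},
}
root1 = write_tree(v1)
before = set(store)

# Second snapshot: only src/lib/util.py changes.
v2 = {**v1, "src": {**v1["src"], "lib": {"util.py": b"x = 2\n"}}}
root2 = write_tree(v2)
new_objects = set(store) - before

print(len(before), len(new_objects))  # prints: 8 4
```

The four new objects are the changed blob plus the three trees on the path to it (lib, src, and the root). README.md, app.py, and the whole docs directory are reused untouched.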

Two files with byte-for-byte identical content (say, the same empty __init__.py in twenty packages) hash to the same blob ID, so Git stores their contents exactly once. The same deduplication applies across your whole history: a file that survives a hundred commits unchanged exists as a single blob that all hundred snapshots share. Content addressing gives you the simplicity of full snapshots with storage costs close to those of a diff-based system. (Git later compresses objects into packfiles using delta encoding, but that is a storage optimization layered under the object model, not the model itself.)
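The deduplication claim is easy to check numerically. This short sketch assumes SHA-1 and a hypothetical layout of twenty packages named pkg0 through pkg19; it hashes every file and counts the distinct blob IDs:

```python
import hashlib

def blob_id(content: bytes) -> str:
    """SHA-1 blob ID: hash of 'blob <size>', NUL, then the content."""
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

# Hypothetical layout: twenty packages, each with an empty __init__.py.
files = {f"pkg{i}/__init__.py": b"" for i in range(20)}
distinct_ids = {blob_id(content) for content in files.values()}

print(len(files), len(distinct_ids))  # prints: 20 1
```

Twenty paths, one blob. The paths themselves live in twenty different tree objects, each pointing at the same ID.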

The diffs you see in git diff or git log -p are computed on the fly by comparing two snapshots. Git doesn't store them; it derives them when you ask.
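A rough sketch of that first step, deciding which files differ between two snapshots, might look like the following. It assumes each snapshot has already been flattened into a path-to-blob-ID mapping; the function name and the short IDs are made up. Real Git walks the trees instead, can skip any subtree whose ID matches on both sides, and then runs a line-level diff on each changed pair of blobs.

```python
def compare_snapshots(old: dict, new: dict) -> dict:
    """Compare two snapshots given as {path: blob_id} mappings."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same path on both sides but a different blob ID means the content changed.
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }

# Made-up short IDs standing in for real blob hashes.
before = {"README.md": "a906cb", "src/app.py": "11aa22", "src/old.py": "33bb44"}
after = {"README.md": "a906cb", "src/app.py": "55cc66", "src/new.py": "77dd88"}

print(compare_snapshots(before, after))
```

Comparing IDs is enough to know whether a file changed; the file contents are only needed once you want to show what changed.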

Branches and HEAD are just pointers

If commits are immutable objects in a graph, what is a branch? Almost nothing. A branch is a ref: typically a small file under .git/refs/heads/ that contains a single commit hash (Git may also consolidate refs into the single file .git/packed-refs). The branch main is literally a 40-character string naming the latest commit on that line of work.

git commit writes a new commit whose parent is the current one, then updates the branch ref to point at it. That's the whole operation. Creating a branch (git branch feature) just writes a new file with the same hash — which is why it's instant and cheap no matter how large the repo.
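Here is a toy model of those two operations. It is only a sketch: ToyRepo is an invented class, and the commit ID is a simplified hash of the parent and message rather than a real commit object. The point is how little each operation touches:

```python
import hashlib

class ToyRepo:
    def __init__(self):
        self.commits = {}            # commit ID -> (parent ID or None, message)
        self.refs = {"main": None}   # branch name -> commit ID
        self.head = "main"           # name of the current branch

    def commit(self, message: str) -> str:
        parent = self.refs[self.head]
        # Simplified ID; a real commit also hashes the tree, author, and dates.
        oid = hashlib.sha1(f"{parent}|{message}".encode()).hexdigest()
        self.commits[oid] = (parent, message)
        self.refs[self.head] = oid   # advance the branch pointer
        return oid

    def branch(self, name: str) -> None:
        self.refs[name] = self.refs[self.head]  # copy one hash, nothing else

repo = ToyRepo()
first = repo.commit("first")
second = repo.commit("second")
repo.branch("feature")

print(repo.refs["main"] == repo.refs["feature"] == second)  # prints: True
```

Creating the branch added no commits and copied no files. It added one entry holding a hash that already existed.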

HEAD is one more layer of indirection: usually it's a file containing ref: refs/heads/main, meaning "I am on the branch main." When you check out a different branch, Git rewrites HEAD to point at that ref and updates your working files to match its tree. A "detached HEAD" simply means HEAD holds a commit hash directly instead of pointing at a branch.
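Resolving HEAD by hand takes only a few lines. This sketch builds a throwaway directory that mimics the layout instead of reading a real repository. The hash is made up, resolve_head is an invented helper, and it handles only loose ref files, so it would miss refs that Git has moved into packed-refs:

```python
import tempfile
from pathlib import Path

def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit hash, handling only loose ref files."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Attached: HEAD names a branch, and the branch file holds the hash.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    # Detached: HEAD holds the hash itself.
    return head

# Build a fake .git directory with one branch.
git_dir = Path(tempfile.mkdtemp()) / ".git"
(git_dir / "refs" / "heads").mkdir(parents=True)
fake_hash = "1234567890abcdef1234567890abcdef12345678"  # made-up commit ID
(git_dir / "refs" / "heads" / "main").write_text(fake_hash + "\n")

(git_dir / "HEAD").write_text("ref: refs/heads/main\n")  # on branch main
print(resolve_head(git_dir))

(git_dir / "HEAD").write_text(fake_hash + "\n")          # detached HEAD
print(resolve_head(git_dir))
```

Both calls print the same hash. The only difference is whether HEAD reaches it through a branch name or directly.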

Once this clicks, the scary commands stop being scary. git reset moves a branch pointer to a different commit (and, depending on the mode, updates the index and working files to match). git rebase replays commits to create new ones with new hashes, which is why it rewrites history. Nothing reaches into an object and mutates it. Git only ever creates new objects and moves pointers.
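The same toy style can illustrate both commands. This is a sketch under heavy simplification: make_commit and rebase are invented helpers, commit IDs are a hash of parent and message only, and the rebase just re-creates commits on a new parent, whereas real Git reapplies each commit's changes and may stop on conflicts.

```python
import hashlib

commits = {}  # commit ID -> (parent ID or None, message)

def make_commit(parent, message):
    # Simplified ID: a real commit also hashes its tree, author, and dates.
    oid = hashlib.sha1(f"{parent}|{message}".encode()).hexdigest()
    commits[oid] = (parent, message)
    return oid

def rebase(tip, old_base, new_base):
    """Replay the commits after old_base, up to tip, on top of new_base."""
    chain = []
    while tip != old_base:
        parent, message = commits[tip]
        chain.append(message)
        tip = parent
    for message in reversed(chain):
        new_base = make_commit(new_base, message)  # new parent, so a new ID
    return new_base

base = make_commit(None, "base")
main = make_commit(base, "main work")
f1 = make_commit(base, "feature 1")
f2 = make_commit(f1, "feature 2")
refs = {"main": main, "feature": f2}
snapshot = dict(commits)

refs["feature"] = rebase(refs["feature"], base, refs["main"])  # "git rebase main"
refs["main"] = base                                            # "git reset" to base

print(refs["feature"] != f2)                                # True: new commits
print(all(commits[k] == v for k, v in snapshot.items()))    # True: old ones intact
```

The original feature commits are still in the store, unchanged, after the rebase. They are simply no longer reachable from the feature pointer.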