DEV Community

Cover image for Git Internals: Why Your Commits Aren't Actually Diffs
Doogal Simpson
Doogal Simpson

Posted on • Originally published at doogal.dev

Git Internals: Why Your Commits Aren't Actually Diffs

TL;DR: Git is a content-addressable filesystem that stores project states as full snapshots rather than incremental deltas. Every object—blobs, trees, and commits—is identified by a unique SHA-1 hash of its content, creating an immutable chain where any change to a single byte results in entirely new objects.

You see green and red lines in a pull request and assume Git stores diffs. It doesn't. I view Git as a persistent key-value store where the key is a hash and the value is your data. When I commit code, I am not saving a list of changes; I am saving a snapshot of the entire project state at that exact moment.

How does Git store actual file content?

Git ignores filenames and stores raw data as 'blobs' named after their own SHA-1 hashes.

When I create a file called hello.txt with the content "hello world" and add it to a repo, Git hashes that string and creates a blob. If I look inside the .git/objects directory, I can see exactly how this is stored. Git takes the first two characters of the hash to create a directory and uses the remaining 38 characters as the filename. For example, a hash starting with e69de2 would be stored at .git/objects/e6/9de29.... This is the core of content-addressable storage: the address of the data is derived from the data itself. If I change a single character in that file, the hash changes, and Git writes an entirely new blob file.

What role does a tree object play?

A tree object defines the project structure by mapping human-readable filenames to their specific blob or sub-tree hashes.

I think of a tree as a simple directory listing. Each line records file permissions, the object type, the SHA-1 hash, and the filename. This architecture is why I can rename a file without Git needing to copy the actual file data. The blob hash remains identical because the content "hello world" hasn't changed; I have only updated the tree object to point that same hash to a new filename. Because trees are also named after the hash of their content, any change to a filename or a sub-directory hash results in a brand-new tree hash.

Object Data Responsibility Identity Hash Source
Blob Stores raw file bytes The literal file content
Tree Maps names to hashes The directory list content
Commit Links trees to parents The metadata and tree pointer

What happens when a file is modified?

Modifying a single byte triggers a cascade where a new blob, a new tree, and a new commit are all created with unique hashes.

If I update hello.txt from "hello world" to "hello world, how are you?", the system rebuilds the state of the world. Git writes a new blob for the updated string. Because the hash for hello.txt is now different, the tree containing it must be updated, resulting in a new tree hash. Finally, I create a new commit pointing to that new root tree. This commit also stores the hash of its parent commit. This pointer is what creates the chain we call history. Because the parent hash is part of the commit's content, if I change one bit in an old commit, its hash changes, breaking every subsequent hash in the chain. This immutability is why Git history is so mathematically consistent.

FAQ

Does Git's snapshot model waste a lot of disk space?

Git periodically runs a garbage collection process (git gc) that packs objects into compressed files. While it uses delta compression for physical storage, Git maintains the snapshot model at the logical level, ensuring data retrieval is fast and consistent.

How does Git know if a file hasn't changed?

When I run a commit, Git compares the current hash of a file's content to the hash stored in the previous tree. If they match, Git simply reuses the existing hash in the new tree object rather than creating a redundant blob.

Why are commit hashes unique across different machines?

Since the hash is derived from the content—including the tree hash, author, timestamp, and parent hash—the identity is unique to that specific snapshot. This allows developers to work asynchronously without a central server assigning version numbers.

Top comments (0)