I didn't understand Git until I broke it open and rebuilt it from scratch. No libraries, no shortcuts - just SHA-256 hashing, tree structures, and commit graphs.
I built a Git implementation in Go without using any Git libraries. No magic, just content-addressable storage and the object model. It works, it taught me more about Git in a week than years of using it, and here's what I learned.
How I Learned This
This wasn't from a single source. I pieced it together from:
- CodeCrafters - Their "Build Your Own Git" challenge gave me the structure and pushed me to actually implement things
- Random YouTube videos - Honestly, just searching "how git works internally" and watching whatever came up. Some were helpful, most weren't
- "Building Git" book - Wasn't really helpful for what I needed, but it did clarify some object format details
Most of the learning came from trial and error. Breaking things, reading error messages, and debugging for hours.
Why Build This?
I use Git every day but had no idea how it actually works. Just git add, git commit, and hope nothing breaks. I wanted to understand what's really happening under the hood.
The goal wasn't to make something production-ready. I just wanted answers:
- Why are commits so cheap?
- How does Git deduplicate files automatically?
- What the hell is a "tree object"?
- Why is branching fast?
Turns out the best way to understand something is to build it yourself.
What It Does
Go-Git implements the core stuff:
go-git init # Initialize repository
go-git config # Set user identity
go-git add <files...> # Stage files
go-git commit -m "message" # Create commit
go-git log # View history
It handles content-addressable storage, the staging area, tree objects, commit history, and zlib compression. Basically everything Git does to manage your code, minus branches, merges, and remotes.
The Three Main Ideas
1. Content-Addressable Storage
Every object (file, directory, commit) gets stored by its SHA-256 hash (stock Git still defaults to SHA-1, but the idea is identical):
.git/objects/ab/c123def456...
             ↑↑ ↑↑↑↑↑↑↑
             │  └─ Rest of hash (filename)
             └───── First 2 chars (subdirectory)
This is actually genius. Same content = same hash = automatic deduplication. You could store the same README.md across 100 commits and it only takes up disk space once.
Your file's hash IS its address. No need for a separate indexing system.
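Here's a minimal sketch of that idea in Go, not the actual go-git code. The function names `hashObject` and `objectPath` are made up for illustration, and hashing the `<type> <size>\0` header together with the content is an assumption based on the object format described later in this post.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// hashObject returns the hex SHA-256 of "<type> <size>\0<content>".
// Including the header in the hashed bytes is an assumption of this sketch.
func hashObject(objType string, content []byte) string {
	header := fmt.Sprintf("%s %d\x00", objType, len(content))
	sum := sha256.Sum256(append([]byte(header), content...))
	return hex.EncodeToString(sum[:])
}

// objectPath maps a hex hash to <gitDir>/objects/<first 2 chars>/<rest>.
func objectPath(gitDir, hash string) string {
	return filepath.Join(gitDir, "objects", hash[:2], hash[2:])
}

func main() {
	a := hashObject("blob", []byte("Hello, World!"))
	b := hashObject("blob", []byte("Hello, World!")) // same bytes from a "different" file
	fmt.Println("same address:", a == b)
	fmt.Println(objectPath(".git", a))
}
```

Because the path is derived purely from the bytes, writing the same content a second time lands on a file that already exists. That is all the deduplication there is.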
2. The Three Trees
Git tracks files through three layers:
Working Directory → Staging Area → Repository
(your files)        (.git/index)   (.git/objects)

- go-git add moves files from the working directory → staging area
- go-git commit snapshots the staging area → repository
In my implementation, the staging area is literally just a text file mapping paths to hashes:
100644 abc123... README.md
100644 def456... src/main.go
When you commit, go-git turns these entries into tree objects, one per directory, and the commit points at the root tree.
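To make the index format concrete, here's a rough sketch of reading it back into memory. The `IndexEntry` type and `parseIndex` function are hypothetical names, and the `<mode> <hash> <path>` layout with single spaces is assumed from the two sample lines above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// IndexEntry is a hypothetical in-memory form of one index line.
type IndexEntry struct {
	Mode string
	Hash string
	Path string
}

// parseIndex reads "<mode> <hash> <path>" lines. Splitting into at most
// three fields keeps paths that contain spaces intact.
func parseIndex(text string) ([]IndexEntry, error) {
	var entries []IndexEntry
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank lines
		}
		parts := strings.SplitN(line, " ", 3)
		if len(parts) != 3 {
			return nil, fmt.Errorf("malformed index line: %q", line)
		}
		entries = append(entries, IndexEntry{Mode: parts[0], Hash: parts[1], Path: parts[2]})
	}
	return entries, sc.Err()
}

func main() {
	idx := "100644 abc123 README.md\n100644 def456 src/main.go\n"
	entries, err := parseIndex(idx)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s -> %s\n", e.Path, e.Hash)
	}
}
```

A plain-text index like this is easy to debug with cat, which is a big part of why it works well for a learning project.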
3. Tree Objects (The Hard Part)
Files get stored as blobs. Directories get stored as trees.
Simple project structure:
project/
  README.md
  src/
    main.go
    lib/
      helper.go
Git stores it like this:
Commit (pqr678)
  ↓
Root Tree (def456)
  ├─ blob: README.md (hash: abc123)
  └─ tree: src/ (hash: stu901)
       ├─ blob: main.go (hash: ghi789)
       └─ tree: lib/ (hash: jkl012)
            └─ blob: helper.go (hash: mno345)
Here's what finally clicked for me: Trees don't contain their children - they just reference them by hash. That's why Git is fast. If a directory doesn't change, same hash, just reuse the tree. No need to re-store anything.
The tricky part: You have to build trees bottom-up (deepest first) because parent trees need their children's hashes.
You can't hash src/ until you know the hash of src/lib/. Can't hash the root tree until you know the hash of src/. The order matters, period.
This took me over 5 hours to get right.
What Was Hard
Tree Building Order
First try: Build trees top-down, starting from root. Immediately failed - you don't have the child tree hashes yet.
Fix: Sort directories by depth, build the deepest ones first:
// Deepest paths first, so children are hashed before their parents.
sort.Slice(dirs, func(i, j int) bool {
    return strings.Count(dirs[i], "/") > strings.Count(dirs[j], "/")
})
Then for each directory, check if any trees you've already built are its children and add them.
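Here's one way the whole bottom-up pass could look, as a self-contained sketch rather than the real go-git code. Several things are assumptions for illustration: the names `buildTrees`, `entry`, `depth`, and `hashHex`; the in-memory `objects` map standing in for .git/objects; entries sorted by plain name for determinism; and relative, slash-separated paths. The blob hashes in `main` are placeholders computed without the blob header.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
	"sort"
	"strings"
)

type entry struct{ mode, name, hash string }

// depth ranks directories: root "." is 0, "src" is 1, "src/lib" is 2.
func depth(dir string) int {
	if dir == "." {
		return 0
	}
	return strings.Count(dir, "/") + 1
}

// hashHex returns the hex SHA-256 of data.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// buildTrees takes staged files (relative slash-separated path -> hex blob
// hash) and returns the root tree hash plus every tree object by hash.
func buildTrees(files map[string]string) (string, map[string][]byte, error) {
	children := map[string][]entry{} // directory -> its direct entries
	seen := map[string]bool{".": true}
	for p, h := range files {
		dir := path.Dir(p)
		children[dir] = append(children[dir], entry{"100644", path.Base(p), h})
		for d := dir; d != "." && d != "/"; d = path.Dir(d) {
			seen[d] = true
		}
	}
	dirs := make([]string, 0, len(seen))
	for d := range seen {
		dirs = append(dirs, d)
	}
	// Deepest first, so every subtree is hashed before its parent needs it.
	sort.Slice(dirs, func(i, j int) bool { return depth(dirs[i]) > depth(dirs[j]) })

	objects := map[string][]byte{}
	root := ""
	for _, d := range dirs {
		es := children[d]
		sort.Slice(es, func(i, j int) bool { return es[i].name < es[j].name })
		var body []byte
		for _, e := range es {
			raw, err := hex.DecodeString(e.hash) // hex string -> raw bytes
			if err != nil {
				return "", nil, fmt.Errorf("bad hash for %s: %w", e.name, err)
			}
			body = append(body, e.mode+" "+e.name+"\x00"...)
			body = append(body, raw...)
		}
		obj := append([]byte(fmt.Sprintf("tree %d\x00", len(body))), body...)
		h := hashHex(obj)
		objects[h] = obj
		if d == "." {
			root = h
			continue
		}
		// Hand this tree's hash up to its parent, which is built later.
		parent := path.Dir(d)
		children[parent] = append(children[parent], entry{"040000", path.Base(d), h})
	}
	return root, objects, nil
}

func main() {
	files := map[string]string{
		"README.md":         hashHex([]byte("readme")),
		"src/main.go":       hashHex([]byte("package main")),
		"src/lib/helper.go": hashHex([]byte("package lib")),
	}
	root, objects, err := buildTrees(files)
	if err != nil {
		panic(err)
	}
	fmt.Println("root tree:", root)
	fmt.Println("tree objects:", len(objects)) // root, src, src/lib
}
```

Note that the depth helper puts the root strictly below every other directory. Counting slashes alone would rank "src" and the root equally, so the root could be hashed before src/ had handed its hash up.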
Binary vs Hex Encoding
Tree objects store hashes as 32 binary bytes, not 64-character hex strings.
My bug:
content += entry.BlobHash // Wrong! This is a 64-char hex string
The fix:
hashBytes, _ := hex.DecodeString(entry.BlobHash)
content = append(content, hashBytes...) // 32 binary bytes
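This made every hash field 64 bytes instead of 32, so the hash portion of each tree entry was twice as big as it should've been. Spent an hour debugging because it was subtle: the objects were still readable, but tree traversal, which expects 32 raw bytes after each name, was completely broken.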
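To see why the hex version breaks traversal, here's a small sketch that encodes entries and parses them back. `encodeEntry`, `parseTreeBody`, and `TreeEntry` are illustrative names, not the real go-git functions, and the parser assumes the `<mode> <name>\0<32 raw bytes>` entry layout described in this post.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

const hashLen = 32 // raw SHA-256 digest size in bytes

type TreeEntry struct{ Mode, Name, Hash string }

// encodeEntry writes "<mode> <name>\0" followed by the raw hash bytes.
func encodeEntry(mode, name, hexHash string) ([]byte, error) {
	raw, err := hex.DecodeString(hexHash)
	if err != nil {
		return nil, err
	}
	if len(raw) != hashLen {
		return nil, errors.New("hash must decode to 32 bytes")
	}
	out := append([]byte(mode+" "+name), 0)
	return append(out, raw...), nil
}

// parseTreeBody decodes the bytes that follow the "tree <size>\0" header.
func parseTreeBody(body []byte) ([]TreeEntry, error) {
	var out []TreeEntry
	for len(body) > 0 {
		nul := bytes.IndexByte(body, 0)
		if nul < 0 || len(body) < nul+1+hashLen {
			return nil, errors.New("truncated tree entry")
		}
		mode, name, ok := bytes.Cut(body[:nul], []byte(" "))
		if !ok {
			return nil, errors.New("missing space between mode and name")
		}
		raw := body[nul+1 : nul+1+hashLen] // exactly 32 bytes, no delimiter
		out = append(out, TreeEntry{string(mode), string(name), hex.EncodeToString(raw)})
		body = body[nul+1+hashLen:]
	}
	return out, nil
}

func main() {
	hexHash := strings.Repeat("ab", hashLen) // placeholder 64-char hex hash
	good, err := encodeEntry("100644", "README.md", hexHash)
	if err != nil {
		panic(err)
	}
	entries, err := parseTreeBody(good)
	fmt.Println(entries, err)

	// The bug: appending the hex string itself doubles the hash field.
	bad := append([]byte("100644 README.md\x00"), hexHash...)
	_, err = parseTreeBody(bad)
	fmt.Println("buggy encoding:", err)
}
```

The parser has no delimiter after the hash. It just takes the next 32 bytes, so a 64-byte hex field leaves 32 stray bytes that get misread as the start of the next entry.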
Excluding .git/ When Staging
When you do go-git add ., you don't want to accidentally add .git/objects/ to the index.
First attempt: Check if the path contains .git. Problem: this also excluded files like my.git.file.
The right way:
if d.Name() == ".git" && d.IsDir() {
    return filepath.SkipDir
}
filepath.SkipDir during directory traversal is the way to go.
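For context, here's how that check might sit inside a full `filepath.WalkDir` callback. This is a sketch under assumptions: `listStageable` and `makeDemoTree` are hypothetical helpers, and the demo builds a throwaway directory so it can run anywhere.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// listStageable returns every regular file under root (as slash-separated
// relative paths), skipping any directory named exactly ".git".
func listStageable(root string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() && d.Name() == ".git" {
			return filepath.SkipDir // prune the whole subtree in one step
		}
		if d.Type().IsRegular() {
			rel, err := filepath.Rel(root, p)
			if err != nil {
				return err
			}
			files = append(files, filepath.ToSlash(rel))
		}
		return nil
	})
	return files, err
}

// makeDemoTree creates a temporary directory with a .git folder, a file
// whose name merely contains ".git", and a normal source file.
func makeDemoTree() (string, error) {
	root, err := os.MkdirTemp("", "walk-demo")
	if err != nil {
		return "", err
	}
	for _, p := range []string{".git/objects/ab/cdef", "my.git.file", "src/main.go"} {
		full := filepath.Join(root, filepath.FromSlash(p))
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(full, []byte("x"), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := makeDemoTree()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	files, err := listStageable(root)
	if err != nil {
		panic(err)
	}
	fmt.Println(files) // [my.git.file src/main.go]
}
```

Matching on the entry's own name instead of searching the full path is what keeps my.git.file in the list, and returning SkipDir means nothing under .git/ is ever visited.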
How Objects Work
Every object follows this format:
<type> <size>\0<content>
The hash is computed over these uncompressed bytes. The object then gets compressed with zlib and stored at .git/objects/<hash[:2]>/<hash[2:]>.
Blob:
blob 13\0Hello, World!
Hash it → a0b1c2d3... → Store at .git/objects/a0/b1c2d3...
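Here's a rough sketch of that write path and the matching read path, using Go's standard `compress/zlib` and `crypto/sha256`. The names `writeObject`, `readObject`, and `roundTrip` are made up for this example, and hashing before compressing is an assumption that mirrors how stock Git behaves.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// writeObject stores "<type> <size>\0<content>" zlib-compressed under its hash.
func writeObject(gitDir, objType string, content []byte) (string, error) {
	obj := append([]byte(fmt.Sprintf("%s %d\x00", objType, len(content))), content...)
	sum := sha256.Sum256(obj) // hash the uncompressed bytes
	hash := hex.EncodeToString(sum[:])

	file := filepath.Join(gitDir, "objects", hash[:2], hash[2:])
	if _, err := os.Stat(file); err == nil {
		return hash, nil // already stored: identical content, identical address
	}
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(obj); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil { // Close flushes the compressed stream
		return "", err
	}
	if err := os.MkdirAll(filepath.Dir(file), 0o755); err != nil {
		return "", err
	}
	return hash, os.WriteFile(file, buf.Bytes(), 0o644)
}

// readObject decompresses an object and splits the header from the content.
func readObject(gitDir, hash string) (string, []byte, error) {
	f, err := os.Open(filepath.Join(gitDir, "objects", hash[:2], hash[2:]))
	if err != nil {
		return "", nil, err
	}
	defer f.Close()
	zr, err := zlib.NewReader(f)
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	obj, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	header, content, ok := bytes.Cut(obj, []byte{0})
	if !ok {
		return "", nil, errors.New("object has no NUL after header")
	}
	objType, _, _ := bytes.Cut(header, []byte(" "))
	return string(objType), content, nil
}

// roundTrip writes content as a blob into a temporary directory and reads it back.
func roundTrip(content []byte) (string, []byte, error) {
	gitDir, err := os.MkdirTemp("", "objects-demo")
	if err != nil {
		return "", nil, err
	}
	defer os.RemoveAll(gitDir)
	hash, err := writeObject(gitDir, "blob", content)
	if err != nil {
		return "", nil, err
	}
	return readObject(gitDir, hash)
}

func main() {
	objType, content, err := roundTrip([]byte("Hello, World!"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %s\n", objType, content) // blob: Hello, World!
}
```

Hashing before compressing means the address depends only on the content, not on zlib settings or versions.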
Tree:
tree 92\0100644 README.md\0<32-byte-hash>040000 src\0<32-byte-hash>
Modes:
- 100644 = regular file
- 040000 = directory (tree object)
Commit:
commit <size>\0
tree abc123...
parent 789xyz...
author Uthman <email> timestamp
committer Uthman <email> timestamp

Initial commit
Commits form a directed acyclic graph. go-git log just walks backwards through this chain from HEAD.
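A log walk can be sketched in a few lines. This example is hypothetical in several ways: `parseCommit`, `walkLog`, and the `load` callback are illustrative names, the commit store is just a map with fake hashes like "c1", and it assumes the header lines are separated from the message by a blank line (as in Git's commit format) and that each commit has at most one parent.

```go
package main

import (
	"fmt"
	"strings"
)

// Commit holds the fields this sketch cares about.
type Commit struct {
	Tree, Parent, Message string
}

// parseCommit reads the text that follows the "commit <size>\0" header.
func parseCommit(body string) Commit {
	var c Commit
	headers, msg, _ := strings.Cut(body, "\n\n") // blank line ends the headers
	c.Message = strings.TrimSpace(msg)
	for _, line := range strings.Split(headers, "\n") {
		key, val, _ := strings.Cut(line, " ")
		switch key {
		case "tree":
			c.Tree = val
		case "parent":
			c.Parent = val
		}
	}
	return c
}

// walkLog follows parent links from head until a commit has no parent.
// load is a stand-in for reading and decompressing an object from disk.
func walkLog(head string, load func(hash string) (string, bool)) []string {
	var msgs []string
	for h := head; h != ""; {
		body, ok := load(h)
		if !ok {
			break // missing object: stop instead of looping forever
		}
		c := parseCommit(body)
		msgs = append(msgs, c.Message)
		h = c.Parent
	}
	return msgs
}

func main() {
	store := map[string]string{
		"c1": "tree t1\nauthor A <a@example.com> 0\n\nInitial commit\n",
		"c2": "tree t2\nparent c1\nauthor A <a@example.com> 0\n\nAdd README\n",
	}
	load := func(h string) (string, bool) { b, ok := store[h]; return b, ok }
	for _, msg := range walkLog("c2", load) {
		fmt.Println(msg)
	}
}
```

With no merges, history is a simple linked list, so the walk is a single loop. Supporting multiple parent lines would turn this into a proper graph traversal.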
What I Actually Learned
Content-addressable storage is elegant. Hash = address. This one idea enables deduplication, integrity checking, and efficient storage.
Trees are graphs, not nested structures. A tree object doesn't "contain" subtrees - it references them by hash. This indirection is what makes Git efficient. Multiple commits can share the same tree if a directory didn't change.
Building bottom-up is necessary. You can't hash a parent without knowing its children's hashes. The order matters fundamentally.
Compression matters. Without zlib, .git/objects/ would be 3-4x larger. Git uses compression everywhere for a reason.
Binary formats are tricky. Working with \0 null bytes and binary hash data requires careful handling. Text formats would be easier but less efficient.
What's Missing
This is a learning project, not production software:
- No branches (only main exists)
- No merge operations
- No diff/status commands
- No remote operations (push/pull/fetch)
- No .gitignore support
- No packed objects (each object is a separate file)
- Plain text index (real Git uses binary format)
These limitations exist because this project focuses on Git's core: the object model, staging, and commits. Adding branches/merging/remotes would be another 2-3x the code and shift focus from fundamentals to features.
Try It Yourself
git clone https://github.com/codetesla51/go-git.git
cd go-git
./install.sh
# Or build manually:
go build -buildvcs=false -o go-git
ln -s "$(pwd)/go-git" ~/.local/bin/go-git
Then:
mkdir my-project
cd my-project
go-git init
go-git config
echo "Hello World" > README.md
go-git add README.md
go-git commit -m "Initial commit"
go-git log
Want to really learn? Clone it, break something on purpose (change the hash function to MD5, remove zlib compression), and see what fails. Watch how Git's assumptions about immutability and content-addressing cascade through the system. That's how you really understand it.
Final Thoughts
Building this taught me more about Git in a week than years of using it did. I now understand why commits are cheap (just pointers to trees), how deduplication works (content-addressable storage), and why branching is fast (just moving a pointer).
If you want to truly understand a tool, build it yourself.
Built with Go, no Git libraries used. All of the hashing, compression, and object storage code is a custom implementation.
Check out more of my work at devuthman.vercel.app | GitHub